Time Series

# What are the differences between time series models (e.g. ARIMA) and the Wiener process?

We use the Wiener process for modeling stock prices. What are the differences between this model and time series models when the observations are stock prices? Take the time series model to be an AR, MA, or ARIMA model.

The main difference is that the Wiener process (or Brownian motion) is indexed by R_+, that is, the nonnegative real half-line: it is defined at every instant. Time series models are discrete, that is, they are indexed by N, the natural numbers. For instance, the random walk with Bernoulli jumps is observed at t = 0, 1, 2, 3, ... However, after a suitable scaling of time and space, Brownian motion can be shown to be the limit of the random walk, see for instance
In finance, the use of continuous-time models is justified by the fact that high-frequency trading makes time almost continuous. More importantly, these models are tailored to log-returns, which aggregate very well over time: the log-return over a period is the sum of the log-returns over its sub-periods. Most results of option pricing theory are stated in continuous time, but they often have discrete-time counterparts.
In practice, one can only observe discrete points. Usually, these are treated as a finite sample of a process that takes a continuum of values, not all of which can be observed. But the same points can also be treated as a discrete process in their own right.
In the end, it all depends on what you want to do. Option pricing usually relies on continuous-time processes, whereas for prediction, for instance, very complex econometric (discrete) models are probably very competitive. For portfolio optimization, both approaches exist.
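To make the scaling argument concrete, here is a minimal simulation sketch. It assumes symmetric +/-1 jumps, and the function name `scaled_random_walk` and the path/step counts are illustrative choices, not something from the answer above. It rescales time by 1/n and space by 1/sqrt(n), then checks that the variance at time t is close to t, as it would be for a Wiener process.

```python
import numpy as np

def scaled_random_walk(n_steps, n_paths, rng):
    """Simulate +/-1 random walks rescaled to live on the time interval [0, 1].

    Time is scaled by 1/n_steps and space by 1/sqrt(n_steps), so the value
    at time k / n_steps is S_k / sqrt(n_steps).
    """
    jumps = rng.choice([-1.0, 1.0], size=(n_paths, n_steps))
    walks = np.cumsum(jumps, axis=1) / np.sqrt(n_steps)
    # Prepend the starting value 0 at time 0.
    return np.hstack([np.zeros((n_paths, 1)), walks])

rng = np.random.default_rng(0)
paths = scaled_random_walk(n_steps=400, n_paths=4000, rng=rng)

# For a Wiener process Var[W(t)] = t; the rescaled walk should be close.
var_half = paths[:, 200].var()   # column 200 corresponds to t = 0.5
var_one = paths[:, -1].var()     # last column corresponds to t = 1
print(f"Var at t=0.5: {var_half:.3f}, Var at t=1: {var_one:.3f}")
```

This only checks one property (the variance growth); it is an illustration of the limit, not a proof of it.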

Guy from South Africa: really good notes.

Finally, compare your results to, say, (Theta+ARIMA+ETS)/3 - this may be a humbling experience :-)
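A rough sketch of what that baseline amounts to, assuming you already have point forecasts from the three models over the same horizon. The helper names `combine_forecasts` and `mean_absolute_error` are made up for illustration, and the numbers are placeholders, not real model output.

```python
import numpy as np

def combine_forecasts(*forecasts):
    """Equal-weight average of several forecasts over the same horizon."""
    stacked = np.vstack([np.asarray(f, dtype=float) for f in forecasts])
    return stacked.mean(axis=0)

def mean_absolute_error(actual, predicted):
    """Average absolute forecast error, for comparing models on a holdout set."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))

# Placeholder forecasts purely for illustration; in practice these would
# come from fitted Theta, ARIMA, and ETS models.
theta_fc = [10.0, 11.0, 12.0]
arima_fc = [11.0, 11.5, 12.5]
ets_fc = [9.0, 10.5, 13.0]

baseline = combine_forecasts(theta_fc, arima_fc, ets_fc)
print(baseline)
```

You would then compute the same error metric for your own model and for `baseline` on held-out data and see which one wins.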

Hey, I recently researched time series classification for a personal project. I'm going to dump some of my notes here, with the caveat that I'm not by any means an expert in the area.
First, the most helpful summary paper I've found in this area is The Great Time Series

Here's another awesome resource with open source implementations and datasets to play with: http://timeseriesclassification.com/index.php
PREPROCESSING
• Z-normalization transforms time series values so that the mean is ~0, and the standard deviation is ~1. See https://jmotif.github.io/sax-vsm_site/morea/algorithm/znorm.html "in some cases, this preprocessing is not recommended as it introduces biases. For example, if the signal variance is significantly small, z-normalization will simply overamplify the noise to the unit of amplitude."
• Piecewise Aggregate Approximation (PAA) - It's kind of like a histogram: you add up all the values in each window, then scale the sum down by the window length, i.e. you average them. Think "avg pooling". There's a distance measure defined on the reduced representation that lower-bounds the Euclidean distance on the original series, so it's cheap to compute. SAX (see below) uses PAA as the first step in transforming the time series into a string of symbols.
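The two preprocessing steps above, plus the SAX discretization that builds on them, can be sketched in a few lines of NumPy. This is a minimal sketch under some assumptions of mine: the `eps` guard for near-constant signals is one possible way to handle the small-variance caveat quoted above, the alphabet is fixed at 4 symbols, and the function names are illustrative.

```python
import numpy as np

def znorm(series, eps=1e-8):
    """Shift to mean 0 and scale to standard deviation 1."""
    series = np.asarray(series, dtype=float)
    std = series.std()
    if std < eps:
        # Near-constant signal: only center it, to avoid amplifying noise.
        return series - series.mean()
    return (series - series.mean()) / std

def paa(series, n_segments):
    """Replace each of n_segments windows by the mean of its values."""
    series = np.asarray(series, dtype=float)
    # np.array_split tolerates lengths that are not a multiple of n_segments.
    return np.array([seg.mean() for seg in np.array_split(series, n_segments)])

# Breakpoints cutting a standard normal into 4 equiprobable regions.
SAX_BREAKPOINTS_4 = np.array([-0.6745, 0.0, 0.6745])

def sax_word(series, n_segments, alphabet="abcd"):
    """z-normalize, reduce with PAA, then map each segment mean to a letter."""
    reduced = paa(znorm(series), n_segments)
    symbols = np.searchsorted(SAX_BREAKPOINTS_4, reduced)
    return "".join(alphabet[i] for i in symbols)

step = [0, 0, 0, 0, 10, 10, 10, 10]
print(paa(znorm(step), 2))   # segment means of the normalized step: [-1.  1.]
print(sax_word(step, 2))     # "ad"
```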
SUPERVISED ML
• Some neural models are already well-suited to sequential input out of the box, like RNNs/LSTMs, or CNNs (with 1-D convolutions over the time axis)
INSTANCE-BASED CLASSIFICATION
• Dynamic time warping (DTW) - this is often used as a baseline, because it's very simple and apparently performs well. Uses dynamic programming to align the two sequences according to their closest fit. It's hella slow (quadratic in the series length), but there are ways to speed it up like FastDTW and the LB_Keogh lower bound. Con: classification accuracy apparently degrades with long time series, relatively short features of interest, and moderate noise
• Weighted DTW - adds a multiplicative weight penalty based on the warping distance between points in the warping path
• Time Warp Edit distance (TWE) - an elastic distance metric that gives you control via a stiffness parameter, ν (nu). Stiffness enforces a multiplicative penalty on the distance between matched points in a manner similar to WDTW.
• Move-Split-Merge (MSM) - Kind of like string edit distance, but for parts of a time series.
• Derivative DTW (DDTW) - a weighted combination of distances on the raw series and on its first-order differences, used for nearest-neighbor (NN) classification with full-window DTW.
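For reference, the basic DTW recurrence that all of the variants above build on can be sketched as follows. This is a plain, unoptimized version, assuming squared pointwise differences as the local cost and an optional Sakoe-Chiba style `window` (a band around the diagonal) as the only constraint. The weighting, stiffness, and derivative tweaks are not included.

```python
import numpy as np

def dtw_distance(a, b, window=None):
    """DTW distance between 1-D sequences a and b via dynamic programming."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # A band narrower than |n - m| could never reach the final cell.
    w = max(n, m) if window is None else max(window, abs(n - m))
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(np.sqrt(cost[n, m]))

print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))         # 0.0: warping absorbs the repeat
print(dtw_distance([0, 1, 0], [1, 0, 0]))            # 1.0 with free warping
print(dtw_distance([0, 1, 0], [1, 0, 0], window=0))  # sqrt(2): same as Euclidean
```

With `window=0` and equal-length inputs the alignment is forced onto the diagonal, which is why the last call reduces to the Euclidean distance. A 1-NN classifier on top of this just returns the label of the training series with the smallest `dtw_distance` to the query.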
SVM + STRING KERNELS
Most of these methods were originally applied to gene and protein sequence classification, but they should generalize.
• k-spectrum kernel - The kernel essentially asks "how many subsequences do they have in common?" The vector space is the set of all possible k-mers (contiguous subsequences of length k), and each value is 1 if the sequence contains the given subsequence, otherwise 0 (the standard formulation uses the number of occurrences instead of a 0/1 indicator). The kernel function compares two examples by taking the inner product of their feature vectors.
• Mismatch kernel - Similar to the k-spectrum, but allow for approximate matches by looking for subsequences within a local edit distance neighborhood.
• Spatial representation kernel - Similar to k-spectrum, but matches don't have to be contiguous, so ABCD matches with ABZZCD.
• Fisher kernel - I had a harder time following this one, but I think it trains an HMM on the positive class, then computes a feature vector for each example by taking the gradient of the log-likelihood with respect to the HMM's parameters at that example, and trains a classic SVM in this new feature space.
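The k-spectrum kernel is simple enough to sketch directly. The version below works on strings, so a real-valued time series would first have to be discretized into symbols (for instance with SAX; that pairing is my assumption, not something the kernel requires). The `binary` flag is a made-up parameter name that switches between the 0/1 description above and occurrence counts.

```python
from collections import Counter

def kmer_counts(sequence, k):
    """Count every contiguous length-k substring of the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def spectrum_kernel(x, y, k, binary=False):
    """Inner product of the k-mer feature vectors of x and y.

    Only k-mers present in both sequences contribute, so there is no need
    to materialize the full vector over all possible k-mers.
    """
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    shared = cx.keys() & cy.keys()
    if binary:
        # 0/1 features: the kernel is the number of shared distinct k-mers.
        return len(shared)
    return sum(cx[kmer] * cy[kmer] for kmer in shared)

print(spectrum_kernel("abab", "abba", k=2))               # 3
print(spectrum_kernel("abab", "abba", k=2, binary=True))  # 2
```

A kernel like this can be handed to an SVM implementation that accepts a precomputed Gram matrix.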
SHAPELETS
Informally, shapelets are time series subsequences which are in some sense maximally representative of a class. You can use the distance to the shapelet, rather than the distance to the nearest neighbor, to classify objects. Shapelets are local features, so they're robust to noise in the rest of the instance. They're also phase-invariant: the location of a shapelet has no bearing on the classification.
Basically, a decision tree is trained, where each split point of the tree is a shapelet. You slide a window across training examples, looking for shapelets (subsequences) that split the dataset in such a way that maximizes information gain.
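Here is a minimal sketch of the two ingredients of that search: the distance from a series to a candidate shapelet (the best match over all sliding windows), and the information gain of the best distance threshold. It assumes plain Euclidean distance without per-window normalization, and the function names are illustrative.

```python
import numpy as np

def shapelet_distance(series, shapelet):
    """Smallest Euclidean distance between the shapelet and any window."""
    series = np.asarray(series, dtype=float)
    shapelet = np.asarray(shapelet, dtype=float)
    windows = np.lib.stride_tricks.sliding_window_view(series, len(shapelet))
    return float(np.sqrt(((windows - shapelet) ** 2).sum(axis=1).min()))

def entropy(labels):
    """Shannon entropy (in bits) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_split(distances, labels):
    """Return (information gain, threshold) of the best distance cut-off."""
    distances = np.asarray(distances, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(distances)
    d, y = distances[order], labels[order]
    base = entropy(y)
    best_gain, best_threshold = 0.0, None
    for i in range(1, len(d)):
        if d[i] == d[i - 1]:
            continue  # no threshold can separate equal distances
        left, right = y[:i], y[i:]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if gain > best_gain:
            best_gain, best_threshold = gain, (d[i] + d[i - 1]) / 2
    return best_gain, best_threshold

# Toy data: class 1 contains a bump, class 0 is flat.
dataset = [[0, 0, 0, 1, 0, 0], [0, 1, 0, 0, 0, 0], [0] * 6, [0] * 6]
labels = [1, 1, 0, 0]
candidate = [0, 1, 0]
dists = [shapelet_distance(s, candidate) for s in dataset]
print(dists, best_split(dists, labels))
```

A full shapelet tree would repeat this for every candidate subsequence, keep the one with the highest gain as the split, and recurse on the two halves.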
• Logical shapelets - multiple shapelets are combined in logic expressions
• Fast shapelets - Instead of a full enumerative search at each node, the fast shapelets algorithm discretises and approximates the shapelets using a dictionary of SAX words.
• Shapelet transform - separates the shapelet discovery from the classifier by finding the top k shapelets on a single run (in contrast to the decision tree, which searches for the best shapelet at each node). The shapelets are used to transform the data, where each attribute in the new dataset represents the distance of a series to one of the shapelets. Then you can train a new model on top of this dataset.
• Learned shapelets (LS) - adopts a heuristic gradient descent shapelet search procedure rather than enumeration. LS finds k shapelets that, unlike those of fast shapelets (FS) and the shapelet transform (ST), are not restricted to being subseries of the training data.
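The shapelet transform step in particular is easy to picture in code: given k shapelets (however they were found), every series becomes a row of k distances. This is a minimal sketch assuming plain Euclidean sliding-window distance; shapelet discovery itself is left out, and the shapelets in the usage example are hand-picked placeholders.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def shapelet_transform(dataset, shapelets):
    """Map each series to its vector of distances to the given shapelets."""
    features = np.empty((len(dataset), len(shapelets)))
    for i, series in enumerate(dataset):
        series = np.asarray(series, dtype=float)
        for j, shapelet in enumerate(shapelets):
            shapelet = np.asarray(shapelet, dtype=float)
            windows = sliding_window_view(series, len(shapelet))
            # Distance to a shapelet = distance of its best-matching window.
            features[i, j] = np.sqrt(((windows - shapelet) ** 2).sum(axis=1).min())
    return features

dataset = [[0, 0, 1, 0, 0], [0, 0, 0, 0, 0]]
shapelets = [[0, 1, 0], [1, 1]]  # hand-picked placeholders, not discovered
features = shapelet_transform(dataset, shapelets)
print(features)
```

The resulting `features` matrix has one row per series and one column per shapelet, so any off-the-shelf tabular classifier can be trained on it.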