Machine Learning in Finance: Some Models and Examples

Letian Wang
10 min read · Aug 31, 2020


I decided to write a story discussing some of the machine learning practices in finance that I see online. There are many articles and books on this topic. This post is different in that the concepts described here may not be completely correct or mathematically tight: it is just my current understanding of machine learning and data science, and I'm prepared to come back and update some concepts based on feedback and new developments.

Before machine learning, people used to tell computers exactly what to do, for example: if the price breaks an n-day consolidation range, buy the breakout. With the advent of big data and artificial intelligence, machines are now expected to learn the trading rules by themselves.

Machine learning can roughly be categorized into supervised learning, unsupervised learning, and reinforcement learning, while Deep Learning can fall into any one of them.

Here are some general concepts before diving into supervised learning.

  1. Training set and test set. An 80%/20% split, for example, is used to fit a model on the training set and then evaluate how it performs on the test set. Sometimes a validation set is added to tune hyperparameters. This practice can be expanded into k-fold cross-validation; for time series, into walk-forward cross-validation and combinatorial cross-validation (a walk-forward sketch follows this list).
  2. Bias vs variance. If the data generator is quadratic with randomness, then a linear model will underfit the training data and is said to have high bias. On the other hand, a cubic curve will overfit the training data, having lower bias but higher variance, which means the model adapts too much to past experience and doesn't generalize well into the future (a small under/overfitting sketch also follows this list). This is closely related to Type I/Type II errors, or false positives/false negatives. In my opinion, it is better not to act on a false negative than to watch an overfit, high-variance strategy trigger on a trading signal that later turns out to be a false positive.
  3. Martingale vs markovian. They measure different things. The martingale property is commonly seen in risk-neutral pricing and tells whether a game is fair, while the markovian property is used in dynamic programming and the Markov Decision Process (MDP) in reinforcement learning; it assumes the current state summarizes all relevant information. If you repeatedly bet on a biased coin, your cumulative winnings are markovian but not a martingale. If you flip a biased coin twice and get paid when the two flips land on the same side, your expected payoff is a martingale but not markovian (although it can be redesigned as markovian by including both flips in the state).
  4. Entropy and cross-entropy. Entropy measures the uncertainty of a probability distribution, or how evenly likely its outcomes are. If you toss a coin, the outcome follows a Bernoulli distribution. The outcome is most uncertain when the coin is fair; otherwise you have a better chance by betting on the biased side. Therefore entropy peaks at 0.5, as shown by the concave binary-entropy curve on Wikipedia. Cross-entropy measures similarity not with itself but with another distribution, so two coins have the smallest cross-entropy when the second coin is exactly as biased as the first. Kullback-Leibler (KL) divergence does a similar job: it is defined as cross-entropy minus entropy (notice the negative sign in the definitions of entropy and cross-entropy). When the entropy of the first coin is given, minimizing KL divergence is equivalent to finding a second coin that minimizes cross-entropy with the first one, or essentially finding a coin that is as biased as the first. Last but not least, if we think of the N-sample training set as the first coin and the model prediction as the second, then maximizing log-likelihood is the same as bringing the prediction as close to the training set as possible, in other words minimizing cross-entropy (a numeric sketch follows this list). This is the logic behind the loss functions of logistic regression and softmax regression.
  5. Latent variable and state-space model. The underlying markovian process can be fully or partially observed. The latent state can be an exchange order book while the observation is the NBBO or a 200ms snapshot; or the latent state is an economy in expansion or contraction while the observation is revised quarterly GDP or monthly jobless claims. Model-based methods try to estimate the environment transitions at the same time; examples are VAR, HMM, and LSTM, with the help of probabilistic programming packages. Model-free methods learn trading policies from price and volume observations without bothering to know how the underlying market evolves, such as DQN or PPO in reinforcement learning.
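
To make point 1 concrete, here is a minimal walk-forward sketch built on scikit-learn's TimeSeriesSplit. The synthetic returns, the single lagged-return feature, and the number of splits are all illustrative assumptions, not a recommended setup.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
returns = rng.normal(0, 0.01, 500)      # synthetic daily returns
X = returns[:-1].reshape(-1, 1)         # today's return as the only feature
y = returns[1:]                         # tomorrow's return as the label

# walk-forward: each fold trains on an expanding past window
# and tests on the period immediately after it
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    score = model.score(X[test_idx], y[test_idx])
    print(f"fold {fold}: train size {len(train_idx)}, test R^2 {score:.4f}")
```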
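
And a small sketch of the bias/variance point: a quadratic generator fit by an underfitting linear model and an overfitting high-degree polynomial. The degrees and noise level are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 30)
y = 2 * x**2 + rng.normal(0, 0.2, x.size)   # quadratic generator plus noise

for degree in (1, 2, 9):
    coefs = np.polyfit(x, y, degree)
    train_mse = np.mean((np.polyval(coefs, x) - y) ** 2)
    # a fresh draw from the same generator stands in for "the future"
    y_new = 2 * x**2 + rng.normal(0, 0.2, x.size)
    new_mse = np.mean((np.polyval(coefs, x) - y_new) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, new-sample MSE {new_mse:.3f}")
```

Degree 1 underfits (high bias, both errors large); degree 9 chases the noise (low train error, larger error on the fresh sample).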
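
Finally, a numeric sketch of the entropy/cross-entropy/KL relationships for Bernoulli coins; the bias values are made up for illustration.

```python
import numpy as np

def entropy(p):
    """Entropy of a Bernoulli(p) coin, in nats."""
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def cross_entropy(p, q):
    """Cross-entropy H(p, q) between two Bernoulli coins."""
    return -(p * np.log(q) + (1 - p) * np.log(1 - q))

def kl(p, q):
    """KL divergence: cross-entropy minus entropy."""
    return cross_entropy(p, q) - entropy(p)

p = 0.7                                  # the first, biased coin
for q in (0.5, 0.7, 0.9):
    print(f"q={q}: H(p,q)={cross_entropy(p, q):.4f}, KL={kl(p, q):.4f}")
# both cross-entropy and KL are minimized when q matches p = 0.7
```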

Supervised Learning and Unsupervised Learning

Supervised learning studies the relationship between features x and label y; for instance, x can be today's OHLCV prices and y tomorrow's forecast. If you only care whether tomorrow is red or green, it becomes a classification problem. If you also want to know what tomorrow's return looks like, it becomes a regression problem. There is no clear-cut boundary between them: some regression models, e.g. logistic regression, are used for classification (imagine return > 0 being an up day). These models have traditionally been discriminative, i.e., they fit the conditional p(y|x) directly, but Bayesian-style generative models are starting to become popular.

Unsupervised learning does not use the label y. It tries to find patterns in x alone. One application is clustering, which splits observations into groups based on similarities. This is different from classification in supervised learning, where a classification model is trained and ready before observations are assigned to groups; in other words, classification takes two steps. Another application of unsupervised learning is association mining.

  1. Linear regression, a supervised, parametric, discriminative regression model and the mother of all models. Examples are the CAPM and APT factor models. It generally uses returns to avoid spurious correlations between non-stationary price series, unless x and y are cointegrated. It uses MLE to pick the parameters under which the samples are most probable. Extensions include copula-based regression or neural networks for non-linear relationships, GLM for heteroscedasticity, stepwise selection for multicollinearity, and regularization techniques such as L1 Lasso, L2 Ridge, or L1/L2 Elastic Net to prevent overfitting (a sketch follows this list). I tried a three-factor regression here.
  2. Logistic regression and Naive Bayes. They are both supervised classifiers. Logistic regression uses the sigmoid function to transform y into probabilities; the sigmoid is the binary case of the softmax function, and log-loss is the binary case of cross-entropy. Unlike the former, Naive Bayes can be multinomial. It assumes feature independence, so if you have three features and the first two are perfectly interchangeable, Naive Bayes will give them a combined weight of 66%, while logistic regression will detect the multicollinearity and give them a combined weight of 50%. Naive Bayes is also generative, meaning that if you have a spam email detector, it is capable of composing new spam emails. An immediate application is to classify tomorrow as an up day or a down day (sketched after this list), as in the example here.
  3. Support Vector Machine (SVM). The SVM supervised classifier selects the separating line that has the greatest margin from the support-vector points. It uses the inner-product kernel trick to project the dataset into a suitable (higher) dimension (sketched after this list). It can also be used for regression, finding the line that encompasses the most points within its epsilon boundary. An example here.
  4. Decision Tree and Random Forest. A supervised, non-parametric decision tree constructs a top-down decision flow: for example, from the root, is RSI < 30? If yes, is volume > its two-week average? And so on, until a trading decision is made at a bottom leaf. Important features are usually located close to the top. A single decision tree is usually unstable to adding a new observation, so a bagging technique called a random forest is constructed to let each tree in the forest take a democratic plurality vote on whether to send an order or not. Similarly, a feature's average depth across trees measures its importance (sketched after this list). An example here.
  5. K-means and K-nearest neighbours (KNN). The former is an unsupervised, non-parametric clustering method: it iteratively pins k key points, or centroids, and then clusters neighbours around the centroids until convergence, as demonstrated here (and sketched after this list). The latter is a supervised classification method, explained here. Stocks can be grouped by K-means similarity instead of by sector. A monthly rebalancing strategy based on factor neighbours is backtested in [2].
  6. Principal Component Analysis (PCA) is an unsupervised learning technique for dimensionality reduction or feature extraction. It uses a linear transformation to rotate the original data onto orthogonal axes, as opposed to non-linear techniques such as the autoencoder (AE) or Restricted Boltzmann Machine (RBM) discussed later in the deep learning part. A use case in portfolio management is to treat the first principal component as the market portfolio, so the second or third eigen-portfolio is supposed to be orthogonal to the market (sketched after this list). My note here.
  7. Ensemble Learning. It organizes multiple predictors together. There is bagging (bootstrap aggregation), seen in the random forest, which uses parallel trees to reduce mainly the variance by averaging; pasting, as opposed to bootstrapping, samples training instances without replacement. There is also boosting, which focuses on reducing the base model's bias (as well as variance). Common boosting algorithms include AdaBoost (Adaptive Boosting), where trees are built sequentially and more attention is given to previously underfit hard cases; and gradient boosting, which fits against the previous residual errors: the next predictor is obtained by taking a gradient-descent step on the previous predictor in function space, which handles general loss formats beyond L1 MAE (Mean Absolute Error) and L2 MSE (Mean Squared Error). A good simple illustration can be found here. Two popular gradient boosting algorithms are XGBoost and LightGBM. Similar to bagging, the gradient becomes stochastic by drawing sub-samples from the training set to reduce correlations between trees (sketched after this list). In stacking (stacked generalization), a model called the blender does the voting aggregation [1].
  8. Hidden Markov Model (HMM) and Gaussian Mixture Model (GMM). Both are model-based state-space models, solved by the expectation-maximization algorithm. Starting with a prior, the expectation step estimates the posterior via the conditional likelihood (prior → posterior); then, in the maximization step, the prior is updated with new estimates based on the maximized posterior (posterior → prior). This loop continues until the parameters converge. Once the underlying states are estimated, a timing strategy can be executed based on the current state and the transition probabilities (sketched after this list). I tried HMM here and GMM here.
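
The following sketches walk through the list above in order. They all run on synthetic data with invented coefficients and hyperparameters; they illustrate the mechanics only, not tested strategies. First, a factor regression with OLS next to an L1-regularized Lasso fit (the factor names, loadings, and alpha are made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

rng = np.random.default_rng(2)
n = 250
factors = rng.normal(0, 0.01, (n, 3))               # stand-ins for e.g. market, size, value
betas = np.array([1.2, 0.5, -0.3])                  # invented true loadings
stock = factors @ betas + rng.normal(0, 0.005, n)   # synthetic stock returns

ols = LinearRegression().fit(factors, stock)
lasso = Lasso(alpha=5e-5).fit(factors, stock)       # L1 shrinks weak loadings, possibly to zero
print("OLS betas:  ", ols.coef_.round(3))
print("Lasso betas:", lasso.coef_.round(3))
```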
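
Next, logistic regression against Gaussian Naive Bayes on an up-day/down-day label. The two features are invented, and the data is pure noise, so both probabilities should hover near 0.5:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(3)
returns = rng.normal(0, 0.01, 600)
X = np.column_stack([returns[:-1], np.abs(returns[:-1])])  # yesterday's return and its size
y = (returns[1:] > 0).astype(int)                          # 1 = up day

logit = LogisticRegression().fit(X, y)
nb = GaussianNB().fit(X, y)
today = np.array([[0.004, 0.004]])
print("P(up), logistic:   ", logit.predict_proba(today)[0, 1].round(3))
print("P(up), naive Bayes:", nb.predict_proba(today)[0, 1].round(3))
```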
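
A kernel-trick illustration for the SVM: a circular class boundary defeats the linear kernel but not the RBF kernel:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = rng.normal(0, 1, (300, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # circle: not linearly separable

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)                   # the kernel lifts the data implicitly
print("linear-kernel accuracy:", round(linear.score(X, y), 3))
print("RBF-kernel accuracy:   ", round(rbf.score(X, y), 3))
```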
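
For the random forest, the toy label below follows an invented "oversold plus volume" rule. Note that scikit-learn reports impurity-based importances rather than average depth, but the ranking reads the same way:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(5)
n = 500
rsi = rng.uniform(0, 100, n)
vol_ratio = rng.lognormal(0, 0.3, n)            # volume relative to a two-week average
noise = rng.normal(0, 1, n)                     # an irrelevant feature
X = np.column_stack([rsi, vol_ratio, noise])
y = ((rsi < 30) & (vol_ratio > 1)).astype(int)  # toy rule the forest should rediscover

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
for name, imp in zip(["RSI", "volume_ratio", "noise"], forest.feature_importances_):
    print(f"{name}: importance {imp:.3f}")
```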
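
A K-means sketch that groups stocks by invented factor exposures instead of sectors:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
# rows = stocks, columns = made-up exposures (say, value and momentum)
exposures = np.vstack([
    rng.normal([1.0, 0.0], 0.1, (10, 2)),   # one group of similar stocks
    rng.normal([0.0, 1.0], 0.1, (10, 2)),   # another group
])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(exposures)
print("cluster labels:", labels)
```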
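
A PCA sketch where five synthetic stocks share one market driver, so the first component absorbs most of the variance and loads roughly equally on every stock:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
market = rng.normal(0, 0.01, 250)                    # common market driver
returns = np.column_stack([market + rng.normal(0, 0.005, 250) for _ in range(5)])

pca = PCA(n_components=3).fit(returns)
print("explained variance ratios:", pca.explained_variance_ratio_.round(3))
print("first eigen-portfolio:", pca.components_[0].round(3))  # ~equal weights: a market proxy
```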
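
A gradient-boosting sketch; setting subsample below 1 is what makes the gradient stochastic, in the spirit of XGBoost and LightGBM (the target function and settings are arbitrary):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(8)
X = rng.normal(0, 1, (400, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 400)

# each stage fits the gradient of the loss with respect to the previous prediction
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                subsample=0.8, random_state=0).fit(X, y)
print("train R^2:", round(gbr.score(X, y), 3))
```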
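
Lastly, a two-regime HMM on synthetic calm/stressed returns, assuming the third-party hmmlearn package; the regime means and volatilities are invented:

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM   # assumes hmmlearn is installed

rng = np.random.default_rng(9)
calm = rng.normal(0.0005, 0.005, (300, 1))       # low-volatility regime
stressed = rng.normal(-0.001, 0.02, (100, 1))    # high-volatility regime
returns = np.vstack([calm, stressed, calm])

hmm = GaussianHMM(n_components=2, covariance_type="diag",
                  n_iter=100, random_state=0).fit(returns)
states = hmm.predict(returns)
print("estimated transition matrix:\n", hmm.transmat_.round(3))
print("current regime:", states[-1])
```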

Deep Learning and Reinforcement Learning

Deep Learning uses multi-layer artificial neural networks (ANN) to solve machine learning problems. It consists of a forward-propagation step to forecast and a backpropagation step that uses the chain rule to auto-differentiate gradients. Techniques such as He initialization, batch normalization, and gradient clipping are applied to alleviate the vanishing or exploding gradient issues that persist in deep networks. Unlike the fully connected multi-layer perceptron (MLP), a convolutional neural network (CNN) exploits spatial local correlation and drastically decreases the parameter count through spatial weight sharing. A recurrent neural network (RNN) exploits temporal local correlation and drastically decreases the parameter count through temporal weight sharing; it also introduces a memory cell representing all past states to make the model approximately markovian. An autoencoder (AE) passes data through a narrow middle layer that serves as a dimension reduction when the input is recovered at the output layer. A generative adversarial network (GAN) constructs one net for generation and another for discrimination, and the two learn together. Lastly, the multi-layer structure makes deep learning ready for transfer learning. An example of an RNN for stock forecasting is here, and a minimal sketch follows below.
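
This is a sketch of the RNN idea in TF2/Keras, not the linked example's actual code: a small LSTM forecasting the next-step return from a rolling window of synthetic returns. The window length, layer size, and training budget are arbitrary assumptions.

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(10)
returns = rng.normal(0, 0.01, 1000).astype("float32")

lookback = 20
X = np.stack([returns[i:i + lookback] for i in range(len(returns) - lookback)])
y = returns[lookback:]
X = X[..., np.newaxis]                                    # (samples, timesteps, features)

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(lookback, 1)),  # weights shared across time steps
    tf.keras.layers.Dense(1),                             # next-step return forecast
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=2, batch_size=32, verbose=0)
print("next-step forecast:", model.predict(X[-1:], verbose=0)[0, 0])
```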

Reinforcement Learning studies the interaction between an environment and an agent. At the centre is the Bellman equation. According to [4], there are two sets of Bellman equations, one for expectation and the other for optimality. In each set there are four equations: s to s, s to a to s, a to a, and a to s to a, where a stands for actions and s stands for states, so in total there are eight equations. The key is to solve the Bellman equation iteratively. There are generally three categories of methods to do so: policy-gradient Actor, Q-function Critic, or Actor-Critic, and sampling can be either Monte Carlo or Temporal Difference (TD) [4]; a tabular Q-learning sketch follows below. Techniques such as the replay buffer and the target network are commonly used to speed up convergence, and importance sampling theoretically supports off-policy learning. I tried some models in TF2 here. Sometimes it's easier than reading a package such as OpenAI Baselines here and Stable Baselines here.
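
To ground the Bellman machinery, here is a sketch of tabular Q-learning with a temporal-difference update on a toy two-regime market. The states, rewards, and transition probabilities are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(11)
n_states, n_actions = 2, 2        # regimes: 0 = calm, 1 = stressed; actions: 0 = flat, 1 = long
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.95, 0.1

def step(state, action):
    """Invented environment: going long earns in calm and loses in stressed."""
    reward = (0.01 if state == 0 else -0.02) if action == 1 else 0.0
    probs = [0.9, 0.1] if state == 0 else [0.3, 0.7]
    return reward, rng.choice(n_states, p=probs)

s = 0
for _ in range(5000):
    a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())  # epsilon-greedy
    r, s_next = step(s, a)
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])  # TD step toward the Bellman target
    s = s_next

print("learned Q-values:\n", Q.round(4))   # expect: long preferred in calm, flat in stressed
```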

Deep learning research has boomed since 2015. [5] has figures visualizing how many papers have been published under which topics and in which journals; be sure to check out at least those graphs. [5] has also done a good job of categorization, so I won't repeat it. Below I just touch on some selected fields and their online resources.

  1. Financial text mining and sentiment analysis. Arguably one of the most promising fields in financial deep learning is the ability to process alternative data such as text, images, and speech. It's an obvious edge to parse and understand President Trump's tweets before others do. There are lots of start-ups providing, for example, satellite images or real-time sensor data on mall traffic, as listed in [2] and here. There are two kinds of news: scheduled, such as the EIA inventory report, and unscheduled, such as political events erupting in the Middle East; both will impact oil prices, so your strategy should be prepared for both. Natural Language Processing (NLP) packages can help, as in the example here (and the sketch after this list).
  2. Algorithmic trading and market-making. Algorithmic trading generally builds on top of better price predictions. Here is an example applying deep reinforcement learning to both commodity futures and index markets. Market-makers face additional challenges such as adverse selection, optimal execution, and inventory risk. Back on the retail side, this project collects a selection of forecasting models and agents. Another promising Medium story is here.
  3. Portfolio management and asset pricing. This addresses issues such as factor selection and asset allocation. An example here constructs a smart, index-enhanced replication by autoencoding 25 underlying stocks (a sketch follows this list). Another example here empirically compares different machine learning models, from simple OLS to neural networks, on a rich factor space and a complete stock space. An interesting example here builds a framework to train a model-free reinforcement agent under a multi-stage MDP environment, with the policy being the rebalancing weights. The authors kindly open-sourced their code here, and a reimplementation using Amazon SageMaker is here. On the asset pricing front, an example of option pricing is here.
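
As a small taste of the NLP point above, here is a sketch using NLTK's general-purpose VADER sentiment scorer on invented headlines. A production system would want a model tuned to financial text; this only shows the plumbing.

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)   # one-time lexicon download
sia = SentimentIntensityAnalyzer()

headlines = [
    "EIA reports surprise crude inventory draw, prices rally",
    "Tensions escalate in the Middle East, supply disruption feared",
]
for h in headlines:
    print(f"{sia.polarity_scores(h)['compound']:+.3f}  {h}")   # compound score in [-1, 1]
```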
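
And for the autoencoder-based replication idea, a minimal Keras sketch that squeezes 25 synthetic stock-return series through a narrow bottleneck. The layer widths and training budget are arbitrary; a real replication would rank stocks by reconstruction quality against actual index members.

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(12)
returns = rng.normal(0, 0.01, (500, 25)).astype("float32")  # 25 synthetic stocks

autoencoder = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(25,)),
    tf.keras.layers.Dense(3, activation="linear"),            # the bottleneck
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(25, activation="linear"),           # reconstruct all 25 stocks
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(returns, returns, epochs=5, batch_size=32, verbose=0)

recon = autoencoder.predict(returns, verbose=0)
per_stock_error = np.mean((recon - returns) ** 2, axis=0)     # low error = well captured by the bottleneck
print("per-stock reconstruction error (first 5):", per_stock_error.round(6)[:5])
```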

DISCLAIMER: The author does NOT promise any future profits and does NOT take responsibility for any trading losses on any referenced strategies.
