Thursday, September 25, 2014

Understanding Online Learning - Part 1

Online learning is a form of machine learning with the following characteristics:

1. Supervised learning
2. Operates on streams of big data
3. Fast, lightweight models
4. Small(er) RAM footprint
5. Updated continuously
6. Adaptable to changes in the environment

Many machine learning algorithms train in batch mode: the model requires the entire batch of training data to be fed in at one time. To train, you select an algorithm, prepare your batch of data, train the model on the entire batch, and check the accuracy of your predictions. You then fine-tune your model by iterating this process and tweaking your data, inputs, and parameters. Most algorithms do not allow new batches of data to update and refine old models, so periodically you may need to retrain your models with the old and new data combined.
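The batch workflow can be sketched in a few lines of plain Python. This is a toy illustration, not any particular library's API: full-batch gradient descent for a tiny logistic regression model, where all the training data (made up here) must be in hand before training starts.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Batch mode: the entire training set is available up front.
X = [[0.0], [1.0], [2.0], [3.0]]   # one feature per row (toy data)
y = [0, 0, 1, 1]

w, b = 0.0, 0.0
for epoch in range(1000):           # each epoch iterates over the FULL batch
    grad_w = grad_b = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(w * xi[0] + b)
        grad_w += (p - yi) * xi[0]
        grad_b += (p - yi)
    w -= 0.1 * grad_w / len(X)      # one update per pass over all the data
    b -= 0.1 * grad_b / len(X)

preds = [1 if sigmoid(w * xi[0] + b) >= 0.5 else 0 for xi in X]
```

Note that adding a new row of data would mean rerunning the whole loop over the combined data set, which is exactly the retraining cost described above.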

There are a number of benefits to the batch approach:
  1. Many ML algorithms to choose from.  Most algorithms are developed in batch form at universities, and the batch approach aligns with traditional statistical practice.
  2. Better accuracy.  Since the batch represents the "known universe", many mathematical techniques have been developed to improve model accuracy.
  3. Can be effective with smaller data sets.  Hundreds or thousands of rows can result in good ML models.  (Internally, many algorithms iterate over the data set to learn the desired characteristics and improve the results.)

Online learning takes a stream approach to learning.  Instead of processing an entire batch of data, the online learning algorithm sees one row at a time from a larger stream, runs a prediction, checks the error, and updates the model continuously.  (In a production setting, you may not have the true target value immediately, so you may need to split the predict and update phases into separate processes.)
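The predict / check / update loop above can be sketched in plain Python (an illustrative toy, with a made-up target rule standing in for the true labels arriving on the stream):

```python
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def stream(n=5000):
    # Stand-in for a real data stream; the labeling rule is hypothetical.
    for _ in range(n):
        x = random.uniform(0.0, 4.0)
        yield x, (1 if x > 2.0 else 0)

w, b, lr = 0.0, 0.0, 0.1
mistakes = 0

for x, y in stream():
    p = sigmoid(w * x + b)              # 1) predict on the incoming row
    if (p >= 0.5) != (y == 1):          # 2) check the error
        mistakes += 1
    w -= lr * (p - y) * x               # 3) update the model immediately
    b -= lr * (p - y)
```

Only one row is ever held in memory, and the model is usable (for prediction) at every point along the stream. Splitting step 1 from step 3 into separate processes gives the production pattern mentioned above.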

There are some advantages, and a few drawbacks, to the online learning approach.

Advantages:
  1. Big Data: Extremely large data sets are difficult to work with; model development and algorithm training become cumbersome.  With online learning, you can break the data down into manageable chunks and feed them in.
  2. Small(er) RAM footprint: Since only one row is held in memory at a time, the model uses far less RAM.
  3. Fast: Because they have to be.  Each row must be processed before the next one arrives.
  4. Adaptive: As new data comes in, the learning algorithm adjusts the model and automatically adapts to changes in the environment.  This is useful for keeping your model in sync with changes in human behavior, such as click-through rates or financial markets.  With traditional batch algorithms, newer behavior is blended in with the older data, so these subtle shifts are lost.  With online learning, the model continuously moves toward the latest version of reality.

Drawbacks:
  1. It requires a lot of data.  Since the learning happens as the data goes by, model accuracy is developed over millions of rows, not thousands.  (You should pre-train your model before production use, of course.)
  2. Predictions are not as accurate.  You give up some predictive accuracy as a trade-off for the speed and size of the solution.
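The adaptive point is worth seeing concretely. In this sketch (toy data, made-up rules), the "environment" changes partway through the stream, and the per-row updates pull the model's decision boundary toward the new behavior with no retraining step:

```python
import math
import random

random.seed(1)

w, b, lr = 0.0, 0.0, 0.05

def update(x, y):
    """One online step: predict on the row, then nudge the model."""
    global w, b
    p = 1.0 / (1.0 + math.exp(-(w * x + b)))
    w -= lr * (p - y) * x
    b -= lr * (p - y)

# Phase 1: the true decision boundary is at x = 1.
for _ in range(3000):
    x = random.uniform(0.0, 4.0)
    update(x, 1 if x > 1.0 else 0)
boundary_before = -b / w               # model's learned boundary so far

# Phase 2: the environment shifts; the boundary moves to x = 3.
for _ in range(3000):
    x = random.uniform(0.0, 4.0)
    update(x, 1 if x > 3.0 else 0)
boundary_after = -b / w                # the model has drifted rightward
```

A batch model trained once on all 6,000 rows would blend the two regimes together; the online model simply tracks wherever the stream is now.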
In Part 2, we'll look at some of the existing tools that can be used for online learning.  In later posts, I will relate the concept of adaptability to the field of black box stock trading systems.  Black box trading systems are notorious for making money at first and then failing miserably when the market morphs into something totally different.

Meanwhile, here are some interesting links to learn more:

Enjoy!