Monday, May 16, 2016

Comparing Daily Stock Market Returns to a Coin Flip


In this post, we examine the Random Walk Hypothesis as applied to daily stock market returns.   Background information on this can be found here: https://en.wikipedia.org/wiki/Random_walk_hypothesis .

I wrote a bit of code that uses some feature engineering to try to predict a coin flip, then applies the same approach to daily S&P 500 moves (where 1 is an up day and 0 is a down day). The next day's outcome is the classification label for the current day.
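To make the labeling concrete, here is a minimal sketch of turning a list of closing prices into 'flips' and pairing each day with the next day's outcome. The function names are mine, not from the GitHub code, and treating an unchanged close as a down day is an assumption.

```python
def to_flips(closes):
    """Turn closing prices into 1 (up day) / 0 (down day) 'flips'.

    Assumption: a day with an unchanged close counts as 0 (down).
    """
    return [1 if today > yesterday else 0
            for yesterday, today in zip(closes, closes[1:])]


def label_next_day(flips):
    """Pair each day's outcome with the next day's outcome (the label).

    The final day is dropped because it has no next day to learn from.
    """
    return list(zip(flips[:-1], flips[1:]))


closes = [10.0, 11.0, 10.5, 10.5, 12.0]   # made-up prices, for illustration only
flips = to_flips(closes)
pairs = label_next_day(flips)              # (current day, next-day label)
```

Each pair reads as (what happened today, what the model should predict for tomorrow).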

Program Setup and Feature Generation

You can find my code and data file on GitHub, where you can read it, download it and tweak it until you feel satisfied with the results.
https://github.com/anlytcs/blog_code/tree/master/coin_flip

The features designed for this experiment consisted of:
1. Previous, i.e. was yesterday up or down.
2. A count of Heads while in a Heads streak.
3. A count of Tails while in a Tails streak.
4. A count of Heads and Tails in a set of lookback periods.  That is, in the last 5, 10, 20, 30, etc. days, how many heads and how many tails were there?  This is meant to capture any observable trends (not necessarily valid for coin flips, but believed to be a valuable tool in stock trading).

Here is a bit of code where I identify the feature labels:

LOOKBACKS = [5,10,20,30,40,50,100]
HEADER_LINE = ['label','previous','heads_streak','tails_streak']
for i in LOOKBACKS:
    HEADER_LINE.append('heads_'+str(i))
for i in LOOKBACKS:
    HEADER_LINE.append('tails_'+str(i))
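For readers who want to see how rows matching those labels might be filled in, here is a rough sketch. It is not the code from the repository. It assumes that 'previous' is the current day's outcome (yesterday, from the point of view of the day being predicted), that the streaks and lookback windows include the current day, and that days without enough history for the longest lookback are skipped.

```python
LOOKBACKS = [5, 10, 20, 30, 40, 50, 100]


def build_rows(flips, lookbacks=LOOKBACKS):
    """Build one feature dict per usable day from a list of 0/1 flips."""
    rows = []
    heads_streak = tails_streak = 0
    longest = max(lookbacks)
    # Stop one short of the end: the last day has no next-day label.
    for t in range(len(flips) - 1):
        today = flips[t]
        # Extend the running streak that today belongs to, reset the other.
        if today == 1:
            heads_streak += 1
            tails_streak = 0
        else:
            tails_streak += 1
            heads_streak = 0
        if t + 1 < longest:
            continue  # not enough history yet for the longest window
        row = {
            'label': flips[t + 1],       # next day's outcome
            'previous': today,
            'heads_streak': heads_streak,
            'tails_streak': tails_streak,
        }
        for n in lookbacks:
            window = flips[t + 1 - n:t + 1]   # last n days, including today
            heads = sum(window)
            row['heads_%d' % n] = heads
            row['tails_%d' % n] = n - heads
        rows.append(row)
    return rows


# Tiny example with short lookbacks so the result is easy to check by hand.
example_rows = build_rows([1, 1, 0, 0, 1], lookbacks=[2, 3])
```

The keys of each row line up with the names in HEADER_LINE, so the rows can be written out as CSV or fed to a classifier.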

The experiment consisted of three runs:
1. 100,000 pseudo random number coin flips. 
2. 4600 daily observations, going back to 1995, of the S&P 500 index transformed into coin-flips.  That is the up_or_down column in the data file. 
https://github.com/anlytcs/blog_code/blob/master/coin_flip/GSPC_cleaned.csv
Data Source: Yahoo finance.
3. 4600 pseudo random number coin flips.

Each run used 5-fold cross validation, plotted the ROC curve for each fold, and averaged the five AUCs for a final 'score'.  Here are the results:



Output of Run

=======================================================

Start:  Fri May 13 16:18:50 2016
Do 100000 Coin Flips
Counts: Counter({1: 50053, 0: 49947})
Heads: 50.05300 percent
Tails: 49.94700 percent
Build Features:  Fri May 13 16:18:50 2016
Build Model:  Fri May 13 16:19:45 2016
Train and Do Cross Validation:  Fri May 13 16:19:45 2016
[ 0.49705482  0.496599    0.50169547  0.49761318  0.49758308]
Average:  0.498109110082
Accuracy: 0.4981 (+/- 0.003664)




=======================================================

Do SP500 (4600 days)
Build Features:  Fri May 13 16:20:09 2016
Build Model:  Fri May 13 16:20:10 2016
Train and Do Cross Validation:  Fri May 13 16:20:10 2016
[ 0.53477333  0.54759786  0.53681652  0.55128712  0.53932995]
Average:  0.54196095771
Accuracy: 0.5420 (+/- 0.012769)







=======================================================

Do 4600 Coin Flips
Counts: Counter({1: 2322, 0: 2278})
Heads: 50.47826 percent
Tails: 49.52174 percent
Build Features:  Fri May 13 16:20:33 2016
Build Model:  Fri May 13 16:20:34 2016
Train and Do Cross Validation:  Fri May 13 16:20:34 2016
[ 0.48760588  0.52287443  0.52944808  0.5124338   0.50364302]
Average:  0.511201041299
Accuracy: 0.5112 (+/- 0.029456)
End:  Fri May 13 16:20:51 2016






=======================================================
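A note on reading the output: the log line is labeled 'Accuracy', but the numbers are the per-fold AUCs described earlier. The short sketch below recomputes the 'Average' and '+/-' figures from the fold scores printed above. Judging from the numbers, the '+/-' value works out to two population standard deviations of the five fold scores. That is inferred from the output, not taken from the code, so treat it as an assumption.

```python
from statistics import mean, pstdev

# Per-fold AUCs copied from the run output above.
RUNS = {
    '100000 coin flips': [0.49705482, 0.496599, 0.50169547, 0.49761318, 0.49758308],
    'SP500 (4600 days)': [0.53477333, 0.54759786, 0.53681652, 0.55128712, 0.53932995],
    '4600 coin flips': [0.48760588, 0.52287443, 0.52944808, 0.5124338, 0.50364302],
}


def summarize(scores):
    """Return (mean, 2 * population standard deviation) of the fold scores."""
    return mean(scores), 2 * pstdev(scores)


for name, scores in RUNS.items():
    avg, spread = summarize(scores)
    print('%-18s Average: %.4f (+/- %.6f)' % (name, avg, spread))
```

Note how the spread shrinks from about 0.029 to about 0.004 when the number of coin flips goes from 4600 to 100,000.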


Commentary and Conclusions


1. The 100,000 coin flips run shows exactly what you would expect.  An average AUC of 0.4981 is almost the textbook picture of a random classifier on a ROC plot, where random guessing gives an AUC of 0.5.  https://en.wikipedia.org/wiki/Receiver_operating_characteristic .  For 100k flips and all the generated features, the 5 ROC curves basically follow the 45 degree line.  So there was no benefit over randomly guessing what the next flip will be.

2. For 4600 days of S&P 500 'flips', there appears to be a very slight edge from the model, with an average AUC of 0.5420.  Not enough to risk actual money.

3. The 4600 coin flips output raises an interesting question.  The average AUC was 0.5112, and the five ROC curves only somewhat hug the 45 degree line.  If pure coin flips can score 0.51 at this sample size, how much of the 0.54 found in run #2 (4600 days of S&P 500) is real signal and how much is small-sample noise?  I wonder what would happen if we had 100,000 days of S&P 500 data.  That is nearly 400 years' worth.  Maybe someone motivated enough could dig up 100,000 hourly readings and try this experiment again.  I would be curious to see the results.
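To put rough numbers on points 1 and 3, here is a small standard-library sketch. It computes AUC directly from ranks (the Mann-Whitney view of AUC) and uses the textbook formula for the standard deviation of AUC when the scores carry no information about the labels. The function names are mine. The formula assumes continuous, tie-free scores, and the fold sizes below (a fifth of each run, evenly split between heads and tails) are my assumptions. It is a yardstick for fold-to-fold noise, not a significance test of the S&P 500 result.

```python
import math
import random


def auc(labels, scores):
    """AUC as the chance that a random 1 outscores a random 0 (ties count half)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Group equal scores so they share an average rank.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0   # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def null_auc_std(n_pos, n_neg):
    """Standard deviation of AUC when scores are unrelated to the labels."""
    return math.sqrt((n_pos + n_neg + 1) / (12.0 * n_pos * n_neg))


# Random scores against random labels: AUC should land near 0.5.
rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(920)]
scores = [rng.random() for _ in range(920)]
random_auc = auc(labels, scores)

small_fold = null_auc_std(460, 460)       # assumed: 4600 / 5 rows, balanced
large_fold = null_auc_std(10000, 10000)   # assumed: 100000 / 5 rows, balanced
```

Under these assumptions a single fold of about 920 rows wanders by roughly 0.019 around 0.5 on pure noise, while a fold of 20,000 rows wanders by only about 0.004.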

Have we disproven the Random Walk Hypothesis?  No.  There are much better mathematical minds than mine who have effectively put that theory to rest.  An interesting and thoroughly enjoyable read on this subject is The Misbehavior of Markets by Benoit Mandelbrot.  http://www.amazon.com/Misbehavior-Markets-Fractal-Financial-Turbulence/dp/0465043577/ref=asap_bc?ie=UTF8
You can also read the abstract here:    http://users.math.yale.edu/users/mandelbrot/web_pdfs/getabstract.pdf

What I think we have shown:
1. Machine learning cannot predict a coin toss  (but we knew that already).
2. Next day stock price forecasting is a hard problem.


Feedback?  Hit me up on Twitter @anlytcs

