
Thursday, May 26, 2016

How To Work For Free - The Phony Job Interview Scam



I just returned home from a job interview in Boston.  It wasn't until I had sat in the Logan airport gate area for four hours, thinking about how frustrating and annoying the interviews had been, that I suddenly realized I had been scammed.  (I won't mention the company name for fear of a lawsuit.)

When searching for a new job opportunity beware of this 'Phony' job interview scam. 

It goes like this: 
  1. A recruiter calls with a job opportunity for an unusually high dollar amount and really great perks like working remotely.  The story is that they have had a really hard time finding qualified candidates with the specific skills you possess and they think you are a great fit.
  2. You go through a few steps in the process (phone interview, Skype, etc.) and then they bring you in for a series of face-to-face interviews with other members of the team.  In my case, they shelled out money for an airline ticket, so I figured not only were they serious, but that my chances were really good.
  3. When you arrive, there is a group of people waiting for you, all are extremely friendly (You think to yourself, this seems like a nice place to work!).
  4. They show you the problem they are having and pump you for ideas on how you would solve the problem.  
  5. Feeling pressured to show your creative and technical abilities, you dig deep to pull out every idea that you can to help solve their problem.
  6. Each time you explain an idea, they respond with, "But how would you solve it?"  They never acknowledge that any of your ideas are good.  In fact, they act as if it is not good enough.   You feel compelled to try harder and dig deeper.
  7. After three or so hours of this continual 'pumping' for ideas, they abruptly end the meeting, thank you for your time and hustle you out the door.
Clues that they were not serious:
  1. No one had seen or read my resume
  2. They show you the problem they are having and pump you for ideas on how you would solve it.  Your ideas are never good enough and they pump you for more.
  3. Thinking you need to prove your value, you deliver more and more ideas for their problem solving.
  4. There is apparently no shortage of the required skills.  In this case, all were knowledgeable about python, algorithms, machine learning, etc.
  5. There is no discussion about the job terms, working conditions, team,  equipment, logistics etc.

So for the cost of a one day travel ticket (under $300 total), they received a wealth of information and ideas about how to solve their problem.  For me, I got $0 pay for my free consulting, plus I sat for four hours on a plane and over six hours in the waiting area at Boston's Logan Airport (flight delays etc). 

Have you had a similar experience?  Does anyone have ways to combat this dirty trick?

All comments are welcome.  Comment below or on twitter @anlytcs





Monday, May 16, 2016

Comparing Daily Stock Market Returns to a Coin Flip


In this post, we examine the Random Walk Hypothesis as applied to daily stock market returns.   Background information on this can be found here: https://en.wikipedia.org/wiki/Random_walk_hypothesis .

So I applied my skills to this problem and came up with a bit of code which uses some feature engineering to try to predict a coin flip, then applies the same approach to daily S&P 500 moves (where 1 is an up day and 0 is a down day).   The next day's outcome is the classification label for the current day.

Program Setup and Feature Generation

You can find my code and data file on GitHub, where you can read it, download it and tweak it until you feel satisfied with the results.
https://github.com/anlytcs/blog_code/tree/master/coin_flip

The features designed for this experiment consisted of:
1. Previous, i.e. was yesterday up or down.
2. A count of Heads while in a Heads streak.
3. A count of Tails while in a Tails streak.
4. A count of Heads and Tails in a set of lookback periods.  That is, in the last 5, 10, 20, 30, etc. days, how many heads and how many tails were there?  This is to capture any observable trends (not necessarily valid for coin flips, but believed to be a valuable tool in stock trading).  Here is a bit of code where I identify the feature labels:

LOOKBACKS = [5,10,20,30,40,50,100]
HEADER_LINE = ['label','previous','heads_streak','tails_streak']
for i in LOOKBACKS:
    HEADER_LINE.append('heads_'+str(i))
for i in LOOKBACKS:
    HEADER_LINE.append('tails_'+str(i))
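
For readers who want to see how features like these might be computed, here is a minimal sketch.  It is not the code from the GitHub repository: the function name `build_features`, the decision to skip the first rows that lack a full lookback window, and the exact streak convention are assumptions made for illustration.

```python
LOOKBACKS = [5, 10, 20, 30, 40, 50, 100]


def build_features(flips, lookbacks=LOOKBACKS):
    """Return one feature dict per day; the label is the NEXT day's outcome.

    flips is a list of 0/1 values (1 = heads / up day, 0 = tails / down day).
    Hypothetical helper: names and conventions are assumptions, not the repo's.
    """
    rows = []
    heads_streak = tails_streak = 0
    start = max(lookbacks)  # skip days that lack a full lookback window
    for t in range(len(flips) - 1):  # the last day has no next-day label
        # Update the running streak counters with today's outcome.
        if flips[t] == 1:
            heads_streak += 1
            tails_streak = 0
        else:
            tails_streak += 1
            heads_streak = 0
        if t + 1 < start:
            continue
        row = {
            'label': flips[t + 1],      # tomorrow's outcome
            'previous': flips[t - 1],   # yesterday's outcome
            'heads_streak': heads_streak,
            'tails_streak': tails_streak,
        }
        for n in lookbacks:
            window = flips[t + 1 - n:t + 1]  # the last n days, including today
            heads = sum(window)
            row['heads_' + str(n)] = heads
            row['tails_' + str(n)] = n - heads
        rows.append(row)
    return rows
```

The keys of each row match the names in `HEADER_LINE`, so the rows could be written straight to a CSV file with that header.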

The experiment consisted of three runs:
1. 100,000 pseudo random number coin flips. 
2. 4600 daily observations, going back to 1995, of the S&P 500 index transformed into coin-flips.  That is the up_or_down column in the data file. 
https://github.com/anlytcs/blog_code/blob/master/coin_flip/GSPC_cleaned.csv
Data Source: Yahoo finance.
3. 4600 pseudo random number coin flips.

Each run used 5-fold cross validation, plotted the ROC curve for each fold, and averaged the five AUCs for a final 'score'.  Here are the results:
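
As a reference point for reading the numbers, here is a small sketch of what the AUC measures.  It uses only the standard library and does not train a model: the "predictions" are random guesses, which is the baseline any real model has to beat.  The `auc` helper is an illustrative implementation of the rank-sum identity, not the code used in the runs.

```python
import random


def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity.

    Equals the probability that a randomly chosen positive example gets a
    higher score than a randomly chosen negative one (ties count as half).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the run of tied scores and give each the average rank.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


random.seed(0)
flips = [random.randint(0, 1) for _ in range(100000)]
guesses = [random.random() for _ in flips]  # scores with no information

fold = len(flips) // 5
fold_aucs = [auc(flips[k * fold:(k + 1) * fold], guesses[k * fold:(k + 1) * fold])
             for k in range(5)]
mean_auc = sum(fold_aucs) / len(fold_aucs)
print(fold_aucs)
print(mean_auc)
```

An AUC near 0.5 means the scores rank heads above tails about half the time, which is no better than guessing.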



Output of Run

=======================================================

Start:  Fri May 13 16:18:50 2016
Do 100000 Coin Flips
Counts: Counter({1: 50053, 0: 49947})
Heads: 50.05300 percent
Tails: 49.94700 percent
Build Features:  Fri May 13 16:18:50 2016
Build Model:  Fri May 13 16:19:45 2016
Train and Do Cross Validation:  Fri May 13 16:19:45 2016
[ 0.49705482  0.496599    0.50169547  0.49761318  0.49758308]
Average:  0.498109110082
Accuracy: 0.4981 (+/- 0.003664)




=======================================================

Do SP500 (4600 days)
Build Features:  Fri May 13 16:20:09 2016
Build Model:  Fri May 13 16:20:10 2016
Train and Do Cross Validation:  Fri May 13 16:20:10 2016
[ 0.53477333  0.54759786  0.53681652  0.55128712  0.53932995]
Average:  0.54196095771
Accuracy: 0.5420 (+/- 0.012769)







=======================================================

Do 4600 Coin Flips
Counts: Counter({1: 2322, 0: 2278})
Heads: 50.47826 percent
Tails: 49.52174 percent
Build Features:  Fri May 13 16:20:33 2016
Build Model:  Fri May 13 16:20:34 2016
Train and Do Cross Validation:  Fri May 13 16:20:34 2016
[ 0.48760588  0.52287443  0.52944808  0.5124338   0.50364302]
Average:  0.511201041299
Accuracy: 0.5112 (+/- 0.029456)
End:  Fri May 13 16:20:51 2016






=======================================================


Commentary and Conclusions


1. The 100,000 coin flips run shows exactly what you would expect.  With an average AUC of 0.4981, this is almost the textbook picture of a random classifier (AUC = 0.5).  See https://en.wikipedia.org/wiki/Receiver_operating_characteristic .  For 100k flips and all the generated features, the 5 ROC curves basically follow the 45 degree line.  So, there was no benefit found over randomly guessing what the next flip will be.

2. For 4600 days of S&P 500 'flips', there appears to be a very slight edge from the model, with an average AUC of 0.5420.  Not enough to risk actual money.

3. Now, the 4600 coin flips output raises some interesting questions.  The average AUC was 0.5112 and the five ROC curves somewhat hug the 45 degree line.  If pure noise can score 0.51 at this sample size, how valid is the 0.54 found in run #2 (4600 days of S&P 500)?  I wonder what would happen if we had 100,000 days of S&P 500 data.  That is about 380 years' worth.  Maybe someone motivated enough could dig up 100,000 hourly readings and try this experiment again.  I would be curious to see the results.
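
One rough way to put numbers on that question is to ask how far a fold's AUC can drift from 0.5 by chance alone.  The sketch below uses the standard null variance of the Mann-Whitney statistic.  It assumes roughly balanced classes and treats the test rows as independent, which is only an approximation here because the lookback features overlap from day to day.

```python
import math


def null_auc_std(n_pos, n_neg):
    """Standard deviation of the AUC when the scores carry no information.

    Under the null, the Mann-Whitney U statistic has variance
    n_pos * n_neg * (n_pos + n_neg + 1) / 12, and AUC = U / (n_pos * n_neg).
    """
    return math.sqrt((n_pos + n_neg + 1) / (12.0 * n_pos * n_neg))


for total in (4600, 100000):
    fold = total // 5   # test-fold size in 5-fold cross validation
    half = fold // 2    # assumption: roughly balanced heads and tails
    print(total, fold, round(null_auc_std(half, fold - half), 4))
```

Under these assumptions, a single fold of 920 rows has a chance spread of about 0.019 around 0.5, versus about 0.004 for a fold of 20,000 rows.  That is in line with the wider scatter of fold scores in the 4600-flip run compared to the 100,000-flip run.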

Have we disproven the Random Walk Hypothesis?  No.  There are much better mathematical minds than mine who have effectively put that theory to rest.  An interesting and thoroughly enjoyable read on this subject is The Misbehavior of Markets by Benoit Mandelbrot.  http://www.amazon.com/Misbehavior-Markets-Fractal-Financial-Turbulence/dp/0465043577/ref=asap_bc?ie=UTF8
You can also read the abstract here:    http://users.math.yale.edu/users/mandelbrot/web_pdfs/getabstract.pdf

What I think we have shown:
1. Machine learning cannot predict a coin toss  (but we knew that already).
2. Next day stock price forecasting is a hard problem.


Feedback?  Hit me up on Twitter @anlytcs


Wednesday, April 20, 2016

Stock Forecasting with Machine Learning - Are Stock Prices Predictable?




In the last two posts, I offered a "Pop-Quiz" on predicting stock prices.  Today, I would like to raise the most important question to ask when attempting to use any form of predictive analytics in the financial markets.  Do you even have a chance of getting reliable results?  Or are you wasting your time?  Back in 2003, when I first built the described Neural Network solution, it was my first naive take on the problem and I wasted a lot of time.

Today, with the expansion of machine learning research and mathematical techniques combined with the proliferation of open source tools, we are in a much better position to answer these questions directly.   A few months back a new algorithm came to my attention via an interesting post on the FastML blog entitled "Are Stocks predictable".   Check this link:  http://fastml.com/are-stocks-predictable/

The short story is this: A PhD student at Carnegie Mellon University named Georg Goerg developed an algorithm and published his findings in what he called 'Forecastable Component Analysis'.  This algorithm looks at a time series and tries to determine how much of it is noise vs. how much is signal.  The answer is provided as an 'Omega Score'.  The algorithm was also provided as an R package, ForeCA.

In English, if the data contains too much noise, attempts to predict the series will fail.  This is really useful for stock prices.  FastML shows that next day %changes for stock indexes have ridiculously low Omega scores, between 1.25% and 6%.  Not enough to bank on.

I discovered a similar effect in my research.  No matter how much you torture the input data, forecasting the next day's close is a fool's folly.    It is analogous to attempting to predict the flip of a coin.  However, what I have discovered (assuming I am interpreting the results correctly), is that as you go out in time, the results start to become more meaningful.  So, what would happen if you fed the ForeCA algorithm with Percentage change values for 1,5,10,15,20,25, and 30 days in the future ?
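
Since the original R code and data are not available, here is a small sketch of how such an input table might be built.  The function name, the column naming (chosen to resemble the `X1_DAY` ... `X30_DAY` labels in the output), and the choice to drop the last rows are all assumptions for illustration.

```python
HORIZONS = [1, 5, 10, 15, 20, 25, 30]


def forward_pct_changes(closes, horizons=HORIZONS):
    """For each day, the percentage change from today's close to the close
    n trading days later, for every n in horizons.

    Rows near the end of the series, where the longest horizon would run
    past the data, are dropped so that every column has the same length.
    """
    last = len(closes) - max(horizons)
    table = []
    for t in range(last):
        row = {}
        for n in horizons:
            row['X%d_DAY' % n] = 100.0 * (closes[t + n] / closes[t] - 1.0)
        table.append(row)
    return table
```

Each column of the resulting table is one time series, which is the shape of input that a multivariate forecastability analysis expects.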

Here are the results.   Note: ForeCA reorders the columns from most to least forecastable (after transformation), so for the sake of simplicity, just pay attention to the 'Orig' series' omega scores and the top right bar chart (bars labeled X1_DAY through X30_DAY).   As you can see, the signal-to-noise ratio, and with it your ability to forecast, improves as the number of days increases.


$Omega
Series 7  Series 4  Series 5  Series 6  Series 3  Series 2  Series 1
31.998529 28.954507 25.660565 23.572059 20.275582 11.857304  4.612705


$Omega.orig
   X1_DAY    X5_DAY   X10_DAY   X15_DAY   X20_DAY   X25_DAY   X30_DAY
 1.632106 11.253286 18.363721 22.831144 26.353855 29.138379 31.560240






 

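Again, assuming I am interpreting the results correctly, the 30-day series scores an Omega of 31.56%, on a scale where 0% is pure white noise and 100% is a perfectly forecastable signal.  (Omega measures how much structure the series contains; it is not literally the probability of getting a forecast right.)  Still not enough to bank on.  In the end, stock market success is not about the perfect algorithm or forecast or formula.  It is about managing risk when your signal goes wrong.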

(Note: I would have provided the R source code and input data, but it was left on my work laptop when I recently finished up a project with Cisco).











Friday, May 29, 2015

Stock Forecasting With Machine Learning - Seven Possible Errors



Here are at least seven reasons why pumping the last ten days of SP500 O/H/L/C/V into a neural network in an attempt to solve for the next day's O/H/L/C is a bad idea. 

1. Not enough data.  Ten days of data is simply not enough.
2. No feature engineering.  The plan used raw data.  A better approach might be to solve for percentage gain or loss.  How about the daily range converted to a percentage volatility value?  How about a volume spike true/false flag?
3. No separate train vs. test set.  You have no way of determining the accuracy of the model on unseen data.
4. If you are training and predicting in a trending market, the neural network is being asked to solve for values outside its known range of values.  Not a task that is well suited for a Neural Network.
5. Too many targets for one network.  The design asked a single Neural Network to solve for four output values at once.  A network can have multiple outputs, but a separate model per target is often simpler to tune and debug.
6. The Neural Network is a very brittle and opaque algorithm.  Sometimes it does not converge at all and when/if it does, it is very difficult to understand the results of the model.
7. Attempting to forecast the next day's numbers based on a series of End-of-Day values is a fool's folly.  For more information refer to this book: The (Mis)Behaviour of Markets: A Fractal View of Risk, Ruin and Reward by Benoit Mandelbrot.
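
To make points 2 and 3 concrete, here is a minimal sketch of the kind of feature engineering and train/test separation described above.  The function names, the 20-day volume window, and the "twice the average" spike threshold are illustrative assumptions, not settings from the original project.

```python
def engineer_features(bars, volume_window=20, spike_ratio=2.0):
    """Turn raw daily bars into relative features plus a next-day target.

    bars: list of dicts with 'open', 'high', 'low', 'close', 'volume',
    ordered oldest first.  All thresholds here are assumptions.
    """
    rows = []
    for t in range(volume_window, len(bars) - 1):
        today, yesterday, tomorrow = bars[t], bars[t - 1], bars[t + 1]
        # Average volume over the preceding window (excluding today).
        avg_volume = sum(b['volume'] for b in bars[t - volume_window:t]) / float(volume_window)
        rows.append({
            # Percentage gain or loss instead of a raw price level.
            'pct_change': 100.0 * (today['close'] / yesterday['close'] - 1.0),
            # Daily range expressed as a percentage volatility value.
            'range_pct': 100.0 * (today['high'] - today['low']) / today['close'],
            # Volume spike true/false.
            'volume_spike': today['volume'] > spike_ratio * avg_volume,
            # Target: next day's percentage change.
            'target_pct_change': 100.0 * (tomorrow['close'] / today['close'] - 1.0),
        })
    return rows


def chronological_split(rows, train_fraction=0.8):
    """Keep the test set strictly later in time than the training set."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]
```

Working in percentages also helps with point 4: a relative move stays in a similar range whether the index is at 1,000 or 2,000, while raw prices in a trending market do not.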
 

However, on the plus side, there was an important kernel of truth here.  The notion that the model needs to be adaptable to current market conditions.  This is the bane of many black box or mechanical trading systems.  If the model does not adapt to current conditions, the best it can do is average over long periods which are likely not suitable for today's market.

Bottom line.  This model is crap.  "Operation Make Millions" was naive and ill advised.  It never made $10.00 !

Feedback?  Hit me up on Twitter @anlytcs



Wednesday, October 29, 2014

Stock Forecasting With Machine Learning - Pop-Quiz


A few years back, I decided that machine learning algorithms could be designed to forecast the next day's Open, High, Low, Close for the SP500 index.   Armed with that information, it would be a cinch to make $Millions !

The following chart shows the initial design of the Neural Network:




As it turns out, this ML model did not work.  In fact, this approach is completely wrong !  Can you think this through and come up with reasons why ?

(Note: This slide was taken from a recent presentation entitled "Building Effective Machine Learning Applications"). 

Tuesday, October 28, 2014

Hire a Data Scientist for only $5.00

Got a Machine Learning problem? Got $5.00? Consider it solved !

Visit this page on Fiverr.com:

https://www.fiverr.com/aliabbasjp/solve-a-machine-learning-and-intelligence-problem-to-gather-insights-from-your-data?context=adv.cat_10.subcat_143&context_type=auto&funnel=2014102811022631011277080

While I don't have any knowledge of the quality of their work, I suspect this price is reflective of the cost of living in their location minus a discount for the promotional benefit.

 The Internet is the great equalizer.

Thursday, September 25, 2014

Understanding Online Learning - Part 1

Online learning is a form of machine learning with the following characteristics:

1. Supervised learning
2. Operates on streams of big data
3. Fast, lightweight models
4. Small(er) RAM footprint
5. Updated continuously
6. Adaptable to changes in the environment

Many machine learning algorithms train in batch mode.  The model requires the entire batch of training data to be fed in at one time.  To train, you select an algorithm, prepare your batch of data, train the model on the entire batch, check the accuracy of your predictions.  You then fine tune your model by iterating your process and by tweaking your data, inputs and parameters.   Most algorithms do not allow new batches of data to update and refine old models.  So periodically you may need to retrain your models with the old and new data.

There are a number of benefits to the batch approach:
  1. Many ML algorithms to choose from.  You have many more algorithms because that is typically how they are developed at the universities and the batch approach aligns with traditional statistics practices. 
  2. Better accuracy.   Since the batch represents the "known universe", there are many mathematical techniques which have been developed to improve model accuracy.
  3. Can be effective with smaller data sets.  Hundreds or thousands of rows can result in good ML models.  (Internally, many algorithms iterate over the data set to learn the desired characteristics and improve the results). 
Online learning takes a stream approach to learning.  Instead of processing entire batches of data, the online learning algorithm sees one row at a time from a larger stream, runs a prediction, checks the error rate and updates the model continuously.   (In a production setting, you may not have the true target value immediately, so you may need to split the predict and update phases into separate processes).
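
The predict / check / update loop can be sketched in a few lines.  This is a minimal illustration using logistic regression trained by stochastic gradient descent; the class name, the learning rate, and the synthetic `stream` generator are all assumptions made for the example, not a reference to any particular tool.

```python
import math
import random


class OnlineLogisticRegression:
    """Logistic regression updated one row at a time (plain SGD)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        z = max(-30.0, min(30.0, z))  # clamp so exp() stays well behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        error = self.predict(x) - y  # gradient of the log loss w.r.t. z
        self.bias -= self.learning_rate * error
        for i, xi in enumerate(x):
            self.weights[i] -= self.learning_rate * error * xi


def stream(n_rows, seed=0):
    """Hypothetical data stream: the label is 1 when x0 + x1 > 1."""
    rng = random.Random(seed)
    for _ in range(n_rows):
        x = [rng.random(), rng.random()]
        yield x, (1 if x[0] + x[1] > 1.0 else 0)


model = OnlineLogisticRegression(n_features=2)
hits = []
for x, y in stream(20000):
    p = model.predict(x)                 # 1. predict before seeing the answer
    hits.append((p >= 0.5) == (y == 1))  # 2. check the error
    model.update(x, y)                   # 3. update the model with this one row

recent_accuracy = sum(hits[-2000:]) / 2000.0
print(recent_accuracy)
```

Only the current row and the model's weights are ever held in memory, which is where the small RAM footprint comes from.  In a production setting where the true label arrives late, the `predict` and `update` calls would run in separate processes as noted above.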

There are some advantages and a few drawbacks to the online learning approach.   

Advantages:
  1. Big Data: Extremely large data sets are difficult to work with.  Model development and algorithm training are cumbersome.  With online learning, you can wrestle the data down to manageable sized chunks and feed it in.
  2. Small(er) RAM footprint.  Obvious benefits of using less RAM.
  3. Fast: Because they have to be.
  4. Adaptive:   As new data comes in, the learning algorithm adjusts the model and automatically adapts to the changes in the environment.  This is useful for keeping your model in sync with changes in human behavior such as click-thru behavior, financial markets, etc.  With traditional algorithms using a batch approach, the newer behavior is blended in with the older data so these subtle changes in behavior are lost.  With online learning, the model continuously moves toward the latest version of reality.
Drawbacks:
  1. It requires a lot of data.  Since the learning is done as it goes along, the model accuracy is developed over millions of rows not thousands.  (You should pre-train your model before production use, of course).
  2. Predictions are not as accurate.  You give up some accuracy in the predictive powers of the model as a trade off for the speed and size of the solution.
In Part 2, we'll point out some of the existing tools which can be used for Online Learning.  In later posts, I will attempt to relate the concept of adaptability to the field of Black Box Stock Trading systems.   Black box trading systems are notorious because they make money at first and then fail miserably when the market morphs into something totally different.

Meanwhile, here are some interesting links to learn more:

http://en.wikipedia.org/wiki/Online_machine_learning

http://www.youtube.com/watch?v=HvLJUsEc6dw

http://www.microsoft.com/en-us/showcase/details.aspx?uuid=436006d9-4cd5-44d4-b582-a4f6282846ee

Enjoy !

Monday, August 11, 2014

Naked Short Selling - An Introduction

Imagine an Asset that can be created from thin air, cost nothing to produce, can be created in unlimited quantities, can be easily sold with the push of a button for big money and the seller keeps all the money, forever.  Does this sound like an ideal way to make a lot of money?  Does this sound illegal?  As a matter of fact it is!

Welcome to the world of Naked Short Selling

Without getting into the basics or the ethics of Short Selling, I'll just say that it is a common practice in most financial markets.  In essence, the "Short Seller" is betting the price will go down (i.e. They Sell first and Buy back later).  This is usually perfectly legal although it can be risky.  (For detailed information go here:  http://en.wikipedia.org/wiki/Naked_short_selling ).    However, before you can understand the crime, first you need to understand some distinctions:

With Futures and Options: 
  1. You are buying and selling a legal Contract (with specific rights and obligations).
  2. These Contracts are created at will between the buyers and the sellers (by design).
  3. They (either the clearing house or the exchange) keep track of the Open Interest (number of Contracts outstanding).
  4. There is a time limit (expiration date), when everything needs to be settled.
  5. There is a clearing house to hold the trader legally accountable to the terms of their Contract(s).
  6. There is a mechanism to take money out or put money into your account on a nightly basis (futures) or at expiration (options).
  7. To short a Futures or Option Contract, all you need to do is push a button.  This is perfectly legal and by design.
  8. To close the position, you simply hit the Buy button or wait for expiration.
With Stocks:
  1. You are Buying and Selling an Asset.  (Representing a fraction of a corporation or a limited partnership etc.).
  2. The number of shares outstanding is controlled by the corporation.
  3. There is no time limit.  You could hold your IBM shares for 20 years.
  4. Stock trades are typically settled in three days (US).  Money changes hands and so do the stock shares.
  5. To Short a share of stock, you must first "borrow" the stock from someone else and then sell the stock.
  6. To close the position, you simply hit the Buy button (which then theoretically returns the shares to the lender).
Naked Short Selling ONLY occurs on Stocks (or other Assets), not on Futures or Options contracts.  Stocks are supposed to be limited in number.  There are mechanisms in place, in the securities market, to prevent shorting a stock which cannot be borrowed.  For the average investor, if there are no shares available to borrow, you simply cannot short it.  This ensures that the number of shares outstanding remains constant. 

However, the 'big boys' play by a different set of rules.  There are certain legal exemptions for certain market participants (market makers etc.) and there is also a lack of SEC enforcement against other market participants.  They can short a stock which they have not yet borrowed as long as they promise to deliver the borrowed shares before settlement (three days).   Sometimes they don't deliver.

Think of the implications of this practice (especially if this behavior is performed by criminals and/or psychopaths):
  1. They are creating new shares out of thin air (which are supposed to be a limited Asset)
  2. They sell them to unsuspecting buyers who think they have a real Asset in their account. 
  3. Nobody knows who did it.
  4. The Buyer may not know about the fraud for many years (or ever).   All the Buyer ever knows is that the price keeps going down.
  5. If they create and sell enough shares they can drive the stock price down to $0 (and then they never have to buy it back !). 
Over the last decade, there have been a number of government attempts to curb the practice.  Not necessarily because it is outright fraud, but because it lowers the 'confidence' in our financial markets.  They held hearings and passed some new rules.  For example, if the shares are not delivered when they are supposed to be, a report needs to be submitted to the SEC.   They have also removed some of the legal exemptions.  However, there still appears to be both exempt 'legal' naked short selling and illegal, abusive naked short selling.

Does Illegal Naked Short Selling occur today? Probably yes, but I am no expert in these things just an outside observer.   So, just out of curiosity, I picked up a few stock tickers from the NYSE website which had "Failure to Deliver" reports  (http://www1.nyse.com/regulation/memberorganizations/Threshold_Securities.shtml ) and viewed their price charts. 

Here are a few stocks which appear to be diving relentlessly into the ground: END, USU, WLT.  I cannot tell if these stocks are dropping because of their deteriorating business conditions or due to naked short selling. But it is interesting that they have all been dropping fairly consistently for three years straight.  (Keep in mind there MUST be bounces along the way in order to trick more victims into thinking the bottom is in and therefore commit to buying some/more shares).

WARNING:  I WOULD NOT BUY OR SHORT THESE STOCKS.  This is not investment advice, just a bit of education on some of the dirty tricks you need to know about in order to protect yourself.



















 

It's Been Quiet on the Blog Lately


It's been quiet around here lately, so I have decided to expand the scope of this blog beyond the basic theme of Machine Learning.   

There are a lot of interesting technology topics out there, sometimes related to machine learning and data science (but sometimes not).  In coming months, some areas which I plan to cover include:
  • Financial Markets
  • Black Box and Adaptive Trading Systems
  • Job Markets
  • Business Optimization with Data Science
  • Open Source Software
I want to continue publishing original content, rather than links to other people's interesting stuff.  (If you want links, there are some great link sites out there already).

Cheers,
Steve

Wednesday, March 5, 2014

Top Ten Reasons To "Kaggle"



Do you aspire to do Machine Learning, Data Science, or Big Data Analytics?  If so, you have probably studied, taken courses, read a bunch of blog postings and can code up some R, Python or Matlab.  

Are you ready to start solving real world problems?  Probably not.    It is one thing to know some things about data, it is a very different situation altogether to effectively solve real world problems.   So how do you improve your skills?

I highly recommend you take a look at Kaggle Competitions.  Kaggle.com hosts Data Science/Machine Learning competitions on their site.  They offer a wide range of challenging problems with a fixed deadline and with the element of competition.  Also, there is usually a modest financial incentive for the winner(s), although I am surprised at how meager most of the prizes are considering the amount of work involved and the benefit the sponsors derive from crowdsourcing their problems.  But the real benefit is not in the prize money, it is in the learning process.

Kaggle has been a tremendous learning experience to expand my depth and breadth of knowledge.  Here's why:

1. Kaggle exposes you to a wide range of Machine Learning problems: Forecasting, Sentiment Analysis, Natural Language Processing, Image Recognition, etc.    This motivates you to learn as much as you can about the problem domain, the type of data involved, and the various algorithms which might be applicable. 

2. Kaggle is under a time limit.  This "forces" you to work in a very efficient manner in developing and testing out alternative ideas quickly.   When under pressure and motivated to score highly on a competition, you will focus and learn more techniques in a very short time frame.

3. Kaggle competitions "force" you to code and recode your solution in the most resource efficient manner possible, making tradeoffs between programmer time, CPU time, RAM etc.    In order to compete, you need to discover and remove performance bottlenecks quickly.  This enables you to improve turnaround time for subsequent iterations.

4.  Each competition uses a different scoring mechanism.  You will learn about the various scoring metrics and when they are used.  You will probably code some of these yourself. 

5. You will surely learn the value of Cross-validation.  Re-sampling and retraining your model multiple times to validate that your solution is working and not overfitting the data.

6. You will learn new methods for dealing with dirty data:  Cleaning, filtering, handling missing values etc.   Sometimes the competition planners intentionally throw garbage into the data sets in order to make the challenge harder.

7. You will sometimes be handling massive file sizes, putting you to the challenge of slicing, sampling, splitting, extracting and zipping useful subsets of the data.

8.  Each competition has a forum where competitors help each other tackle the problem.  There is a really supportive atmosphere for learning and exploring in the Kaggle forums.    At the conclusion of the competition, there is a massive learning opportunity as the participants "open their kimonos" and share their best work for solving the problem.  The more intimate your knowledge of the problem, the better you will understand the thought process they went through, and the better your notes will be for the next competition.

9. You will be competing against some of the best Data Scientists in the world.  This competition brings out the best you have in yourself.  If you are mediocre in your approach, it will show in your results.  Your Kaggle Leaderboard ranking is immediate feedback on how well you have broken down and solved the problem.  You can't lie to yourself, the final leaderboard shows where you stand.

10. You will come to realize there is more to machine learning than just pushing data through a library algorithm.  If all Kaggle competitors have access to the same libraries of algorithms and tools, what differentiates the solutions?   How do you win?   You can do your best work and still find 200-300 people with higher scores on the Kaggle leaderboard.   The leaderboard scoring focuses all your energy on the primary objective:  Improving the overall score of your solution.  It can be tough.  Kaggle competitors are some of the most brilliant minds on the planet.

11.  After you have scored highly in a number of competitions (A top ten finalist and a top 10% placement) you can earn the coveted "Kaggle Master" badge.

12.  Recruiters are scouring the Kaggle boards looking for talented Data Scientists.  You could find a new position. 
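
Reasons 4 and 5 are easy to try at home.  Here is a small, standard-library-only sketch of a hand-coded scoring metric (log loss) and a k-fold splitter.  The function names and the toy labels are made up for illustration, and the "model" is just the base rate of the training fold, which gives a baseline score that any real model should beat.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood of 0/1 labels under the predictions."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip so log() never sees 0
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def k_fold_indices(n_rows, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, test
        start += size


labels = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1]  # toy data
fold_scores = []
for train, test in k_fold_indices(len(labels), k=5):
    # Baseline "model": always predict the training fold's share of 1s.
    base_rate = sum(labels[i] for i in train) / float(len(train))
    fold_scores.append(log_loss([labels[i] for i in test], [base_rate] * len(test)))

print(fold_scores)
print(sum(fold_scores) / len(fold_scores))
```

If the fold scores swing widely from one fold to the next, that is a hint the result depends on which rows happened to land in the test set.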


For all of these reasons, Kaggle works to bring out the best talent within you.     If you really want to become an expert in Data Science and Machine Learning, you should consider Kaggle competitions.

(Yes, I know that was actually twelve reasons, but Ten makes a better headline  :-)


Here is a screen shot of the Loan Default Prediction Leaderboard