Wednesday, March 5, 2014

Top Ten Reasons To "Kaggle"



Do you aspire to do Machine Learning, Data Science, or Big Data Analytics?  If so, you have probably studied, taken courses, read a bunch of blog postings, and can code up some R, Python, or Matlab.

Are you ready to start solving real-world problems?  Probably not.  It is one thing to know some things about data; it is a very different situation altogether to effectively solve real-world problems.  So how do you improve your skills?

I highly recommend you take a look at Kaggle Competitions.  Kaggle.com hosts Data Science/Machine Learning competitions on their site.  They offer a wide range of challenging problems with a fixed deadline and the element of competition.  Also, there is usually a modest financial incentive for the winner(s), although I am surprised at how meager most of the prizes are considering the amount of work involved and the benefit the sponsors derive from crowdsourcing their problems.  But the real benefit is not the prize money; it is the learning process.

Kaggle has been a tremendous learning experience to expand my depth and breadth of knowledge.  Here's why:

1. Kaggle exposes you to a wide range of Machine Learning problems: Forecasting, Sentiment Analysis, Natural Language Processing, Image Recognition, etc.  This motivates you to learn as much as you can about the problem domain, the type of data involved, and the various algorithms which might be applicable.

2. Kaggle competitions have a fixed time limit.  This "forces" you to work efficiently, developing and testing alternative ideas quickly.  When under pressure and motivated to score highly, you will focus and learn more techniques in a very short time frame.

3. Kaggle competitions "force" you to code and recode your solution in the most resource-efficient manner possible, making tradeoffs between programmer time, CPU time, RAM, etc.  In order to compete, you need to discover and remove performance bottlenecks quickly.  This enables you to improve turnaround time for subsequent iterations.
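For example, Python's built-in cProfile module will point you at the hot spot in seconds.  A minimal sketch (the pairwise-sum functions are made up for illustration):

```python
import cProfile
import io
import pstats

def pairwise_sums(xs):
    # Deliberately O(n^2) hot spot: sums every ordered pair of values
    total = 0.0
    for i in range(len(xs)):
        for j in range(len(xs)):
            total += xs[i] + xs[j]
    return total

def pairwise_sums_fast(xs):
    # Same answer in O(n): each element appears in 2*n ordered pairs
    return 2.0 * len(xs) * sum(xs)

data = [float(i) for i in range(300)]

profiler = cProfile.Profile()
profiler.enable()
slow = pairwise_sums(data)
profiler.disable()

report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("tottime").print_stats(2)
print(report.getvalue())  # pairwise_sums dominates the profile

# Once the bottleneck is visible, replace it and verify the answer matches
assert abs(slow - pairwise_sums_fast(data)) < 1e-6 * slow
```

Profiling first and rewriting only the proven bottleneck is what keeps iteration turnaround short.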

4.  Each competition uses a different scoring mechanism.  You will learn about the various scoring metrics and when they are used.  You will probably code some of these yourself.
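Two of the most common competition metrics, RMSE and logarithmic loss, take only a few lines each to code by hand (a minimal sketch):

```python
import math

def rmse(y_true, y_pred):
    # Root Mean Squared Error: the usual regression metric
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def log_loss(y_true, y_prob, eps=1e-15):
    # Logarithmic loss for binary classifiers; clipping avoids log(0)
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return -total / len(y_true)

print(rmse([3.0, 5.0], [2.0, 6.0]))   # 1.0
print(log_loss([1, 0], [0.9, 0.1]))   # ~0.105
```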

5. You will surely learn the value of Cross-Validation: re-sampling and retraining your model multiple times to validate that your solution generalizes and is not overfitting the data.
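The index bookkeeping behind k-fold cross-validation is simple enough to sketch in pure Python (in practice a library routine such as scikit-learn's KFold does this for you):

```python
def k_fold_indices(n, k):
    # Yield (train_indices, validation_indices) for k roughly equal folds
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, valid
        start += size

# Each of the 10 samples lands in the validation set exactly once
for train_idx, valid_idx in k_fold_indices(10, 5):
    print(len(train_idx), len(valid_idx))  # 8 2
```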

6. You will learn new methods for dealing with dirty data: cleaning, filtering, handling missing values, etc.  Sometimes the competition planners intentionally throw garbage into the data sets in order to make the challenge harder.
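A typical cleanup pass in Pandas might look like this (a sketch; the column names and thresholds are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical messy data: a sensor column with gaps and an outlier
df = pd.DataFrame({
    "sensor": [1.0, np.nan, 250.0, 3.5],
    "label":  ["a", "b", None, "b"],
})

df["sensor"] = df["sensor"].fillna(df["sensor"].median())  # impute missing values
df = df[df["sensor"] < 100.0]                              # filter implausible readings
df = df.dropna(subset=["label"])                           # drop rows missing the target
print(df)
```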

7. You will sometimes be handling massive files, which challenges you to slice, sample, split, extract, and zip useful subsets of the data.
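Pandas can stream a file too big for RAM in chunks.  Here is a sketch that computes a column mean without ever loading the whole file (the file and column names are made up):

```python
import csv

import pandas as pd

# Write a small CSV standing in for a multi-gigabyte competition file
with open("big.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["device", "x"])
    for i in range(10000):
        writer.writerow([i % 10, i * 0.5])

# Stream the file in 2000-row chunks instead of reading it all at once
total, rows = 0.0, 0
for chunk in pd.read_csv("big.csv", chunksize=2000):
    total += chunk["x"].sum()
    rows += len(chunk)

print(rows, total / rows)  # overall mean without holding the full file in memory
```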

8.  Each competition has a forum where competitors help each other tackle the problem.  There is a really supportive atmosphere for learning and exploring in the Kaggle forums.  At the conclusion of a competition there is a massive learning opportunity, as the top participants "open their kimonos" and share their best work.  The more intimate knowledge you have of the problem, the better you will understand the thought processes they went through, and you will take notes for the next competition.

9. You will be competing against some of the best Data Scientists in the world.  This competition brings out the best in you.  If you are mediocre in your approach, it will show in your results.  Your Kaggle Leaderboard ranking is immediate feedback on how well you have broken down and solved the problem.  You can't lie to yourself; the final leaderboard shows where you stand.

10. You will come to realize there is more to machine learning than just pushing data through a library algorithm.  If all Kaggle competitors have access to the same libraries of algorithms and tools, what differentiates the solutions?  How do you win?  You can do your best work and still find 200-300 people with higher scores on the Kaggle leaderboard.  The leaderboard scoring focuses all your energy on the primary objective: improving the overall score of your solution.  It can be tough.  Kaggle competitors are some of the most brilliant minds on the planet.

11.  After you have scored highly in a number of competitions (a top-ten finish and a top-10% placement), you can earn the coveted "Kaggle Master" badge.

12.  Recruiters are scouring the Kaggle boards looking for talented Data Scientists.  You could find a new position. 


For all of these reasons, Kaggle works to bring out the best talent within you.  If you really want to become an expert in Data Science and Machine Learning, you should consider Kaggle competitions.

(Yes, I know that was actually twelve reasons, but Ten makes a better headline. :-)


Here is a screenshot of the Loan Default Prediction leaderboard:






Tuesday, January 28, 2014

Machine Learning Skills Pyramid V1.0


While the exact definition of "Data Scientist" continues to elude us, the job requirements seem to lean heavily on machine learning skills.  They also include a wide range of other skills, from specific languages, frameworks, and databases to data cleaning, web scraping, visualizations, mathematical modeling, and subject matter expertise.  (This breakdown will be the subject of a future post, as I was having some trouble with my web scraper ;))

So for the typical "Data Scientist" role, many organizations want PhD-level academic training plus an assortment of nuts-and-bolts programming or database skills.  Most of these job requirements are a rich and complex mix of "can't find the right candidate" (a.k.a. the Unicorn).  So, as an extension to the Data Science Venn Diagram V2.0, I thought it would be helpful to try to clarify and make some important distinctions regarding Machine Learning skills.

Back in the 2002-2003 time frame, I spent a bunch of time trying to code my own Neural Networks.  This was a very frustrating experience, because bugs in these algorithms can be especially difficult to find, and it took time away from what I really wanted to do, which was building applications using machine learning.  So I decided back then to use well-tested and fully debugged library algorithms over clunky home-grown algorithms whenever possible.  These days there are so many powerful and well-tested ML libraries, why would anyone write one from scratch?  The answer: sometimes a new algorithm is needed.
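To show how little code a library algorithm costs you, here is a sketch using today's scikit-learn API on synthetic data (not one of my competition solutions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# A well-tested library algorithm: no hand-rolled trees, no debugging backprop
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)
print("holdout accuracy:", clf.score(X_te, y_te))
```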

First, some definitions will help clarify:
  1. ML Algorithm: A well-defined, mathematically based tool for learning from inputs, typically found in ML libraries.  Take the example of sorting algorithms: BubbleSort, HeapSort, InsertionSort, etc.  As a software developer, you do not want or need to create a new type of sort.  You should know which works best for your situation and use it.  The same applies to Machine Learning: Random Forests, Support Vector Machines, Logistic Regression, Backprop Neural Networks, etc., are all algorithms which are well known, have certain strengths and limitations, and are available in many ML libraries and languages.  These are a bit more complicated than sorting, so more skill is required to use them effectively.
  2. ML Solution:  An application which uses one or more ML Algorithms to solve a business problem for an organization (business, government etc).
  3. ML Researcher/Scientist: PhDs are at the top of the heap.  They have been trained to work on leading-edge problems in Machine Learning, Robotics, etc.  These skills are hard won and are well suited for tackling problems with no known solution.  When you have a new class of problem which requires insight and new mathematics to solve, you need an ML Researcher.  When they solve a problem, a new ML Algorithm will likely emerge.
  4. ML Engineer: A sharp software engineer with experience building ML Solutions (or solving Kaggle problems).  The ML Engineer's skills are different from the ML Researcher's: less abstract mathematics and more programming, database, and business acumen.  An ML Engineer analyzes the data available, the organizational objectives, and the ML Algorithms known to work on this type of problem and this type of data.  You can't just feed any data into any ML Algorithm and expect a good result.  Specialized skills are required to create high-scoring ML solutions, including data analysis, algorithm selection, feature engineering, cross-validation, appropriate scoring, and troubleshooting the solution.
  5. Data Engineer: A software engineer with platform- and language-specific skills.  The Data Engineer is a vital part of the ML Solution team; this person or group does the heavy lifting when it comes to building data-driven systems.  There are so many languages, databases, scripting tools, and operating systems, each with its own set of quirks, secret incantations, and performance gotchas.  A Data Engineer needs to know a broad set of tools and be effective in getting the data extracted, scraped, cleaned, joined, merged, and sliced for input to the ML Solution.  Many of the skills needed to manage Big Data belong in the Data Engineer category.
With that, I give you the Machine Learning Skills Pyramid V1.0:

(Click Image to Enlarge)

Sunday, January 26, 2014

Stock Forecasting with Machine Learning



Almost everyone would love to predict the Stock Market for obvious reasons.    People have tried everything from Fundamental Analysis, Technical Analysis, and Sentiment Analysis to Moon Phases, Solar Storms and Astrology.

However, unless you are in a position to front run other people's trades, like High Frequency Trading,  there is no such thing as a guaranteed profit in the markets.  The problem with human stock analysis is that there is so much data and so many variables that it is easy for the average human to become overwhelmed, get sucked down the rabbit hole and continue to make sub-optimal choices. 

Sounds like a job for Machine Learning, and there is no shortage of people and companies trying this as well.  One major pitfall is that most ML algorithms do not work well with stock market type data.  This results in a lot of people wasting a lot of time.  But in order to share some of the concepts and get the conversation started, I am posting some of my findings regarding Financial and Stock Forecasting using Machine Learning.

I trained 8000 machine learning algorithms to develop a probabilistic future map of the stock market in the short term (5-30 days) and have compiled a list of the stocks most likely to bounce in this time frame.  There is no single future prediction.  Instead there is a large set of future probabilities which someone can use to evaluate their game plan and portfolio.   My exact methods remain proprietary at this time (but might consider institutional licensing).

Here are the "Stock Picks" based on how they closed on Friday (Jan 24, 2014) based on the stock's individual trading behavior:

GE - General Electric
GM - General Motors
HON - Honeywell
DIS - Disney
MET - MetLife
NKE - Nike
OXY - Occidental Petroleum
BK- Bank of New York Mellon
EMR - Emerson Electric
TWX - Time Warner Inc.
FCX - Freeport McMoran Copper and Gold

Disclaimer:  This is not trading or investing advice. It is simply the output of my ML system.  If you lose money, do not come crying. Trade at your own risk!

Since the market got pummeled this week, there are a lot of stocks that look like 'buys' right now.  But the overall (US) market is coming off a very prolonged euphoric period and has not had a significant correction in over two years.  So, the current downswing is either a minor pullback, a.k.a. a "dip", or the start of a major correction.

Here are the charts.  For the most part they look like a big sell-off in a larger uptrend.  It is always interesting to see how the future unfolds, especially with respect to these predictions.  Also keep in mind that even if a stock does bounce, it could then run out of steam and drop again.  Ah...life in the uncertainty zone ;).

Enjoy!


Monday, January 6, 2014

Data Science Venn Diagram v2.0


There have been a number of attempts to get our collective brains around all the skill sets needed to effectively do Data Science.

Here are two...

1.  http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
2.  http://upload.wikimedia.org/wikipedia/commons/4/44/DataScienceDisciplines.png


Below is my take on the subject.  The center is marked "Unicorn".  This is a reference to the recent discussions in the press and blogosphere indicating that Data Scientists are as hard to find as unicorns.  Finally, the mindset is changing: a team of people with complementary skills is the best course of action for most data-driven organizations.  Certainly some individuals may possess Computer Science, Statistics, and Subject Matter Expertise; they are just very hard to find.  Many Data Scientist job descriptions don't reflect this reality, and so these positions go unfilled for six months or more.

Let me know what you think...

CLICK TO ENLARGE



This is an Adaptation of the original Data Science Venn Diagram which is licensed under Creative Commons Attribution-NonCommercial.

Lightning Fast Python?!?

This is how I reduced my data crunching process time from 12 hours down to only 20 minutes using Python.

Feature generation is the process of taking raw data and boiling it down
to a "feature matrix" that machine learning algorithms typically require.
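As a small sketch of what that means for accelerometer data (the sample values are made up): collapse each device's stream of raw readings into one row of per-axis summary statistics.

```python
import math
from collections import defaultdict

# Toy raw rows: (device_id, x, y, z) accelerometer samples
raw = [
    (1, 0.1, 0.0, 9.8), (1, 0.2, -0.1, 9.7), (1, 0.0, 0.1, 9.9),
    (2, 1.5, 0.5, 8.0), (2, 1.4, 0.6, 8.1),
]

def feature_row(samples):
    # Boil one device's samples down to [mean, std] per axis
    n = len(samples)
    feats = []
    for axis in range(3):
        vals = [s[axis] for s in samples]
        mean = sum(vals) / n
        var = sum((v - mean) ** 2 for v in vals) / n
        feats += [mean, math.sqrt(var)]
    return feats

by_device = defaultdict(list)
for device, x, y, z in raw:
    by_device[device].append((x, y, z))

feature_matrix = {d: feature_row(s) for d, s in by_device.items()}
print(feature_matrix[1])  # [mean_x, std_x, mean_y, std_y, mean_z, std_z]
```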

In Kaggle's Biometric Accelerometer Competition
(http://www.kaggle.com/c/accelerometer-biometric-competition), the train
data was 29.5 million rows and the test data was just over 27 million
rows, bringing the total raw data to about 56,500,000 rows of smartphone
accelerometer readings.

My initial feature generation code used the standard approach to machine
learning in Python: Pandas, Scikit-Learn, etc. It was taking about 12
hours (train and test). Ugh.

I went searching for a faster solution.  What were my options?

1. Dust off my rusty C language skills.
2. Learn another language: Julia, which is supposed to be very fast
(still on my todo list).
3. Try Cython, a form of Python that "sort of" compiles to C.
4. What else was there?

The Answer was PyPy. http://www.pypy.org/

PyPy got its start as a version of Python written in Python. At first,
this seemed interesting for compiler people but not what I needed. Then
I learned that the PyPy team has been putting a lot of effort into their
JIT compiler. A Just-In-Time (JIT) compiler translates your code into
machine language the first time it runs; after that, it executes at
machine speed. The result is blazingly fast Python! See
http://speed.pypy.org/
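The code that benefits most is exactly the pure-Python inner loop that CPython grinds through one bytecode at a time. A sketch (absolute timings will vary by machine and interpreter, so I only print them):

```python
import time

def dot(a, b):
    # Tight numeric loop: the kind of code a tracing JIT compiles
    # down to machine code after a brief warm-up
    total = 0.0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total

a = [float(i) for i in range(100000)]
b = [2.0] * 100000

start = time.time()
result = dot(a, b)
print(result, "computed in", time.time() - start, "seconds")
```

Run the same script under CPython and under PyPy and compare the printed times yourself.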

There is a drawback: many Machine Learning libraries do not run on it.
I had to remove all Pandas, NumPy, and Scikit-Learn code. So I broke my
problem into two steps: feature generation in PyPy, and Machine Learning
in Python/Pandas/SciKit. After that I was slicing and dicing
accelerometer readings like crazy. More importantly, I was iterating my
solution faster, allowing me to finish 26th out of 633 teams (top 4%)!

Hopefully, over time, more ML libraries will be ported to PyPy (I think
NumPy support is in the works). For now, here is a list of packages
which are known to work or not work with PyPy:
https://bitbucket.org/pypy/compatibility/wiki/Home

Below is a code snippet for those who want to try it. What you need to
run this code:

1. Install PyPy.
2. Change the file permissions to allow execution.
3. Run it from the command line: ./gen_features.py



#!/usr/bin/pypy
import os
import csv

# Parse a numeric string into a float or an int
parseStr = lambda x: float(x) if '.' in x else int(x)

allData = []

def feature_gen():
    global allData
    # do something here
    return

os.nice(9)  # lower our priority so the machine stays usable

f = open('train.csv')
reader = csv.reader(f)
header = reader.next()  # strip off the csv header line
for row in reader:
    convertedRow = []
    for token in row:
        try:
            newTok = parseStr(token)
        except ValueError:
            print token  # show the offending token before bailing out
            raise
        convertedRow.append(newTok)  # keep every converted token
    allData.append(convertedRow)  # keep every converted row
f.close()

feature_gen()



An Easy Way to Bridge Between Python and Vowpal Wabbit



Python is a great programming language.  It has a clean syntax, tremendous user community support, and excellent machine learning libraries.  Unfortunately, it is SLOW!  So, when the situation calls for it, I prefer to drop down to machine code to run the actual machine learning algorithm.

One fast and amazing Machine Learning tool that I have used on a number of projects is Vowpal Wabbit.  It was developed by researchers at Yahoo! Research and later at Microsoft Research.  It has support for many types of learning problems, automatically consumes/vectorizes text, can do recommendations, predictions, classifications, (single and multi-class), supports namespaces, instance weighting, and the list goes on.

VW Homepage: http://hunch.net/~vw/
VW Wiki:  https://github.com/JohnLangford/vowpal_wabbit/wiki
(The Wiki is better for finding all the functions and how to use it)

There are also a few Python wrappers for Vowpal Wabbit:
1. Vowpal Porpoise:  https://github.com/josephreisinger/vowpal_porpoise
2. PyVowpal:  https://github.com/shilad/PyVowpal

The problem with wrappers is that they don't always expose all the features you want to use, and Vowpal has a lot of features.  So, after a bit of hemming and hawing, I did a "slash and burn" and wrote what I needed.  This is how I currently use Vowpal Wabbit with Python.  Instead of a wrapper, I offer you code snippets which can be tailored to your specific needs.

This code assumes you know how to use Python and Pandas.  It runs on Linux and uses the matrix factorization feature (recommendation engine) of Vowpal.

Performance:  With over 43 million rows, it took about 16 minutes to generate the inputs in the Pandas DataFrame, but only 9 minutes to train with 20 passes.  (i7-2600K)

Enjoy!

Steve Geringer



##########################################################################
#  Here are the essential ingredients.  You'll have to fill in the rest...;)
##########################################################################

import os
from time import asctime, time
import subprocess
import csv
import numpy as np
import pandas as pd
.
.
.
#############################################################
# Parameters and Globals
#############################################################
environmentDict=dict(os.environ, LD_LIBRARY_PATH='/usr/local/lib')   
# Hat Tip to shrikant-sharat for this secret incantation
# Note: only needed if you rebuilt vowpal and the new libvw.so is in /usr/local/lib

parseStr = lambda x: float(x) if '.' in x else int(x)


#############################################################
# Vowpal Wabbit commands
#############################################################
"""
WARNING:  MAKE SURE THERE ARE NO EXTRA SPACES IN THESE COMMAND STRINGS...IT GIVES A BOOST::MULTIPLE OPTIONS ERROR
"""
trainCommand = ("vw --rank 3 -q ui --loss_function=squared --l2 0.001 \
--learning_rate 0.015 --passes 20 --decay_learning_rate 0.97 --power_t 0 \
-d train_vw.data --cache_file vw.cache -f vw.model -b 20").split(' ')        

predictCommand = ("vw -t -d test_vw.data -i vw.model -p vw.predict").split(' ')

.
.
.

#############################################################
# Generate the VW Train/Test data format in a Pandas DataFrame using the apply method
#############################################################
def genTrainInstances(aRow): 
    userid = str(aRow['userid'])
    urlid = str(aRow['urlid'])
    y_row = str(int(float(aRow['rating']))  )
    rowtag = userid+'_'+urlid
    rowText = (y_row + " 1.0  " + rowtag + "|user " + userid +" |item " +urlid) 
    return  rowText
   
def genTestInstances(aRow): 
    y_row = str(0)
    userid = str(aRow['userid'])
    urlid = str(aRow['urlid'])
    rowtag = userid+'_'+urlid
    rowText = (y_row + " 1.0  " + rowtag + "|user " + userid +" |item " +urlid)
    return  rowText
.
.
.
#############################################################
# Function to read the VW predict file, strip off the desired value and return a vector with results
#############################################################
def readPredictFile():
    y_pred = []
    with open('vw.predict', 'rb') as csvfile:
        predictions = csv.reader(csvfile, delimiter=' ', quotechar='|')
        for row in predictions:
            pred = parseStr(row[0])
            y_pred.append(pred)
    return np.asarray(y_pred) 
.
.
.
#############################################################
# Function to train a VW model using DataFrame called df_train
# - Apply genTrainInstances
# - Write newly create column to flat file
# - Invoke Vowpal Wabbit for training
#############################################################
def train_model():
    global df_train, trainCommand, environmentDict

    print "Generating VW Training Instances: ", asctime()
    df_train['TrainInstances'] = df_train.apply(genTrainInstances, axis=1)
    print "Finished Generating Train Instances: ", asctime()

    print "Writing Train Instances To File: ", asctime()
    trainInstances = list(df_train['TrainInstances'].values)
    f = open('train_vw.data','w')
    f.writelines(["%s\n" % row  for row in trainInstances])
    f.close()
    print "Finished Writing Train Instances: ", asctime()

    subprocess.call(trainCommand, env=environmentDict)
    print "Finished Training: ", asctime()     
    return
.
.
.
#############################################################
# Function to test a VW model using DataFrame df_test
# - Apply genTestInstances
# - Write new column to flat file
# - Invoke Vowpal Wabbit for prediction
#############################################################
def predict_model():
    global environmentDict, predictCommand, df_test

    print "Building Test Instances: ", asctime()
    df_test['TestInstances'] = df_test.apply(genTestInstances, axis=1)
    print "Finished Generating Test Instances: ", asctime()

    print "Writing Test Instances: ", asctime()
    testInstances = list(df_test['TestInstances'].values)
    f = open('test_vw.data','w')
    f.writelines(["%s\n" % row  for row in testInstances])
    f.close()
    print "Finished Writing Test Instances: ", asctime()

    subprocess.call(predictCommand, env=environmentDict)

    df_test['y_pred'] = readPredictFile()
    return



Welcome

Hello and Welcome!

I am a software consultant and have been involved with Machine Learning since
2002. A friend of mine and fellow Machine Learning enthusiast, Rohit
Sivaprasad of http://www.DataTau.com suggested I start a blog to share some of my
ideas and tips with the data science community.

Here you go !

Steve Geringer