Monday, January 6, 2014

Lightning Fast Python?!?

This is how I cut my data-crunching time from 12 hours down to only 20 minutes using Python.

Feature generation is the process of taking raw data and boiling it down
to a "feature matrix" that machine learning algorithms typically require.
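To make the idea concrete, here is a minimal sketch of feature generation in plain Python. The row layout (device id plus x, y, z readings), the sample values, and the `build_feature_matrix` helper are all assumptions made up for illustration; they are not the competition's schema or the features I actually used.

```python
import math
from collections import defaultdict

# Hypothetical raw readings: (device_id, x, y, z). This layout is an
# assumption for illustration, not the competition's actual schema.
raw_rows = [
    ("dev_a", 0.0, 3.0, 4.0),
    ("dev_a", 0.0, 6.0, 8.0),
    ("dev_b", 1.0, 2.0, 2.0),
]

def build_feature_matrix(rows):
    """Collapse many raw readings into one feature row per device."""
    magnitudes = defaultdict(list)
    for device_id, x, y, z in rows:
        # Magnitude of the acceleration vector for this reading.
        magnitudes[device_id].append(math.sqrt(x * x + y * y + z * z))

    matrix = []
    for device_id in sorted(magnitudes):
        values = magnitudes[device_id]
        mean = sum(values) / len(values)
        # One row per device: id, count, mean, min, max of the magnitudes.
        matrix.append([device_id, len(values), mean, min(values), max(values)])
    return matrix

feature_matrix = build_feature_matrix(raw_rows)
for row in feature_matrix:
    print(row)
```

Millions of raw rows go in, and a much smaller table with one row per device comes out. That smaller table is what the learning algorithm sees.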

In Kaggle's Biometric Accelerometer Competition, the training
data was 29.5 million rows and the test data was just over 27 million
rows, bringing the total raw data to about 56,500,000 rows of smartphone
accelerometer readings.

My initial feature generation code used the standard approach to machine
learning in Python: Pandas, Scikit-Learn, etc. It was taking about 12
hours (train and test). Ugh.

I went searching for a faster solution. What were my options?

1. Rusty C language skills
2. Learn another language: Julia, which is supposed to be very fast
(still on my todo list).
3. Try Cython, a form of Python that "sort of" compiles to C.
4. What else was there???

The answer was PyPy.

PyPy got its start as a version of Python written in Python. At first,
this seemed kind of interesting for compiler people but not what I
needed. Then I learned that the PyPy team has been putting a lot of
effort into their JIT compiler. A Just-In-Time (JIT) compiler translates
the frequently executed parts of your code into machine code while the
program runs. After that, those parts run at close to machine speed.
The result is blazingly fast Python!
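If you want to see the effect for yourself, a tight pure-Python loop is the kind of code a JIT tends to help most. The sketch below is not from my competition code; `sum_of_squares` and the loop size are made up for illustration. Run the same file once with your regular Python and once with PyPy and compare the elapsed times. Timings vary by machine, so none are quoted here.

```python
import time

def sum_of_squares(n):
    """A tight pure-Python loop with no library calls."""
    total = 0
    for i in range(n):
        total += i * i
    return total

start = time.time()
result = sum_of_squares(1000000)
elapsed = time.time() - start

# The result is identical under either interpreter; only the time differs.
print("result=%d elapsed=%.3fs" % (result, elapsed))
```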

There is a drawback: many machine learning libraries do not run on it.
I had to remove all Pandas, NumPy, and Scikit-Learn code. So I broke my
problem into two steps: feature generation in PyPy and machine learning
in Python/Pandas/Scikit-Learn. After that I was slicing and dicing
accelerometer readings like crazy. More importantly, I was iterating on
my solution faster, which allowed me to finish 26th out of 633 teams
(top 4%)!
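One simple way to connect the two steps is a plain CSV file: the PyPy script writes the feature matrix, and the regular Python script reads it back. The sketch below assumes Python 3 syntax, and the column names, values, and file name are hypothetical. The read-back half uses the csv module only to keep the example free of dependencies; in the second step you would more likely load the file with Pandas.

```python
import csv
import os
import tempfile

# Hypothetical feature rows produced by the feature-generation step.
header = ["device_id", "n_readings", "mean_magnitude"]
features = [["dev_a", 2, 7.5], ["dev_b", 1, 3.0]]

# Step 1 (run under PyPy): write the feature matrix to disk.
out_path = os.path.join(tempfile.mkdtemp(), "features.csv")
with open(out_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(features)

# Step 2 (run under regular Python): load the file for modelling.
# The csv module stands in here for whatever loader the ML step uses.
with open(out_path, newline="") as f:
    reader = csv.reader(f)
    loaded_header = next(reader)
    # CSV stores everything as text, so convert the numeric columns back.
    loaded = [[row[0], int(row[1]), float(row[2])] for row in reader]

print(loaded_header)
print(loaded)
```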

Hopefully over time, more ML libraries will be ported to PyPy (I think
NumPy is working on it). For now, here is a list of packages which are
either known to work or known not to work with PyPy

Below is a code snippet for those who want to try it. What you need to
run this code:

1. Install PyPy
2. Change file permissions to allow execution
3. Run it from the command line: ./

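#!/usr/bin/env pypy
# The shebang above and the sys.argv input path below are assumptions;
# the original snippet lost the lines that opened the file.
import os
import sys
import csv

# Parse a CSV token as a float if it has a decimal point, else as an int.
parseStr = lambda x: float(x) if '.' in x else int(x)

allData = []

def feature_gen():
    global allData
    # do something here
    pass

with open(sys.argv[1]) as f:
    reader = csv.reader(f)
    header = next(reader)  # strip off the csv header line
    for row in reader:
        convertedRow = []
        for token in row:
            try:
                newTok = parseStr(token)
            except ValueError:
                print(token)  # show the offending token, then re-raise
                raise
            convertedRow.append(newTok)
        allData.append(convertedRow)

feature_gen()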

