Last weekend I participated in the Data Science London group's hackathon.
The challenge was to take some data provided by EMI and use it to build a
recommender system that could predict how much a user would like a track
based on previous ratings, demographic data and some interview responses.
When I arrived at the event I grouped up with some guys from a company called
Musicmetric. The team eventually split into two groups: a guy called Ben and I worked
on the recommender system problem, while the rest of the Musicmetric team worked on
building visualisations with the data.
The hackathon officially started at 1pm on Saturday London time, and went on until 1pm the
next day. I was one of a small group of people that survived the entire 24 hours, with most of
the participants going home late on Saturday evening or early on Sunday morning. Excellent food was
provided, which allowed us to focus entirely on the problem. As a tea
drinker I was slightly disappointed by the quality of the tea, but everything else was really good.
The hackathon took place in The HUB Westminster, which is a really nice work space. It is light and airy
and there were even some rooms left intentionally dark for crashing in (I slept on a beanbag for about 2
hours, and would recommend that if you go to a future hackathon you take a thermarest/camping mattress).
The problem was hosted on the Kaggle platform, which provides training and test data,
takes your predictions on the test data, and evaluates them behind the scenes to give you a score. You can
see the scores of all the other participants, and within seconds of the competition starting a solution
had been posted that was very good. This was probably due to the data set being released before the
competition started, and someone training a really strong classifier ahead of time, testing it in cross
validation and then running their solution against the data and submitting. The evaluation
criterion for the problem was RMSE (root mean squared error), which meant we had to focus on
minimising the overall distance between our predictions and the correct answers, as opposed
to the number of instances we got exactly right.
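Concretely, RMSE is the square root of the mean squared difference between predictions and true ratings, so a few large misses hurt more than many small ones:

```python
import math

def rmse(predictions, targets):
    """Root mean squared error between predicted and true ratings."""
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Squaring penalises large individual errors: one miss by 15 points costs
# more than three misses by 5 points each.
print(rmse([55, 65, 75], [50, 60, 70]))  # three errors of 5 -> RMSE 5.0
print(rmse([65, 60, 70], [50, 60, 70]))  # one error of 15 -> RMSE ~8.66
```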
Our first solution was to apply simple collaborative filtering to the problem. This seemed
like an obvious approach because we were trying to build a recommendation system given
a bunch of input (user, item, rating) triples and a bunch of (user, item) pairs to predict. The RMSE of
this approach in cross validation was about 22 (out of 100), with a result of roughly 18 on the actual test data.
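In outline, simple collaborative filtering predicts a missing (user, track) rating from the ratings of similar users. This is a minimal user-based sketch of the idea, not our exact implementation; the fallback to the item's mean rating and the 0-100 scale midpoint are assumptions:

```python
from collections import defaultdict
import math

def build_index(triples):
    """Index (user, item, rating) triples by user and by item."""
    by_user, by_item = defaultdict(dict), defaultdict(dict)
    for user, item, rating in triples:
        by_user[user][item] = rating
        by_item[item][user] = rating
    return by_user, by_item

def similarity(a, b):
    """Cosine similarity between two users, over the items both rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    na = math.sqrt(sum(a[i] ** 2 for i in shared))
    nb = math.sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (na * nb) if na and nb else 0.0

def predict(user, item, by_user, by_item):
    """Similarity-weighted average of other users' ratings for the item,
    falling back to the item's mean rating (or the scale midpoint)."""
    weighted = [(similarity(by_user[user], by_user[u]), r)
                for u, r in by_item[item].items() if u != user]
    weighted = [(s, r) for s, r in weighted if s > 0]
    if weighted:
        return sum(s * r for s, r in weighted) / sum(s for s, _ in weighted)
    ratings = list(by_item[item].values())
    return sum(ratings) / len(ratings) if ratings else 50.0
```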
We were given a lot of demographic information for each of the users, and it seemed to make sense to
attempt to break our approach down by demographic bins. Trying various combinations of the demographic
information we were given, however, yielded no gain in cross validation or against the actual testing data.
After racking our brains for a while we came up with the idea of using a random forest ensemble method to
solve the problem: shoving all the demographic, interview response and other data in and having the forest
classify in a brute force manner. This solution was implemented with roughly 2 hours to go until the end of
the competition. Knowing we did not have long to run our solutions, we started with a very rough and ready
approach and jumped several places in the rankings. Excited, we started running a number of different random forest solutions
with different parameters to try to find which parameter gave us the best jump. After determining that tree
height was going to give the best results we set two classifiers running with different tree heights, one on each of our laptops.
They both finished and we submitted them with a minute and 20 seconds to go until the end of the competition. We jumped
all the way to third place, which was really exciting. The person who won the competition used the exact same approach as we
did, but had been running it since the start of the competition, which suggests that we may well have been able to win
if we had had more time to fiddle with the solution parameters.
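The brute-force forest approach can be sketched along these lines, here using scikit-learn's RandomForestRegressor on synthetic stand-in data; the library choice, feature layout and parameters are illustrative, not our exact pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for the EMI data: each row bundles demographic, interview and
# rating-history columns into one numeric feature vector (illustrative).
X = rng.uniform(0, 1, size=(500, 8))
y = 100 * X[:, 0]  # ratings on a 0-100 scale, driven by one feature here

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Sweep the maximum tree depth, as we did with "tree height" near the
# deadline; None lets each tree grow until its leaves are pure.
results = {}
for depth in (3, 6, None):
    model = RandomForestRegressor(n_estimators=100, max_depth=depth,
                                  random_state=0, n_jobs=-1)
    model.fit(X_train, y_train)
    rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
    results[depth] = rmse
    print(f"max_depth={depth}: RMSE={rmse:.2f}")
```

On this synthetic data the deeper forests fit the rating signal more closely, which mirrors why tree height was the parameter worth sweeping.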
Thoughts about the data
The data we were provided with by EMI contained a lot of information. We found, however, that the demographic
information did not improve our classification accuracy at all. There are a couple of conclusions we could
draw from this. The first is that music taste is not affected by age, gender, region or any of the other
information we were provided with. I'm not sure I believe that 94 year old males have the same
listening tastes as 16 year old females, so I'm going to reject this conclusion.
The more likely conclusion is that there wasn't enough data
provided for demographic information to help. Every time you split by demographic you reduce the size
of your training and validation sets. This means that the accuracy of the individual classifiers is
reduced, and as such the accuracy of the overall classifier across all the bins is also reduced. Given a couple
of orders of magnitude more data, it might well have been the case that we could have produced an accurate
classifier based on demographic information.
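A back-of-the-envelope count shows how quickly binning thins the data; the bin structure and row count here are made up for illustration, not the actual EMI fields:

```python
# Hypothetical demographic bins -- illustrative, not the actual EMI fields.
genders = ["male", "female"]
age_bands = ["<18", "18-24", "25-34", "35-49", "50+"]
regions = ["north", "south", "east", "west"]

total_rows = 100_000
n_bins = len(genders) * len(age_bands) * len(regions)

# Even with perfectly uniform users, each (gender, age, region) bin sees
# only a small slice of the training data.
rows_per_bin = total_rows / n_bins
print(f"{n_bins} bins, ~{rows_per_bin:.0f} rows each")  # 40 bins, ~2500 rows each
```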
I had a great time at the Data Science Hackathon, and I would very much like to participate in another one in the future.
There were prizes, free t-shirts, free good food and really excellent people who understand a lot more about
machine learning and data mining than I do. I'm really really glad that I went. I'd like to give a special shout out
to Ben for being an awesome teammate, Greg for being supportive overnight when I began to burn out and Carlos for running
things and just being a generally awesome dude.