Music Data Science Hackathon
Last weekend I participated in the Data Science London group's hackathon. The challenge was to take some data provided by EMI and use it to build a recommender system that could predict how much a user would like a track based on previous ratings, demographic data and some interview responses.
When I arrived at the event I grouped up with some guys from a company called Musicmetric. The team eventually split into two groups: a guy called Ben and I worked on the recommender system problem, while the rest of the Musicmetric team worked on building visualisations with the data.
The hackathon officially started at 1pm on Saturday London time and ran until 1pm the next day. I was one of a small group of people who survived the entire 24 hours; most of the participants went home late on Saturday evening or early on Sunday morning. The food provided was excellent, and allowed us to focus entirely on the problem. As a tea drinker I was slightly disappointed by the quality of the tea, but everything else was really good.
The hackathon took place in The HUB Westminster, which is a really nice work space. It is light and airy and there were even some rooms left intentionally dark for crashing in (I slept on a beanbag for about 2 hours, and would recommend that if you go to a future hackathon you take a thermarest/camping mattress).
The problem was hosted on the Kaggle platform, which provides training and test data, evaluates your predictions on the test set behind the scenes, and gives you back a score. You can see the scores of all the other participants, and within seconds of the competition starting a very good solution had been posted. This was probably because the data set was released before the competition started: someone could have trained a strong classifier ahead of time, validated it in cross validation, and then run it against the test data and submitted the moment the competition opened. The evaluation criterion was RMSE, which means we had to focus on minimising the overall distance between our predictions and the correct answers, as opposed to the number of instances we got exactly right.
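To make that distinction concrete, here is a quick sketch of RMSE (illustrative numbers only, not the competition's actual scoring code):

```python
import numpy as np

def rmse(predictions, targets):
    """Root mean squared error: large errors are penalised quadratically."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return np.sqrt(np.mean((predictions - targets) ** 2))

# Two predictors can be "wrong" on the same number of ratings yet score
# very differently: being far off on one rating hurts much more than
# being slightly off on all of them.
close = rmse([50, 60, 70], [55, 65, 75])   # off by 5 everywhere -> RMSE 5.0
far = rmse([50, 60, 70], [50, 60, 100])    # one rating off by 30 -> RMSE ~17.3
print(close, far)
```

This is why, under RMSE, hedging every prediction towards a plausible value beats occasionally being exactly right but occasionally wildly wrong.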
Our first solution was to apply simple collaborative filtering. This seemed like an obvious approach, because we were trying to build a recommendation system given a bunch of input (user, item, rating) triples and a bunch of (user, item) pairs to predict. The RMSE of this approach in cross validation was about 22 (out of 100), with a result of roughly 18 on the actual test set.
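For readers unfamiliar with the technique, here is a minimal user-based collaborative filtering sketch on toy data (this is not our actual implementation, and the numbers are invented): predict a user's rating for a track from the ratings of similar users, weighted by cosine similarity.

```python
import numpy as np

# Toy ratings matrix: rows = users, columns = tracks, 0 = unrated.
R = np.array([
    [80, 70,  0, 60],
    [75, 65, 90,  0],
    [ 0, 20, 30, 25],
], dtype=float)

def predict(user, item, R):
    """Predict R[user, item] from users who rated the item, weighted by similarity."""
    rated = R[:, item] > 0
    rated[user] = False
    if not rated.any():
        return float(R[R > 0].mean())  # global mean fallback
    sims = []
    for other in np.where(rated)[0]:
        # Cosine similarity over the tracks both users have rated.
        common = (R[user] > 0) & (R[other] > 0)
        if not common.any():
            sims.append(0.0)
            continue
        u, v = R[user, common], R[other, common]
        sims.append(float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v))))
    sims = np.array(sims)
    if sims.sum() == 0:
        return float(R[rated, item].mean())
    return float(sims @ R[rated, item] / sims.sum())

print(predict(0, 2, R))  # user 0's predicted rating for track 2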
We were given a lot of demographic information for each of the users, and it seemed to make sense to attempt to break our approach down by demographic bins. Trying various combinations of the demographic information we were given, however, yielded no gain in cross validation or against the actual testing data.
After racking our brains for a while we came up with the idea of using a random forest ensemble method: shoving all the demographic, interview response and other data in and having the forest classify in a brute force manner. This solution was implemented with roughly 2 hours to go until the end of the competition. Knowing we did not have long to run our solutions, we started with a very rough and ready approach and jumped several places in the rankings. Excited, we started running a number of different random forest solutions with different parameters to find which gave us the best jump. After determining that tree height was going to give the best results, we set two classifiers running with different tree heights, one on each of our laptops.
They both finished and we submitted them with a minute and 20 seconds to go until the end of the competition. We jumped all the way to third place, which was really exciting. The person who won used the exact same approach as we did, but had been running it since the start of the competition, which suggests that we may well have been able to win if we had had more time to fiddle with the parameters.
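A depth sweep like the one we ran can be sketched with scikit-learn's RandomForestRegressor (a toy reconstruction on random data, not our actual code; "tree height" corresponds to what scikit-learn calls max_depth):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Stand-in features (age, gender code, interview answers, ...) and ratings
# out of 100 -- purely illustrative, not the competition data.
rng = np.random.default_rng(0)
X = rng.random((500, 10))
y = 100 * rng.random(500)

# Since RMSE is the target metric, use a regressor and compare depths
# in cross validation before committing to a full run.
for depth in (4, 8, 16):
    model = RandomForestRegressor(n_estimators=50, max_depth=depth, random_state=0)
    scores = cross_val_score(model, X, y, cv=3,
                             scoring="neg_root_mean_squared_error")
    print(f"max_depth={depth}: RMSE {-scores.mean():.2f}")
```

On random data the depths all score similarly, but on real data with structure the sweep picks out a clear winner, which is what we then ran at full scale.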
Thoughts about the data
The data we were provided with by EMI contained a lot of information. We found, however, that the demographic information did not improve our classification accuracy at all. There are a couple of conclusions we could draw from this. The first is that music taste is not affected by age, gender, region or any of the other information we were provided with. I'm not sure I believe that 94 year old males have the same listening tastes as 16 year old females, so I'm going to reject this conclusion.
The more likely conclusion is that there wasn't enough data provided for demographic information to help. Every time you split by demographic you reduce the size of your training and validation sets. This means that the accuracy of the individual classifiers is reduced, and as such the accuracy of the overall classifier across all the bins is also reduced. Given a couple of orders of magnitude more data it might well have been the case that we were able to produce an accurate classifier based on demographic information.
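To make the arithmetic concrete, here's a toy illustration (all numbers invented, nothing to do with the real data set) of how quickly demographic binning dilutes a training set:

```python
import random
from collections import Counter

# Hypothetical: 10,000 ratings spread over age-band x gender x region bins.
random.seed(0)
bins = [(a, g, r) for a in range(8) for g in range(2) for r in range(12)]  # 192 bins
counts = Counter(random.choice(bins) for _ in range(10_000))

# Even a modest split leaves ~50 ratings per bin on average -- far too few
# to train a reliable per-bin classifier.
print(f"bins: {len(bins)}, smallest bin: {min(counts.values())} ratings")
```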
I had a great time at the Data Science Hackathon, and I would very much like to participate in another one in the future. There were prizes, free t-shirts, free good food and really excellent people who understand a lot more about machine learning and data mining than I do. I'm really, really glad that I went. I'd like to give a special shout out to Ben for being an awesome teammate, Greg for being supportive overnight when I began to burn out, and Carlos for running things and just being a generally awesome dude.