Wednesday, December 10, 2008

Progress Prize 2008

The long-awaited Progress Prize 2008 has finally been awarded. Of course, I immediately rushed to download the supporting papers to learn what is in the new super-accurate model discussed on BellKor’s web site, and what other goodies are in BellKor’s and BigChaos’ solution. If you haven’t checked the papers yet, do it before continuing to read this (references can be found in the NetflixPrize forum).


It appears that what’s new in 2008 is mostly about exploiting the dates. BellKor’s 2007 solution used the date in the global effects, but that was about it, so it seems logical that the 2008 solution brings more ways of making use of it. That’s not a surprise: much of PragmaticTheory’s recent improvement has come from using dates too. Still, there are differences, so we’ll see where that leads us. I know what I’ll be doing over the Christmas holidays.


What worries me is the approach using billions of parameters. My poor home PC can’t do that. Running Windows XP Home Edition, a process is limited to about 1.6 GB of memory (the 2 GB application address space minus the address space lost to DLL mapping). With about 400 MB typically used for the training data, that leaves room for about 150M double-precision parameters, far from the required number. Going single precision raises that to 300M, still far short. Running from disk is my only option, but the poor disk is almost full!


What’s funny is that 10 billion parameters is not only much larger than the training data size (roughly 100 million ratings), it is even larger than the 17770 movies x 480189 users problem space (approximately 8.5 billion cells). But then, the model introduces a third dimension (time) with 2243 distinct dates, giving a problem space of 2243 x 17770 x 480189 = 19,139 billion (almost the cost of a bank bailout). Fair enough, but I still have to ask Santa for a new PC.
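
For the curious, here is a quick back-of-the-envelope check of those numbers (rough Python, using the approximate figures quoted above):

    # Back-of-the-envelope check, in round decimal units.
    GB, MB = 1e9, 1e6

    usable = 1.6 * GB              # ~2 GB user address space minus DLL fragmentation
    training_data = 400 * MB       # in-memory copy of the ~100 million ratings
    budget = usable - training_data

    print(budget / 8 / 1e6, "M double-precision parameters")   # ~150
    print(budget / 4 / 1e6, "M single-precision parameters")   # ~300

    # Problem-space sizes
    movies, users, dates = 17770, 480189, 2243
    print(movies * users)          # ~8.5 billion movie-user cells
    print(movies * users * dates)  # ~19,139 billion once time is added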


Enough rambling, I need to write some code. We’ve been falling behind…


Martin for PragmaticTheory

Tuesday, October 14, 2008

There is evil there that does not sleep; the Great Eye is ever watchful.

Singular Value Decomposition. SVD. This is one of the most talked-about and documented algorithms used in the Netflix challenge. It is one of great simplicity... but also of great power.

Applied to the Netflix data in its most basic form, SVD is a method which automatically assigns a number of factors to each movie and corresponding factors to each user. Movie factors basically represent aspects of a movie which have influenced user ratings in the sample set. User factors represent how much each user is influenced by those specific aspects of the movies. And the magic comes from the fact that, by optimizing on the training data set, the aspects that most influence users are discovered automatically.
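
For readers who like to see things in code, here is a minimal sketch of the kind of SGD-trained matrix factorization that the Netflix Prize community calls SVD. This is not our actual implementation, and the factor count, learning rate and regularization below are illustrative guesses (a pure-Python loop like this would also be far too slow for 100 million ratings; it only shows the idea):

    import numpy as np

    def train_svd(ratings, n_users, n_movies, n_factors=8,
                  lrate=0.01, reg=0.02, n_epochs=30):
        """ratings: list of (user, movie, rating) triplets with 0-based ids."""
        # Start from small random user and movie factor vectors.
        U = np.random.normal(0, 0.1, (n_users, n_factors))
        M = np.random.normal(0, 0.1, (n_movies, n_factors))
        for _ in range(n_epochs):
            for u, m, r in ratings:
                err = r - U[u].dot(M[m])   # error on this one rating
                # Gradient step on both factor vectors
                # (right-hand side is evaluated before either is updated).
                U[u], M[m] = (U[u] + lrate * (err * M[m] - reg * U[u]),
                              M[m] + lrate * (err * U[u] - reg * M[m]))
        return U, M

    def predict(U, M, u, m):
        return U[u].dot(M[m])

The rows of U are the user factors and the rows of M are the movie factors described above; the training loop is what discovers the influential aspects automatically.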

This algorithm is not only very good at predicting future user ratings, it also gets very interesting when you analyse its results. One way to look at the SVD results is to build movie lists by sorting the movies along the different factors and then taking the extremities (top and bottom movies for each factor). To support this blog, we ran an SVD with 8 factors and published such movie lists.
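
Building those lists takes only a few lines once the factors are trained; a sketch, assuming a movie factor matrix M like the one in the previous snippet and a parallel movie_titles list:

    import numpy as np

    def factor_extremes(M, movie_titles, factor, k=10):
        """Top and bottom k movies along one column (factor) of the movie matrix M."""
        order = np.argsort(M[:, factor])
        bottom = [movie_titles[i] for i in order[:k]]
        top = [movie_titles[i] for i in order[::-1][:k]]
        return top, bottom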

Categorizing movies this way can be fun. Seeing some of my favorite movies (Fight Club, Seven, American Beauty, Memento and Jackass... don't judge me for liking to watch idiots hurt themselves) bunched in a specific category (factor 3) is pretty cool. But for me, this analysis gets really interesting when you think about what it can tell you about users. I'm sure people don't realize, when they rate movies like this, that they're actually giving the site a lot of information about themselves (and not just their taste in movies).

For example, if someone has a very high value for factor 1, I would bet a lot of money that they wear skirts and makeup (I would've said that they were women, but I didn't want to offend anyone). Also, I'm pretty sure that churches, NRA meetings and Republican conventions are littered with low-ranking factor 3s (my arch nemesis) and low-ranking factor 6s. Conversely, I'm sure the Democrats would find some supporters among low-ranking factor 8s. This analysis is somewhat naive and simplistic, but with some additional work, I'm sure sex, age, race, income, etc. could be inferred with fairly high accuracy, simply by analysing people's movie ratings.

So next time you're registering on a web site and think you're going undercover by not filling out the demographic information, think again... the Great Eye is ever watchful.

Perhaps this is why all these ad-placement companies keep sending us job offers.

Monday, September 8, 2008

9%

Well, we finally reached the 9% improvement mark today. We have been trying to achieve this milestone for a while now. Surprisingly, it did not come out of a new algorithm. The final step was the result of a minor improvement to an algorithm we implemented last spring. Progress is slow; sometimes it seems like 10% is going to take forever, if it's possible at all.

Sunday, August 3, 2008

You want the truth, you can't HANDLE the truth!

Movie data. Most of the top teams competing in the Netflix challenge must have had to answer a lot of questions about movie data. Here is an actual conversation I had with a friend a couple of weeks ago (note that I had had a few beers at the time, so don't go to court with any of these quotes).

"[Lay Person - Not the actual name of the person]: Hey, so I heard you're competing in that netflix challenge thing. Pretty cool.
[Pragmatic Theory]: Yeah.
[LP]: So what kind of data do you get?
[PT]: Movie titles and years, user identification number, rating and date of rating.
[LP]: That's it? No information on movie genre or anything?
[PT]: Nope.
[LP]: That's strange... isn't it really important to predict user ratings?
[PT]: (*having a sip, knowing where this is going*)
[LP]: Hey! I have an idea! Did you guys think of mining this information on IMDB or something?
[PT]: Well, actually, external movie data is not useful. The algorithms find the proper classifications automatically.
[LP]: (*pause*) Huh?
[PT]: For example, the movie genre on a site like Netflix, Amazon or IMDB is the opinion of one person on how movies should be categorized. The algorithms actually find categories that indicate how movies influence all users.
[LP]: (*dumb look* - LP has also had a few drinks) But wouldn't your algorithms just be better with more data?
[PT]: Believe me, movie data is really not useful.
[LP]: (*looking unconvinced*) OK... you're sure?... huh.... Really?
[PT]: (*having a bigger sip*)
[LP]: OK then... What about user information? Do you have any of that? I'm sure if you knew users' sex, age group and such, that would help to make predictions... and I'm sure Netflix asks for that when you register... men and women don't have the same tastes in movies, that's for sure (*uncomfortable laugh*)
[PT]: Nope, no user information either. And that wouldn't be useful anyway. The algorithms actually find these types of user classifications automatically too...
[LP]: (*dumb struck*) Wha?
[PT]: Movie or user data is simply not helpful, because the algorithms are just too good at capturing the details and nuances that influence user ratings... Believe me, we tried!
[LP]: (*stares in disbelief and walks away thinking that I don't understand this problem and that he would do better...*)
[PT]: (*chugging the rest of my beer*)"

A couple of months ago, that could have been me arguing with someone about the usefulness of external movie data. Team PragmaticTheory was actually founded with the belief that we could do better than other teams because we did not have this pre-conceived notion that movie data was useless. We would implement all the machine learning algorithms, then add some data from various sources... and we would surely beat out the top teams and win the million... Boy, were we wrong!

One of the first things I did on this project was to mine a couple of sites (talk to my lawyers to find out how and which ones) to see if we could get good coverage of the movies in the dataset. I did pretty well, and we got a good set of movie data to play with. This data was actually useful in the first few weeks: the models using it did better than some of our early pure machine learning algorithms. Unfortunately, as soon as we started implementing some of the more common, documented algorithms, the movie-data-based models got pruned out of the mix. We tried to get a bit fancier and build some more complex algorithms around the movie data. Still, the pure machine learning ones are systematically better.

Why? Well, my interpretation is that movie data is just too black and white. User tastes are infinite shades of grey (think floating-point shades of grey). It's not true that someone likes all sci-fi movies. And no one can enjoy all the Tom Hanks movies equally. But the algorithms can figure out the subtle nuances that define user rating patterns. They can figure out that you really like sci-fi comedies that have a happy ending, but that you enjoy the sci-fi/horror genre, where one of the main characters dies, a bit less. They can also figure out that you're a huge fan of Tom Hanks, but that you hate sappy girly flicks... so even if your favorite man is there, there's no saving Sleepless In Seattle and You've Got Mail from being sent to the junk pile.

My explanation is a bit simplistic, but honestly, to anyone out there who still thinks that extra movie data might be useful for predicting user ratings, I say you have to have faith in the machine. It's just smarter than we are.

Saturday, July 26, 2008

Blending 101

(a.k.a. why one submission a day is enough)

Serious attempts at the Netflix challenge require blending results from many algorithms. Blending is an operation that transforms multiple estimates into a single, higher-accuracy estimate. This is a brief tutorial on the steps involved. Experienced Netflix participants should not bother to read further.

Step 1: construct a reduced training set

To blend the models, you first need to construct a reduced training set by excluding from the Netflix-provided training set all ratings present in the probe set.
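
In code this is just a filter on (user, movie) pairs; a minimal sketch, assuming the data has already been parsed into the hypothetical structures named below:

    def reduced_training_set(training, probe_pairs):
        """training: list of (user, movie, rating) triplets.
        probe_pairs: set of (user, movie) pairs taken from the probe file."""
        return [(u, m, r) for (u, m, r) in training
                if (u, m) not in probe_pairs]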

Step 2: train your different models on the reduced training set

For now we train each individual model on the reduced training set. Later we will re-train all the models on the full training set. To re-train in a consistent way, it is critical to carefully record at this step all the parameters used, the number of training epochs, etc.
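
One simple way to keep that bookkeeping honest is to dump the exact settings of every run to disk so step 5 can reproduce them on the full training set (the file name and fields below are purely illustrative):

    import json

    run_record = {
        "model": "svd_8_factors",   # illustrative model name
        "n_factors": 8,
        "lrate": 0.01,
        "reg": 0.02,
        "n_epochs": 30,
        "training_set": "reduced",  # will become "full" at step 5
    }
    with open("svd_8_factors.json", "w") as f:
        json.dump(run_record, f, indent=2)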

Step 3: predict the probe set

For each model trained on the reduced training set, predict the probe set.
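
In code, this step just collects one column of predictions per model; a sketch assuming each trained model exposes a hypothetical predict(user, movie) method:

    import numpy as np

    def predict_probe(models, probe_pairs):
        """Returns an (n_probe, n_models) matrix: one row per probe rating,
        one column per model."""
        return np.array([[model.predict(u, m) for model in models]
                         for (u, m) in probe_pairs])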

Step 4: select your favorite blending recipe

This step receives as input the predicted probe set results from each model, and the real probe set ratings. The output is a function that mixes the individual model predictions into a blended prediction, hopefully better than any individual result. A simple linear regression will get you a long way, but feel free to experiment with your favorite machine learning algorithm. What is key here is that any unknown coefficients (for example, the linear regression coefficients) can be selected to minimize the error between the blended prediction and the real probe set scores.

N.B. If overfitting the blending function is an issue, partition the probe set in two: use one part for training the function and the other for cross-validation.
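
As an illustration, here is a minimal sketch of the plain linear-regression version of step 4, using numpy's least-squares solver; the commented-out lines show the hold-out split from the N.B. above. The matrix probe_preds is the output of the step-3 sketch, and probe_ratings holds the real probe scores:

    import numpy as np

    def fit_blend(probe_preds, probe_ratings):
        """Least-squares blending weights (with an intercept term)."""
        X = np.column_stack([probe_preds, np.ones(len(probe_ratings))])
        w, *_ = np.linalg.lstsq(X, probe_ratings, rcond=None)
        return w

    def blend(preds, w):
        X = np.column_stack([preds, np.ones(len(preds))])
        return X.dot(w)

    # Hold-out variant (see the N.B.): fit on one half, check the RMSE on the other.
    # half = len(probe_ratings) // 2
    # w = fit_blend(probe_preds[:half], probe_ratings[:half])
    # rmse = np.sqrt(np.mean((blend(probe_preds[half:], w) - probe_ratings[half:]) ** 2))

The same blend function, with the same weights and the same model ordering, is what gets reused at step 6 on the qualifying-set predictions of the re-trained models.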

Step 5: re-train all models using the full training set

At this point, we are preparing for our final predictions. To get the best possible accuracy, we re-train all models using the full training set. The addition of the probe set data to the training data can result in an accuracy boost of 0.0060 or more.

Step 6: blend the re-trained models

Here's the leap of faith. We assume that the blending function we computed at step 4 is still a valid function to blend the re-trained models. For this to work, the two sets of models must be computed with a rigorously equivalent methodology. Also, the selected function from step 4 must generalize well and avoid overfitting. This is not an issue with a simple linear regression, but may become problematic for complex machine learning methods with many degrees of freedom.

Step 7: clamp the final predictions

Here's a hint: clamping values between 1 and 5 is not optimal.

If this is done well, then improvements in the models can be measured after step 4 by comparing the accuracy on the probe set. Values on the qualifying set will be better by 0.0060 or more, but this offset should be very consistent from one submission to another. Lately I have been getting 0.0068 +/- 0.0001.

Tuesday, July 22, 2008

A little humor to start things off...

Here is the text from the web page we initially wanted to put up:

PragmaticTheory : Solving the netflix challenge through divination...

Following the concepts of numerology, team PragmaticTheory was formed on July 7th, 2007 (7/7/7) and started working on the Netflix challenge 7 months, 7 weeks and 7 days later. The team consists of 2 human beings with respectively 2 eyes, 2 arms, 2 legs and, most importantly, 2 powerfully energized chakras.

Our strategy is not to use unproven techniques such as mathematics, matrices and algorithms. Instead, we are tapping into the universe's hidden powers to uncover users' force fields in order to forecast their ratings with high accuracy. Skeptics will most likely dismiss our techniques as mere superstition, but we think that the progress we've made so far on the Netflix leaderboard speaks for itself.

Here are some more details on the divination methods that our team uses:

- Sagittarius Virgo Decomposition (SVD): The astrological signs of the users in the dataset are captured through astral vibration sensors and cross-referenced with star maps and tide charts to predict future movie ratings.

- Asymmetrical Tea-Leaf Models: An array of tea kettles and cups was used to generate 50 million tea-leaf patterns, which were individually digitally photographed. Pattern recognition software was implemented to detect asymmetries and assign weights accordingly.

- Red-King Black-Queen Machine (RBM): Tarot spreads were simulated for each user in the dataset and their fates interpreted through an automated karma analyser. Sadly, thousands of users were declared dead at the time of their predicted rating, yielding major sparseness issues...