Pragmatic Theory

Netflix Grand Prize technical presentation

2009-10-01T09:51:00.004-04:00

The slides that we used for the Netflix Grand Prize technical presentation on September 21, 2009 are available here.

By a nose... or is it a hair...

2009-07-28T10:25:00.001-04:00

Wow. What a crazy 24 hours that was.

After being quietly confident about our position but utterly nervous about the general silence on the leaderboard, we were struck by lightning when the newest coalition of coalitions, The Ensemble, submitted an entry just above ours, taking over first place merely 24 hours before the contest deadline. We were, of course, expecting the other parties to come to an agreement and join forces, but we were somehow hoping that they would come up short.

Well, with 24 hours to go and at least 0.01% to come up with, we weren't going to go down without a fight. Many people offered their help (a big thanks to all that did). Predictors were blended in. New techniques were tried out. Code was written. Nothing seemed to be helping to tip the scale... In the end, with less than a half hour to go, Yehuda and Martin P. scraped up a few new predictors, Michael and Andreas worked some more blending magic and we barely made it to a 10.09% tie for first. We had accomplished the day's mission and were now hoping that our test set score would be good enough to edge out a win.

Four short minutes before the end of the competition, another lightning bolt. The Ensemble had submitted at 10.10% and had appeared to have sealed the deal. We could now only pray that they had overfit the quiz set. We too had done our fair share of quiz set blending, but we had the advantage of having had a month's worth of experiments to tweak the regularization. Now, the contest was over anyway, it was out of our hands and all we could do was wait.

The wait was excruciating. Without much enthusiasm emails were exchanged internally. Hope was slim. Then the word came out on twitter that the Ensemble had won. It was over. We had lost.

All of a sudden, when we were no longer expecting it, an email from Netflix came in... Subj: "Netflix Prize Grand Prize Verification"... "Congratulations! The test subset performance of your team “BellKor's Pragmatic Chaos” on the following submission makes your team the current top contender for the Grand Prize."... Sorrow turned into Joy... We had succeeded. We couldn't believe it. After having lost all hope, we had come out on top. Now the only thing standing between us and the Grand Prize was the verification process. Truly an amazing and unexpected finish.

We won't know the details of the test set results for a little while, but it's possible that we actually finished with the same test score as The Ensemble. If that is the case, then the tie breaker would be the submission time of those tied entries, which are most likely both the ones from July 26th. That would mean that we have the lead only because we submitted our final result 20 minutes before theirs. Almost three years of competition may have come down to 20 short minutes. Again, amazing.

As we enter the evaluation process, we would like to thank everyone for their participation in this contest. In the end, there can only be one winner, but it wouldn't have been such a great competition without everyone else out there that worked long and hard hours on this crazy project. We have met truly great and interesting people along the way. Walking this long and winding road with all of you was certainly the best part of this adventure.

Cheers!

What's in a name?

2009-06-23T21:19:00.001-04:00

As most of our readers must have already seen, we made a big splash today by forming a coalition with our closest competitors. There will be time to answer all of the burning questions about the joined team, but for now, I would like to start on a lighter note: the team name.

While BellKor's Pragmatic Chaos may not be the sexiest of names, in the end, it was chosen because we felt that it best served the main purpose which was to give credit to each joining team and to provide instant recognition of what the new team represented. Also, it had a better ring to it than some other combinations like Pragmatic BellKor Chaos or PT in BK in BT.

This was a tough decision, because we came up with quite a few creative ideas. Here is a rundown of all the names that were discussed along the way. Credit goes out to all members of the coalition.

First runner up:
The Usual Suspects - This idea stems from a quote in the movie Casablanca ("Round up the usual suspects"). While this is certainly a catchy name, we (PT) didn't feel that it included us entirely because we have not been recognized officially in the past, so are not immediate "usual suspects".

Second runner up:
Million Dollar Baby - A most appropriate movie reference. This was the early favorite, but was eliminated because we felt that putting emphasis on the financial aspect didn't represent the spirit of the contest or of our coalition. This is also why we eliminated Show Me The Money as a potential name.

Category cocky:
Resistance is Futile - With the release of the new Star Trek movie, we thought that this quote was pretty cool... but perhaps a bit too aggressive.
The Dream Team - Again, a bit too arrogant, but this one is also a funny movie reference. Imagine a bunch of patients in a psychiatric ward working on the Netflix prize...
Catch Us If You Can - Another movie reference, but we didn't want to tempt people into actually catching us...

Category movie quotes:
Gonna Need a Bigger Boat (Jaws) - I love this one... I can imagine 7 guys from around the globe, that don't know each other too much, piled into this small life raft trying to get away from this huge shark... "Yeah, hmmm, I think we're gonna need a bigger boat here..."

Other suggested quotes:
Not in Kansas Anymore (Wizard of Oz)
Go Ahead, Make My Day (Sudden Impact)
Like A Box of Chocolates (Forrest Gump)
Another Nice Mess (Laurel and Hardy)
The Kindness of Strangers (Streetcar Named Desire)

Miscellaneous:
Going All In - I actually liked this poker reference a lot... how it indicated that this was the final hand, win or lose.
All Aboard
First and Ten
Mission accomplished
42 - The answer to life, the universe and everything.
A small step for math

You see, we engineers, math whizzes and scientists can also be creative... but in the end, we do make the most logical choice... that's just the way we are.

Netflix working on top secret project?

2009-05-03T20:37:00.000-04:00

The people at Netflix are a clever bunch. Very clever indeed. All this time, they have led us to believe that the goal of this contest was to improve their movie recommendation engines. Well we, at Pragmatic Theory, have uncovered the truth behind this sham.

The reality is that the goal of this contest is to keep the brightest minds in the world occupied, working on this futile project, so that their scientists can be the first to complete work on their real mission: time travel.

This might sound a bit unrealistic, and you may ask us "Do you have any proof sir?"... oh but of course... yes, we have discovered hard evidence that, not only are they working on a time travel machine, but in fact, they have already found a breach in the space-time continuum...... on with the facts.

By closely examining the Netflix dataset, one can find that 7 movies have a release year of NULL. While this is strange in itself, and puzzled us at first, we fortunately found help in the good old Netflix prize forum. Some great people have corrected this obvious mistake by finding the proper DVD release years. But here's the kicker... upon examining the dataset more closely, it seems that many customers have rated these movies two, three, even four years before their actual release date. To us, the only logical explanation is that Netflix has a working time portal and has allowed a select few customers the privilege to use it. Of course, these movie buffs did the only logical thing when being propelled into the future... rent some new releases.

What? This isn't enough proof? Ahh but there's more...

In this other post on the forum, the prizemaster indicates that it is OK to use the data published as part of the KDD cup 2006. This is an additional set of ratings of the same movies, by the same users, but in the year 2006. Great! More data is good. But wait a minute... close examination of the dataset shows that 6 customer-movie pairs are found both in the Netflix prize quiz set AND in the KDD cup set... how can these people have rated the same movie in 2005 AND in 2006... bingo. Time travel.

Oh, I can hear you from here: "People are allowed to re-rate movies on the Netflix site"... ahhhh but why would someone re-rate a movie, only to give it the same rating? Impossible.

Those Netflix chaps thought they had it all planned out... good thing that we have uncovered this little plot... now perhaps we can beat them to the punch... if I can just find a street long enough to get this DeLorean to hit 88 MPH...

Friday The 13th

2009-03-13T20:59:00.000-04:00

[LP] Hey, I saw you finally made it to number one in that netflix contest. Did you implement that idea I gave you... you know, the one where you boost the ratings of horror films on Friday the 13th?
[PT] (sigh)

No, we didn't use movie metadata in our final push to number one. Instead, we relied on our original methodology... As the full moon sailed high, at midnight on Friday the 13th, we slaughtered a goat and offered its liver up to the gods...

Seriously, we're very happy about our recent progress and achieving this milestone in a little over a year. Thanks to everyone who has supported our team in various ways. Since we don't know how long we'll be on top, we captured this moment for posterity: http://pragmatictheory.googlepages.com/numberone

Now I'll go back to my BBQ. That goat will make a great roast.

All the way around the sun...

2009-03-03T20:46:00.000-05:00

Thirty-one million five-hundred-thirty-six thousand seconds ago, we were two guys with an itch and a bit of spare time.
Five-hundred-twenty-five thousand six hundred minutes ago, we read an article about an interesting contest.
Eight thousand seven hundred sixty hours ago, we got a couple of ideas.
Three-hundred and sixty-five days ago, we knew nothing about collaborative filtering or matrix factorization.
Fifty-two weeks ago, we wondered how far engineering could take us in this world full of scholars, PhDs and other geniuses.
Twelve months ago, we decided to give it a shot.

Today marks the one year anniversary of team Pragmatic Theory.

Let's take a look back at some of the interesting milestones we achieved this year:
- March 9th 2008 : First submission ever. Not very impressive: 0.9862
- May 10th 2008: Cracking the top 40. A little over 2 months in: 0.8822
- June 1st 2008: Cracking the top 10. The hill is getting steeper: 0.8731
- June 18th 2008: Above the Progress Prize 2007 line. Sixth place: 0.8707
- June 20th 2008: Cracking the top 5. June was a good month: 0.8699
- September 8th 2008: Second Place. Not for very long: 0.8655
- November 21st 2008: New York Times article revealing that we are, indeed, geeks.
- December 26th 2008: Back in second place and top individual team. Merry Christmas: 0.8620
- January 11th 2009: Breaking the Progress Prize 2008. Things are getting interesting: 0.8614
- February 27th 2009: When you type "pragmatic the" in Google, it actually suggests "pragmatic theory netflix".

To mark this wonderful anniversary, we decided to reveal some of our secrets. Follow this link to find out more.

8756

2009-02-12T22:04:00.000-05:00

We have been working recently on variants of BellKor's integrated model as described in their 2008 progress prize paper. We obtained results very similar to the published numbers: our implementation achieved 0.8790 RMSE on the Quiz Set (f=200), compared to the reported 0.8789.

This model proved superior to our own flavor of integrated model. However, what is interesting is that we were able to leverage the best of both models and combine them together. This combined model achieved a Quiz set RMSE of 0.8756 (f=200). This is, to our knowledge, the best reported number for a model without blending. On today's leaderboard, this would achieve the 47th rank by itself.

Progress Prize 2008

2008-12-10T19:59:00.000-05:00

The long awaited Progress Prize 2008 has finally been awarded. Of course, I immediately rushed to download the supporting papers and learn what is in the new super-accurate model discussed on BellKor’s web, and what other goodies are in BellKor’s and BigChaos’ solution. If you haven’t checked the papers yet, do it before continuing to read this (references can be found in the NetflixPrize forum).

It appears that what’s new in 2008 is mostly about exploiting the dates. BellKor’s 2007 solution used the date in the global effects, but that was about it. It seems logical that 2008 brings more ways of making use of it. That’s not a surprise: much of PragmaticTheory’s recent improvement has been about using dates too. Still, there are differences, so we’ll see where that leads us. I know what I’ll be doing over the Christmas Holidays.

What worries me is the approach using billions of parameters. My poor home PC can’t do that. Running Windows XP Home Edition, a process is limited to about 1.6 GB of memory (2GB application address space minus the address space lost due to DLL mapping). With about 400MB used for the training data (typically), that leaves about 150M double precision parameters, far from the required number. Going single precision raises the number to 300M, still far short. Running from disk is my only option, but the poor disk is almost full!
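For readers who want to redo the arithmetic, here is a minimal sketch of that memory budget. It only restates the figures quoted above; treating "MB" as 10**6 bytes is an assumption, and the function name is made up for illustration.

```python
# Back-of-the-envelope memory budget, using the figures quoted in the post.
# Assumption: "MB" means 10**6 bytes (the post does not say).
MB = 10**6

process_limit_bytes = 1600 * MB   # about 1.6 GB usable by one process
training_data_bytes = 400 * MB    # about 400 MB taken by the training data

available_bytes = process_limit_bytes - training_data_bytes


def max_parameters(available, bytes_per_parameter):
    """Number of parameters that fit in the available memory."""
    return available // bytes_per_parameter


double_params = max_parameters(available_bytes, 8)  # 64-bit floats
single_params = max_parameters(available_bytes, 4)  # 32-bit floats

print(f"double precision: {double_params / 1e6:.0f}M parameters")
print(f"single precision: {single_params / 1e6:.0f}M parameters")
```

Either way, the budget is well over an order of magnitude short of a model with billions of parameters.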

What’s funny is that 10 billion parameters is not only much larger than the training data size (roughly 100 million ratings), but it is even larger than the 17770 movies X 480189 users (approximately 8.5 billion) problem space. Still, the model introduces a third dimension (time) with 2243 different dates, resulting in a problem space of 2243 X 17770 X 480189 = 19139 billion (almost the cost of a bank bailout). Fair enough, but I still have to ask Santa for a new PC.
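The problem-space sizes above are plain multiplication, and can be checked with a few lines. This is only a sketch using the counts quoted in the paragraph; the variable names are invented here.

```python
# Counts quoted in the post.
n_movies = 17770
n_users = 480189
n_dates = 2243
n_model_parameters = 10 * 10**9   # the "10 billion parameters" figure

# Two-dimensional problem space: one cell per (movie, user) pair.
movie_user_cells = n_movies * n_users

# Adding time as a third dimension: one cell per (date, movie, user) triple.
date_movie_user_cells = n_dates * movie_user_cells

print(f"movies x users:         {movie_user_cells / 1e9:.1f} billion")
print(f"dates x movies x users: {date_movie_user_cells // 10**9} billion")
print("more parameters than (movie, user) cells:",
      n_model_parameters > movie_user_cells)
```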

Enough rambling, I need to write some code. We’ve been falling behind…

Martin for PragmaticTheory

There is evil there that does not sleep; the Great Eye is ever watchful.

2008-10-14T22:34:00.001-04:00

Singular Value Decomposition. SVD. This is one of the most talked about and documented algorithms used in the Netflix challenge. It is one of great simplicity... but also of great power.

Applied to the Netflix data in its most basic form, the SVD is a method which automatically assigns a number of factors to each movie and the corresponding factors to each user. Movie factors basically represent aspects of a movie which have influenced user ratings in the sample set. User factors represent how much each user is influenced by those specific aspects of the movies. And the magic comes from the fact that, by optimizing on the training data set, the aspects that most influence users are discovered automatically.

This algorithm is not only very good at predicting future user ratings, it also gets very interesting when you analyse its results. One way to look at the SVD results is to build movie lists by sorting them along the different factors and then taking the extremities (top and bottom movies for each factor). To support this blog, we ran an SVD with 8 factors and published such movie lists.
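To make the idea concrete, here is a minimal sketch of the kind of factor model usually meant by "SVD" in the Netflix contest: user and movie factors learned by stochastic gradient descent on the known ratings, followed by sorting the movies along one factor. This is not the code used for the published lists; the toy ratings, the learning rate, the regularization value and the use of 2 factors instead of 8 are all assumptions made for illustration.

```python
import random

# Toy (user, movie, rating) triples, invented for illustration only.
ratings = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1),
    (1, 0, 4), (1, 1, 5), (1, 3, 1),
    (2, 2, 5), (2, 3, 4), (2, 0, 1),
    (3, 2, 4), (3, 3, 5), (3, 1, 2),
]
n_users, n_movies, n_factors = 4, 4, 2

# Small random starting values; a fixed seed keeps the run repeatable.
rng = random.Random(0)
user_f = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)]
          for _ in range(n_users)]
movie_f = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)]
           for _ in range(n_movies)]
mean = sum(r for _, _, r in ratings) / len(ratings)


def predict(u, m):
    """Global mean plus the dot product of user and movie factors."""
    return mean + sum(pu * qm for pu, qm in zip(user_f[u], movie_f[m]))


def rmse():
    """Root mean squared error on the training ratings."""
    total = sum((r - predict(u, m)) ** 2 for u, m, r in ratings)
    return (total / len(ratings)) ** 0.5


rmse_before = rmse()
learning_rate, regularization = 0.05, 0.02   # hypothetical settings
for _ in range(200):
    for u, m, r in ratings:
        err = r - predict(u, m)
        for f in range(n_factors):
            pu, qm = user_f[u][f], movie_f[m][f]
            user_f[u][f] += learning_rate * (err * qm - regularization * pu)
            movie_f[m][f] += learning_rate * (err * pu - regularization * qm)
rmse_after = rmse()

# Sort movies along factor 0; the two ends of this list are the
# "extremities" used to build a movie list for that factor.
movies_by_factor0 = sorted(range(n_movies), key=lambda m: movie_f[m][0])

print("training RMSE before and after:", rmse_before, rmse_after)
print("movies sorted along factor 0:", movies_by_factor0)
```

Nobody tells the model what the factors mean. Whatever meaning they have comes entirely from which movies end up at each end of the sorted list.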

Categorizing movies this way can be fun. Seeing some of my favorite movies Fight Club, Seven, American Beauty, Memento and Jackass (don't judge me for liking to watch idiots hurt themselves...) bunched in a specific category (factor 3) is pretty cool. But for me, this analysis gets interesting when you think about what this can tell you about users. I'm sure people don't realize when they rate movies like this, that they're actually giving the site a lot of information about themselves (and not just their taste in movies).

For example, if someone has a very high value for factor 1, I would bet a lot of money that they wear skirts and makeup (I would've said that they were women, but I didn't want to offend anyone). Also, I'm pretty sure that churches, NRA meetings and Republican conventions are littered with low ranking factor 3s (my arch nemesis) and low ranking factor 6s. Conversely, I'm sure the Democrats would find some supporters in low ranking factor 8s. This analysis is somewhat naive and simplistic, but with some additional work, I'm sure sex, age, race, income, etc. could be inferred with fairly high accuracy, simply by analysing people's movie ratings.

So next time you're registering on a web site and think you're going under-cover by not filling out the demographic information, think again... the Great Eye is ever watchful.

Perhaps this is why all these ad-placement companies keep sending us job offers.

9%

2008-09-08T20:10:00.000-04:00

Well, we finally reached the 9% improvement mark today. We have been trying to achieve this milestone for a while now. Surprisingly, it did not come out of a new algorithm. The final step was the result of a minor improvement to an algorithm we implemented last spring. Progress is slow; sometimes it seems like 10% is going to take forever, if it is possible at all.

You want the truth, you can't HANDLE the truth!

2008-08-03T21:20:00.000-04:00

Movie data. Most of the top teams competing in the netflix challenge must have had to answer a lot of questions about movie data. Here is an actual conversation I had with a friend a couple of weeks ago (Note that I had had a few beers at the time, so don't go to court with any of these quotes).

"[Lay Person - Not the actual name of the person]: Hey, so I heard you're competing in that netflix challenge thing. Pretty cool.
[Pragmatic Theory]: Yeah.
[LP]: So what kind of data do you get?
[PT]: Movie titles and years, user identification number, rating and date of rating.
[LP]: That's it? No information on movie genre or anything?
[PT]: Nope.
[LP]: That's strange... isn't it really important to predict user ratings?
[PT]: (*having a sip, knowing where this is going*)
[LP]: Hey! I have an idea! Did you guys think of mining this information on IMDB or something?
[PT]: Well, actually, external movie data is not useful. The algorithms find the proper classifications automatically.
[LP]: (*pause*) Huh?
[PT]: For example, the movie genre on a site like Netflix, Amazon or IMDB is the opinion of one person on how movies should be categorized. The algorithms actually find categories that indicate how movies influence all users.
[LP]: (*dumb look* - LP has also had a few drinks) But wouldn't your algorithms just be better with more data?
[PT]: Believe me, movie data is really not useful.
[LP]: (*looking unconvinced*) OK... you're sure?... huh.... Really?
[PT]: (*having a bigger sip*)
[LP]: OK then... What about user information? Do you have any of that? I'm sure if you knew users' sex, age group and such, that would help to make predictions... and I'm sure netflix asks that when you register... men and women don't have the same tastes in movies, that's for sure (*uncomfortable laugh*)
[PT]: Nope, no user information either. And that wouldn't be useful anyway. The algorithms actually find these types of user classifications automatically too...
[LP]: (*dumbstruck*) Wha?
[PT]: Movie or user data is just not helpful because the different algorithms are just too good at capturing the details and nuances that influence user ratings... Believe me, we tried!
[LP]: (*stares in disbelief and walks away thinking that I don't understand this problem and that he would do better...*)
[PT]: (*chugging the rest of my beer*)"

A couple of months ago, that could have been me arguing with someone about the usefulness of external movie data. Team PragmaticTheory was actually founded with the belief that we could do better than other teams because we did not have this pre-conceived notion that movie data was useless. We would implement all the machine learning algorithms, then add some data from various sources... and we would surely beat out the top teams and win the million... Boy, were we wrong!

One of the first things I did on this project was to mine a couple of sites (talk to my lawyers to find out how and which ones) to see if we could get good coverage on the movies in the dataset. I actually did pretty well and we got a good set of movie data to play with. This data was actually useful in the first few weeks. The models using it did better than some of our early pure machine learning algorithms. Unfortunately, as soon as we started implementing some of the more common, documented algorithms, the movie-data-based models got pruned out of the mix. We tried to get a bit fancier and build some more complex algorithms around the movie data. Still, the pure machine learning ones are systematically better.

Why? Well, my interpretation is that movie data is just too black and white. User tastes are infinite shades of grey (think floating point shades of grey). It's not true that someone likes all sci-fi movies. And no one can enjoy all the Tom Hanks movies equally. But the algorithms can figure out the subtle nuances that define user rating patterns. They can figure out that you really like sci-fi comedies that have a happy ending, but that you enjoy the sci-fi/horror genre, where one of the main characters dies, a bit less. They can also figure out that you're a huge fan of Tom Hanks, but that you hate sappy girly flicks... so even if your favorite man is there, there's no saving Sleepless In Seattle and You've Got Mail from being sent to the junk pile.

My explanation is a bit simplistic, but honestly, to anyone out there that still has any doubts that extra movie data may be useful to predict user ratings, I say that you have to have faith in the machine. It's just smarter than we are.

Blending 101

2008-07-26T08:05:00.000-04:00

(a.k.a. why one submission a day is enough)

Serious attempts at the Netflix challenge require blending results from many algorithms. Blending is an operation that transforms multiple estimates into a single higher accuracy estimate. This is a brief tutorial of the steps involved. Experienced Netflix participants should not bother to read further.

Step 1: construct a reduced training set

To blend the models, you need to construct a reduced training set by excluding from the Netflix-provided training set all ratings present in the probe set.
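As a minimal sketch of step 1, assuming ratings are held as (user, movie, rating) tuples in memory (the real files are far larger and laid out differently), the reduced set is a simple filter on (user, movie) pairs. The toy data below is invented.

```python
# Hypothetical toy data: each entry is (user_id, movie_id, rating).
training_set = [(1, 10, 4), (1, 11, 3), (2, 10, 5), (2, 12, 2), (3, 11, 1)]
probe_set = [(1, 11, 3), (2, 12, 2)]

# A set of (user, movie) pairs gives constant-time membership tests.
probe_pairs = {(user, movie) for user, movie, _ in probe_set}

# Keep every training rating whose (user, movie) pair is not in the probe set.
reduced_training_set = [
    (user, movie, rating)
    for user, movie, rating in training_set
    if (user, movie) not in probe_pairs
]

print(reduced_training_set)
```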

Step 2: train your different models on the reduced training set

For now we train each individual model on the reduced training set. Later we will re-train all the models on the full training set. To re-train in a consistent way, it is critical to record carefully at this step all the parameters used, the number of training epochs, etc.

Step 3: predict the probe set

For each model trained on the reduced training set, predict the probe set.

Step 4: select your favorite blending recipe

This step receives as input the predicted probe set results from each model, and the real probe set ratings. The output is a function that mixes the individual model predictions into a blended prediction, hopefully better than any individual result. A simple linear regression will get you a long way, but feel free to experiment with your favorite machine learning algorithm. What is key here, is that any unknown coefficient (for example the linear regression coefficients) can be selected to minimize the error between the blended prediction and the real probe set scores.

N.B. If overfitting the blending function is an issue, partition the probe set in two. Use one part for training the function, and the other for cross-validation.
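Here is a minimal sketch of the simple linear regression option, in plain Python. It fits an intercept plus one weight per model by least squares (normal equations solved by Gaussian elimination). The probe ratings and the two models' predictions are invented toy numbers, and the function names are made up for illustration. A numerical library would do the same job in one call.

```python
def solve(a, b):
    """Solve the linear system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_blend(model_predictions, targets):
    """Least-squares blending weights: intercept first, then one per model."""
    rows = [[1.0] + list(p) for p in zip(*model_predictions)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)


def blend(weights, model_predictions):
    """Apply the blending weights to the per-model predictions."""
    return [weights[0] + sum(w * p for w, p in zip(weights[1:], preds))
            for preds in zip(*model_predictions)]


def rmse(predictions, targets):
    """Root mean squared error."""
    total = sum((p - t) ** 2 for p, t in zip(predictions, targets))
    return (total / len(targets)) ** 0.5


# Invented toy data: real probe ratings and two models' probe predictions.
probe_ratings = [4, 3, 5, 1, 2, 4, 3, 5]
model_a = [3.8, 3.4, 4.4, 1.9, 2.6, 3.5, 3.1, 4.6]
model_b = [4.3, 2.7, 4.7, 1.2, 1.7, 4.4, 3.3, 4.2]

weights = fit_blend([model_a, model_b], probe_ratings)
blended = blend(weights, [model_a, model_b])

print("model A RMSE:", rmse(model_a, probe_ratings))
print("model B RMSE:", rmse(model_b, probe_ratings))
print("blend RMSE:  ", rmse(blended, probe_ratings))
```

To follow the N.B. above, call fit_blend on one half of the probe set and rmse on the other half. On the data used for fitting, the blend can never look worse than its best input, so only the held-out half tells you whether it generalizes.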

Step 5: re-train all models using the full training set

At this point, we are preparing for our final predictions. To get the best possible accuracy, we re-train all models using the full training set. The addition of the probe set data in the training data can result in an accuracy boost of 0.0060 or more.

Step 6: blend the re-trained models

Here's the leap of faith. We assume that the blending function we computed at step 4 is still a valid function to blend the re-trained models. For this to work, the two sets of models must be computed with a rigorously equivalent methodology. Also, the selected function from step 4 must be a valid generalization and avoid overfitting. This is not an issue with a simple linear regression, but may become problematic for complex machine learning methods with many degrees of freedom.
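For a linear blend, step 6 amounts to re-using the step 4 weights unchanged on the re-trained models' predictions. The sketch below keys the weights by model name so that a mismatch between the two sets of models is caught early. The model names, weights and predictions are all hypothetical.

```python
# Hypothetical weights computed at step 4 (intercept plus one per model).
blend_weights = {"intercept": 0.1, "svd": 0.6, "knn": 0.35}

# Hypothetical qualifying-set predictions from models re-trained at step 5.
retrained_predictions = {
    "svd": [3.9, 2.1, 4.6],
    "knn": [4.1, 1.8, 4.4],
}


def apply_blend(weights, predictions_by_model):
    """Blend re-trained predictions with weights learned on the probe set."""
    model_names = [name for name in weights if name != "intercept"]
    # The blend is only meaningful if both steps used the same set of models.
    if set(model_names) != set(predictions_by_model):
        raise ValueError("blend weights and re-trained models do not match")
    columns = [predictions_by_model[name] for name in model_names]
    return [
        weights["intercept"]
        + sum(weights[name] * p for name, p in zip(model_names, row))
        for row in zip(*columns)
    ]


final_predictions = apply_blend(blend_weights, retrained_predictions)
print(final_predictions)
```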

Step 7: clamp the final predictions

Here's a hint: clamping values between 1 and 5 is not optimal.
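The hint does not say which bounds work better, and this sketch does not claim to know either. It only shows one way to test candidate bounds: clamp the blended probe predictions with each candidate pair and keep the pair with the lowest probe RMSE. The predictions, ratings and candidate values are invented.

```python
def clamp(value, low, high):
    """Limit value to the interval [low, high]."""
    return max(low, min(high, value))


def rmse(predictions, targets):
    """Root mean squared error."""
    total = sum((p - t) ** 2 for p, t in zip(predictions, targets))
    return (total / len(targets)) ** 0.5


def best_bounds(predictions, targets, lows, highs):
    """Return the (low, high) candidate pair with the lowest clamped RMSE."""
    candidates = [(low, high) for low in lows for high in highs]
    return min(
        candidates,
        key=lambda b: rmse([clamp(p, b[0], b[1]) for p in predictions], targets),
    )


# Invented toy data: blended probe predictions and the real probe ratings.
probe_predictions = [0.4, 1.3, 5.6, 4.8, 3.2]
probe_ratings = [1, 2, 5, 5, 3]

# Hypothetical candidate bounds; (1.0, 5.0) is included as the baseline.
low, high = best_bounds(probe_predictions, probe_ratings,
                        lows=[1.0, 1.1, 1.2], highs=[4.8, 4.9, 5.0])
print("selected clamping bounds:", low, high)
```

Because the plain 1-to-5 clamp is one of the candidates, the selected pair can never score worse than it on the probe data. Whether the gain carries over to the qualifying set is a separate question.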

If this is well done, then improvements in the models can be measured after step 4 by comparing the accuracy on the probe set. Values on the qualifying set will be better by 0.0060 or more, but this offset should be very consistent from one submission to another. Lately I have been getting 0.0068 +/- 0.0001.

A little humor to start things off...

2008-07-22T13:34:00.000-04:00

Here is the text from the web page we initially wanted to put up:

PragmaticTheory : Solving the netflix challenge through divination...

Following the concepts of numerology, team PragmaticTheory was formed on July 7th, 2007 (7/7/7) and started working on the netflix challenge 7 months, 7 weeks and 7 days later. The team consists of 2 human beings with respectively 2 eyes, 2 arms, 2 legs and, most importantly, 2 powerfully energized chakras.

Our strategy is not to use un-proven techniques such as mathematics, matrices and algorithms. Instead, we are tapping into the universe's hidden powers to uncover users' force fields in order to forecast their ratings with high accuracy. Skeptics will most likely dismiss our techniques as mere superstition, but we think that the progress that we've made so far on the netflix leaderboard speaks for itself.

Here are some more details on the divination methods that our team uses:

- Sagittarius Virgo Decomposition (SVD) : The astrological signs of the users in the dataset are captured through astral vibration sensors and cross-referenced with star maps and tide charts to predict future movie ratings.

- Asymmetrical Tea-Leaf Models : An array of tea kettles and cups were used to generate 50 million tea-leaf patterns which were individually digitally photographed. Pattern recognition software was implemented to detect asymmetries and assign weights accordingly.

- Red-King Black-Queen Machine (RBM) : Tarot spreads were simulated for each user in the dataset and their fate interpreted through an automated karma analyser. Sadly, thousands of users were declared dead at the time of their predicted rating, yielding major sparseness issues...