Saturday, July 26, 2008

Blending 101

(a.k.a. why one submission a day is enough)

Serious attempts at the Netflix challenge require blending results from many algorithms. Blending is an operation that transforms multiple estimates into a single, higher-accuracy estimate. This is a brief tutorial on the steps involved. Experienced Netflix participants should not bother to read further.

Step 1: construct a reduced training set

To blend the models, you need to construct a reduced training set by excluding from the Netflix-provided training set all ratings present in the probe set.
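
A minimal sketch of this filtering in Python, assuming the training ratings and probe keys are already loaded in memory (the variable names and data layout are illustrative, not part of the Netflix tooling):

    # Build the reduced training set: every training rating except those
    # whose (user, movie) pair appears in the probe set.
    def reduced_training_set(training, probe_pairs):
        # training: iterable of (user_id, movie_id, rating) triples
        # probe_pairs: iterable of (user_id, movie_id) pairs to exclude
        excluded = set(probe_pairs)
        return [(u, m, r) for (u, m, r) in training if (u, m) not in excluded]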

Step 2: train your different models on the reduced training set

For now, we train each individual model on the reduced training set. Later, we will re-train all the models on the full training set. To re-train in a consistent way, it is critical to carefully record at this step all the parameters used, the number of training epochs, etc.
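
One way to do this bookkeeping, sketched here with made-up model names and parameter values (nothing below is a recommendation, just an illustration of recording everything):

    import json

    # Record every knob used for each model so that step 5 can re-train
    # on the full training set with an identical methodology.
    model_params = {
        "svd_64": {"factors": 64, "learning_rate": 0.001,
                   "regularization": 0.02, "epochs": 120},
        "knn_movie": {"neighbors": 30, "shrinkage": 100},
    }

    with open("model_params.json", "w") as f:
        json.dump(model_params, f, indent=2)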

Step 3: predict the probe set

For each model trained on the reduced training set, predict the probe set.

Step 4: select your favorite blending recipe

This step receives as input the predicted probe set results from each model and the real probe set ratings. The output is a function that mixes the individual model predictions into a blended prediction, hopefully better than any individual result. A simple linear regression will get you a long way, but feel free to experiment with your favorite machine learning algorithm. What is key here is that any unknown coefficients (for example, the linear regression coefficients) can be selected to minimize the error between the blended prediction and the real probe set scores.
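
As a sketch, here is the linear regression variant with NumPy. It assumes P is an array with one column of probe predictions per model and y is the vector of real probe ratings (an intercept column can be appended to P if you want a constant term):

    import numpy as np

    # Fit one blending weight per model by ordinary least squares,
    # minimizing the error against the real probe ratings.
    def fit_blend(P, y):
        weights, *_ = np.linalg.lstsq(P, y, rcond=None)
        return weights

    def blend(P, weights):
        return P @ weights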

N.B. If overfitting the blending function is an issue, partition the probe set in two. Use one part for training the function, and the other for cross-validation, as in the sketch below.
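
Continuing the sketch above (reusing fit_blend, blend, P and y), a simple random split of the probe rows, with RMSE on the held-out half as the cross-validation check; the 50/50 split is an arbitrary choice:

    import numpy as np

    def rmse(predicted, actual):
        return np.sqrt(np.mean((predicted - actual) ** 2))

    # Shuffle the probe rows, fit the blend on one half,
    # and measure it on the other.
    rng = np.random.default_rng(0)
    idx = rng.permutation(len(y))
    half = len(idx) // 2
    fit_idx, val_idx = idx[:half], idx[half:]

    weights = fit_blend(P[fit_idx], y[fit_idx])
    print(rmse(blend(P[val_idx], weights), y[val_idx]))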

Step 5: re-train all models using the full training set

At this point, we are preparing for our final predictions. To get the best possible accuracy, we re-train all models using the full training set. Adding the probe set data to the training data can result in an accuracy boost of 0.0060 or more.

Step 6: blend the re-trained models

Here's the leap of faith. We assume that the blending function we computed at step 4 is still a valid function to blend the re-trained models. For this to work, the two sets of models must be computed with a rigorously equivalent methodology. Also, the function selected at step 4 must generalize well and avoid overfitting. This is not an issue with a simple linear regression, but it may become problematic for complex machine learning methods with many degrees of freedom.
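
In code, this step is a one-liner: assuming Q holds the qualifying-set predictions of the re-trained models, with columns in the same model order used when fitting the weights on the probe set, the step 4 weights are reused unchanged:

    # Q: one column of qualifying-set predictions per re-trained model,
    # in the same column order as P from step 4.
    blended_qualifying = blend(Q, weights)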

Step 7: clamp the final predictions

Here's a hint: clamping values between 1 and 5 is not optimal.
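
A plain clamp looks like the sketch below; per the hint, treat the 1 and 5 bounds as defaults to experiment with rather than the final answer:

    import numpy as np

    # Clamp the blended predictions to the rating scale. The optimal bounds
    # are left as an exercise; they need not be exactly 1.0 and 5.0.
    def clamp(predictions, lo=1.0, hi=5.0):
        return np.clip(predictions, lo, hi)

    final_predictions = clamp(blended_qualifying)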

If all this is done well, then improvements in the models can be measured after step 4 by comparing the accuracy on the probe set. Values on the qualifying set will be better by 0.0060 or more, but this offset should be very consistent from one submission to another. Lately, I have been getting 0.0068 +/- 0.0001.

Tuesday, July 22, 2008

A little humor to start things off...

Here is the text from the web page we initially wanted to put up:

PragmaticTheory : Solving the netflix challenge through divination...

Following the concepts of numerology, team PragmaticTheory was formed on July 7th, 2007 (7/7/7) and started working on the netflix challenge 7 months, 7 weeks and 7 days later. The team consists of 2 human beings with respectively 2 eyes, 2 arms, 2 legs and, most importantly, 2 powerfully energized chakras.

Our strategy is not to use unproven techniques such as mathematics, matrices and algorithms. Instead, we are tapping into the universe's hidden powers to uncover users' force fields in order to forecast their ratings with high accuracy. Skeptics will most likely dismiss our techniques as mere superstition, but we think that the progress we've made so far on the netflix leaderboard speaks for itself.

Here are some more details on the divination methods that our team uses:

- Sagittarius Virgo Decomposition (SVD): The astrological signs of the users in the dataset are captured through astral vibration sensors and cross-referenced with star maps and tide charts to predict future movie ratings.

- Asymmetrical Tea-Leaf Models: An array of tea kettles and cups was used to generate 50 million tea-leaf patterns, which were individually digitally photographed. Pattern recognition software was implemented to detect asymmetries and assign weights accordingly.

- Red-King Black-Queen Machine (RBM): Tarot spreads were simulated for each user in the dataset and their fate interpreted through an automated karma analyser. Sadly, thousands of users were declared dead at the time of their predicted rating, yielding major sparseness issues...