Tuesday, October 14, 2008

There is evil there that does not sleep; the Great Eye is ever watchful.

Singular Value Decomposition. SVD. This is one of the most talked-about and best-documented algorithms used in the Netflix challenge. It is one of great simplicity... but also of great power.

Applied to the Netflix data in its most basic form, the SVD is a method which automatically assigns a number of factors to each movie and the corresponding factors to each user. Movie factors basically represent aspects of a movie which have influenced user ratings in the sample set. User factors represent how much each user is influenced by those specific aspects of the movies. A predicted rating is then just the sum, over all factors, of the user's factor value times the movie's factor value. And the magic comes from the fact that, by optimizing on the training data set, the aspects that most influence users are discovered automatically.
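To make the idea concrete, here is a minimal sketch of how such factors could be learned. The post doesn't spell out how we fit them, so treat everything here as an assumption: plain stochastic gradient descent with a small regularization term, a predicted rating equal to the dot product of the user and movie factor vectors, and made-up names (`train_factors`, `rmse`), hyperparameters and toy ratings.

```python
import numpy as np

def train_factors(ratings, n_users, n_movies, n_factors=8,
                  lr=0.01, reg=0.02, epochs=200, seed=0):
    """Learn user and movie factors from (user, movie, rating) triples.

    Hypothetical helper: the learning rate, regularization and epoch
    count are illustrative guesses, not tuned values.
    """
    rng = np.random.default_rng(seed)
    # Start from small random values so the factors can differentiate.
    user_f = rng.normal(0.0, 0.1, (n_users, n_factors))
    movie_f = rng.normal(0.0, 0.1, (n_movies, n_factors))
    for _ in range(epochs):
        for u, m, r in ratings:
            # Prediction error for this single known rating.
            err = r - user_f[u] @ movie_f[m]
            u_old = user_f[u].copy()
            # Nudge both factor vectors to shrink the error.
            user_f[u] += lr * (err * movie_f[m] - reg * u_old)
            movie_f[m] += lr * (err * u_old - reg * movie_f[m])
    return user_f, movie_f

def rmse(ratings, user_f, movie_f):
    """Root mean squared error of the predictions on the given ratings."""
    errs = [r - user_f[u] @ movie_f[m] for u, m, r in ratings]
    return float(np.sqrt(np.mean(np.square(errs))))

if __name__ == "__main__":
    # Made-up toy data: 4 users, 3 movies, ratings on a 1-5 scale.
    toy = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 2, 1),
           (2, 1, 2), (2, 2, 5), (3, 0, 1), (3, 2, 4)]
    users, movies = train_factors(toy, n_users=4, n_movies=3, n_factors=2)
    print("training RMSE:", rmse(toy, users, movies))
```

Nobody tells the algorithm what any factor means. It only ever sees the error on known ratings, and whatever structure best explains those ratings ends up in the factors.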

This algorithm is not only very good at predicting future user ratings, it also gets very interesting when you analyse its results. One way to look at the SVD results is to build movie lists by sorting them along the different factors and then taking the extremities (top and bottom movies for each factor). To support this blog, we ran an SVD with 8 factors and published such movie lists.
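Building those lists is the easy part once the factors exist. Here is a rough sketch, assuming the movie factors sit in a NumPy array with one row per movie and one column per factor. The function name `factor_extremes`, the placeholder titles and the numbers are all invented for illustration, and the columns are 0-indexed, so "factor 1" in this post would be column 0.

```python
import numpy as np

def factor_extremes(movie_factors, titles, factor, k=3):
    """Return the (top, bottom) k titles along one factor column."""
    # Indices of movies sorted from lowest to highest factor value.
    order = np.argsort(movie_factors[:, factor])
    bottom = [titles[i] for i in order[:k]]
    top = [titles[i] for i in order[::-1][:k]]
    return top, bottom

if __name__ == "__main__":
    # Invented values: 4 placeholder movies, 2 factors.
    movie_factors = np.array([[0.9, -0.2],
                              [-0.5, 0.1],
                              [0.1, 0.8],
                              [-0.8, -0.7]])
    titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
    top, bottom = factor_extremes(movie_factors, titles, factor=0, k=2)
    print("top:", top)
    print("bottom:", bottom)
```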

Categorizing movies this way can be fun. Seeing some of my favorite movies, Fight Club, Seven, American Beauty, Memento and Jackass (don't judge me for liking to watch idiots hurt themselves...), bunched in a specific category (factor 3) is pretty cool. But for me, this analysis gets interesting when you think about what it can tell you about users. I'm sure people don't realize, when they rate movies like this, that they're actually giving the site a lot of information about themselves (and not just their taste in movies).

For example, if someone has a very high value for factor 1, I would bet a lot of money that they wear skirts and makeup (I would've said that they were women, but I didn't want to offend anyone). Also, I'm pretty sure that churches, NRA meetings and Republican conventions are littered with low ranking factor 3s (my arch nemesis) and low ranking factor 6s. Conversely, I'm sure the Democrats would find some supporters in low ranking factor 8s. This analysis is somewhat naive and simplistic, but with some additional work, I'm sure sex, age, race, income, etc. could be inferred with fairly high accuracy, simply by analysing people's movie ratings.

So next time you're registering on a web site and think you're going undercover by not filling out the demographic information, think again... the Great Eye is ever watchful.

Perhaps this is why all these ad-placement companies keep sending us job offers.