Tuesday, October 14, 2008

There is evil there that does not sleep; the Great Eye is ever watchful.

Singular Value Decomposition. SVD. This is one of the most talked-about and best-documented algorithms used in the Netflix challenge. It is one of great simplicity... but also of great power.

Applied to the Netflix data in its most basic form, the SVD is a method which automatically assigns a number of factors to each movie and the corresponding factors to each user. Movie factors basically represent aspects of a movie which have influenced user ratings in the sample set. User factors represent how much each user is influenced by those specific aspects of the movies. And the magic comes from the fact that, by optimizing on the training data set, the aspects that most influence users are discovered automatically.
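To make the idea concrete, here is a minimal sketch in Python. It assumes the approach commonly described for the Netflix Prize, where each rating is predicted as a global mean plus the dot product of a user's factors and a movie's factors, and the factors are fitted by stochastic gradient descent on the known ratings. The function names (`train_factors`, `rmse`), the hyperparameter values and the tiny data set are all made up for illustration; this is not necessarily the exact method behind the results in this post.

```python
import numpy as np

def train_factors(ratings, n_users, n_movies, n_factors=2,
                  lr=0.02, reg=0.02, n_epochs=300, seed=0):
    """Fit user and movie factors by stochastic gradient descent.

    ratings: list of (user_index, movie_index, rating) triples.
    All hyperparameter defaults are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    # Small random starting values; the optimization shapes them into factors.
    user_f = rng.normal(0.0, 0.1, size=(n_users, n_factors))
    movie_f = rng.normal(0.0, 0.1, size=(n_movies, n_factors))
    mean = float(np.mean([r for _, _, r in ratings]))

    for _ in range(n_epochs):
        for u, m, r in ratings:
            # Prediction: global mean plus user factors dotted with movie factors.
            err = r - (mean + user_f[u] @ movie_f[m])
            u_old = user_f[u].copy()
            # Nudge both factor vectors to reduce the error, with a small
            # regularization term pulling them back toward zero.
            user_f[u] += lr * (err * movie_f[m] - reg * user_f[u])
            movie_f[m] += lr * (err * u_old - reg * movie_f[m])
    return mean, user_f, movie_f

def rmse(ratings, mean, user_f, movie_f):
    """Root mean squared error of the predictions on the given ratings."""
    errs = [r - (mean + user_f[u] @ movie_f[m]) for u, m, r in ratings]
    return float(np.sqrt(np.mean(np.square(errs))))

# Hypothetical toy data: 4 users, 3 movies, ratings from 1 to 5.
toy_ratings = [(0, 0, 5), (0, 1, 4), (0, 2, 1),
               (1, 0, 4), (1, 2, 1),
               (2, 1, 1), (2, 2, 5),
               (3, 0, 1), (3, 2, 4)]

mean, user_f, movie_f = train_factors(toy_ratings, n_users=4, n_movies=3)
baseline_rmse = rmse(toy_ratings, mean,
                     np.zeros_like(user_f), np.zeros_like(movie_f))
trained_rmse = rmse(toy_ratings, mean, user_f, movie_f)
print("RMSE predicting the mean only:", baseline_rmse)
print("RMSE with learned factors:   ", trained_rmse)
```

Nobody tells the code what the factors mean. Each row of `movie_f` and `user_f` is whatever set of numbers best explains the training ratings, which is the "discovered automatically" part.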

This algorithm is not only very good at predicting future user ratings, it also gets very interesting when you analyse its results. One way to look at the SVD results is to build movie lists by sorting them along the different factors and then taking the extremities (top and bottom movies for each factor). To support this blog, we ran an SVD with 8 factors and published such movie lists.
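Building such lists takes very little code. The sketch below assumes the movie factors are already available as a NumPy array with one row per movie and one column per factor. The function name `factor_extremes`, the placeholder titles and the factor values are hypothetical, and the factor index is 0-based here, so "factor 1" in this post would be `factor=0`.

```python
import numpy as np

def factor_extremes(movie_factors, titles, factor, k=2):
    """Return the k highest and k lowest movies along one factor."""
    # Indices of movies sorted from lowest to highest value on this factor.
    order = np.argsort(movie_factors[:, factor])
    bottom = [titles[i] for i in order[:k]]
    top = [titles[i] for i in order[::-1][:k]]
    return top, bottom

# Placeholder titles and made-up factor values, for illustration only.
titles = ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"]
movie_factors = np.array([[ 0.9, -0.2],
                          [-0.7,  0.1],
                          [ 0.3,  0.8],
                          [-0.1, -0.9],
                          [ 0.5,  0.4]])

top, bottom = factor_extremes(movie_factors, titles, factor=0, k=2)
print("Top of factor:   ", top)
print("Bottom of factor:", bottom)
```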

Categorizing movies this way can be fun. Seeing some of my favorite movies, Fight Club, Seven, American Beauty, Memento and Jackass (don't judge me for liking to watch idiots hurt themselves...), bunched in a specific category (factor 3) is pretty cool. But for me, this analysis gets interesting when you think about what it can tell you about users. I'm sure people don't realize, when they rate movies like this, that they're actually giving the site a lot of information about themselves (and not just their taste in movies).

For example, if someone has a very high value for factor 1, I would bet a lot of money that they wear skirts and makeup (I would've said that they were women, but I didn't want to offend anyone). Also, I'm pretty sure that churches, NRA meetings and Republican conventions are littered with low-ranking factor 3s (my arch nemesis) and low-ranking factor 6s. Conversely, I'm sure the Democrats would find some supporters in low-ranking factor 8s. This analysis is somewhat naive and simplistic, but with some additional work, I'm sure sex, age, race, income, etc. could be inferred with fairly high accuracy, simply by analysing people's movie ratings.

So next time you're registering on a web site and think you're going under-cover by not filling out the demographic information, think again... the Great Eye is ever watchful.

Perhaps this is why all these ad-placement companies keep sending us job offers.

8 comments:

Anonymous said...

Hello,

Where can I find the source code for SVD which incorporates the Netflix Prize Data?

And what RMSE does an SVD-only algorithm achieve on the leaderboard?

Thanks.

Anonymous said...

Edit: I am looking for the SVD source code written in Python.

mj said...

I find this post terribly fascinating!
I'm not familiar with the algorithms you mention or how they work. Are these 8 factors exhaustive or are there other factors you've isolated? Have you given tentative names to these factors?
1 - Comforting vs Bleak
2 - Playful vs Somber
5 - Imaginary vs Reality
8 - Children vs Adult themes
etc
Very interested in seeing where this goes. Good luck!

Anonymous said...

Are there any SVD implementations you recommend, no matter what language or platform?

miked98 said...

You write: "I'm sure sex, age, race, income, etc. could be inferred with fairly high accuracy, simply by analysing people's movie ratings."

Your comment would imply that if someone's movie ratings predict their sex, that sex might in turn predict movie ratings.

But based on what many other Netflix Prize competitors have said, this isn't the case. As Clive Thompson wrote in his recent NYT Magazine article, "the fact that I’m a 40-year-old West Village resident is not very predictive."

Would you agree?

Anonymous said...

What about household accounts? Individual preferences vary amongst users of a particular account, right?

PragmaticTheory said...

mj, no, this is not an exhaustive list. This algorithm can be run with varying numbers of factors and a variety of options. Some lists are interesting, like the one presented here (and can yield some good interpretations, as yours are). Other lists are more obscure, as the computer is capturing things in a way that is not understandable on a human level.

miked98, I don't agree with your reasoning here. Clive mentions that demographics would not help predict movie ratings and he's right. What I'm saying is that movie ratings can help predict demographics.

For example, not ALL women love Sleepless in Seattle. So knowing that you're a woman could not help me decide if you gave Sleepless in Seattle 1 star or 5 stars. On the other hand, most people who love Sleepless in Seattle are women, so knowing that you gave this movie a 5 has me well on my way to guessing that you're a woman.

Of course, this is not an exact science and would most likely not work 100% of the time. But I figure the more movies you rate, the more Netflix (or some other site) knows about you, and that's what my post is about.
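The asymmetry in the Sleepless in Seattle example can be shown with a few lines of arithmetic. The counts below are completely made up for illustration (they are not Netflix data), and the function name `conditional_rates` is hypothetical. The point is only that "share of women who loved the movie" and "share of the movie's fans who are women" are two different quantities and can be far apart.

```python
def conditional_rates(counts):
    """counts maps (gender, loved_movie) to a number of users (hypothetical)."""
    women_loved = counts[("woman", True)]
    women_not = counts[("woman", False)]
    men_loved = counts[("man", True)]

    # Of all women, what share gave the movie 5 stars?
    p_loved_given_woman = women_loved / (women_loved + women_not)
    # Of everyone who gave the movie 5 stars, what share are women?
    p_woman_given_loved = women_loved / (women_loved + men_loved)
    return p_loved_given_woman, p_woman_given_loved

# Made-up counts: 1000 women and 1000 men rated the movie.
toy_counts = {
    ("woman", True): 300, ("woman", False): 700,
    ("man", True): 50, ("man", False): 950,
}

p_loved_given_woman, p_woman_given_loved = conditional_rates(toy_counts)
print("P(loved | woman):", p_loved_given_woman)
print("P(woman | loved):", p_woman_given_loved)
```

With these invented numbers, knowing that someone is a woman says little about her rating, while knowing that someone gave the movie a 5 points strongly toward that person being a woman.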

Anonymous(3), yes, you're right, household accounts would have some strange ratings, as preferences vary within the same account (we've noticed that in some of our analysis). With a higher number of factors, the SVD can capture some more complex tastes and should be able to resolve the "household account" issue. Unfortunately, with a higher number of factors, the movie lists become incomprehensible and so are not much fun to analyse.

Alex said...

Are your algorithms trying to treat household accounts as though they were a single person?