Sunday, August 3, 2008

You want the truth, you can't HANDLE the truth!

Movie data. Most of the top teams competing in the Netflix challenge must have had to answer a lot of questions about movie data. Here is an actual conversation I had with a friend a couple of weeks ago (note that I had had a few beers at the time, so don't go to court with any of these quotes).

"[Lay Person - Not the actual name of the person]: Hey, so I heard you're competing in that netflix challenge thing. Pretty cool.
[Pragmatic Theory]: Yeah.
[LP]: So what kind of data do you get?
[PT]: Movie titles and years, user identification number, rating and date of rating.
[LP]: That's it? No information on movie genre or anything?
[PT]: Nope.
[LP]: That's strange... isn't it really important to predict user ratings?
[PT]: (*having a sip, knowing where this is going*)
[LP]: Hey! I have an idea! Did you guys think of mining this information on IMDB or something?
[PT]: Well, actually, external movie data is not useful. The algorithms find the proper classifications automatically.
[LP]: (*pause*) Huh?
[PT]: For example, the movie genre on a site like Netflix, Amazon or IMDB is the opinion of one person on how movies should be categorized. The algorithms actually find categories that indicate how movies influence all users.
[LP]: (*dumb look* - LP has also had a few drinks) But wouldn't your algorithms just be better with more data?
[PT]: Believe me, movie data is really not useful.
[LP]: (*looking unconvinced*) OK... you're sure?... huh.... Really?
[PT]: (*having a bigger sip*)
[LP]: OK then... What about user information? Do you have any of that? I'm sure if you knew users' sex, age group and such, that would help to make predictions... and I'm sure Netflix asks that when you register... men and women don't have the same tastes in movies, that's for sure (*uncomfortable laugh*)
[PT]: Nope, no user information either. And that wouldn't be useful anyway. The algorithms actually find these types of user classifications automatically too...
[LP]: (*dumbstruck*) Wha?
[PT]: Movie or user data is just not helpful because the different algorithms are just too good at capturing the details and nuances that influence user ratings... Believe me, we tried!
[LP]: (*stares in disbelief and walks away thinking that I don't understand this problem and that he would do better...*)
[PT]: (*chugging the rest of my beer*)"

A couple of months ago, that could have been me arguing with someone about the usefulness of external movie data. Team PragmaticTheory was actually founded on the belief that we could do better than other teams because we did not share the preconceived notion that movie data was useless. We would implement all the machine learning algorithms, then add some data from various sources... and we would surely beat out the top teams and win the million... Boy, were we wrong!

One of the first things I did on this project was to mine a couple of sites (talk to my lawyers to find out how and which ones) to see if we could get good coverage on the movies in the dataset. I actually did pretty well and we got a good set of movie data to play with. This data was actually useful in the first few weeks. The models using it did better than some of our early pure machine learning algorithms. Unfortunately, as soon as we started implementing some of the more common, documented algorithms, the movie-data-based models got pruned out of the mix. We tried to get a bit fancier and build some more complex algorithms around the movie data. Still, the pure machine learning ones are systematically better.

Why? Well, my interpretation is that movie data is just too black and white. User tastes are infinite shades of grey (think floating point shades of grey). It's not true that someone likes all sci-fi movies. And no one can enjoy all the Tom Hanks movies equally. But the algorithms can figure out the subtle nuances that define user rating patterns. They can figure out that you really like sci-fi comedies that have a happy ending, but that you enjoy the sci-fi/horror genre, where one of the main characters dies, a bit less. They can also figure out that you're a huge fan of Tom Hanks, but that you hate sappy girly flicks... so even if your favorite man is there, there's no saving Sleepless In Seattle and You've Got Mail from being sent to the junk pile.
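To make the "floating point shades of grey" idea a bit more concrete, here is a minimal sketch of one common family of rating models: a latent-factor (matrix factorization) model trained by stochastic gradient descent. This is an illustration only, not a description of the models our team actually runs. The toy ratings, the function names (`train_factors`, `predict`, `rmse`) and every parameter value are assumptions made up for the example. The point is that the model is given only (user, movie, rating) triples, with no genre and no actor, and still learns real-valued "taste" numbers for each user and each movie.

```python
import random

def predict(mean, P, Q, u, m):
    """Predicted rating: global mean plus the dot product of the
    user's and the movie's learned factor vectors."""
    return mean + sum(p * q for p, q in zip(P[u], Q[m]))

def train_factors(ratings, n_users, n_movies, n_factors=2,
                  lr=0.02, reg=0.02, epochs=500, seed=0):
    """Fit user factors P and movie factors Q by plain SGD.
    All hyperparameter values here are illustrative guesses."""
    rng = random.Random(seed)
    mean = sum(r for _, _, r in ratings) / len(ratings)
    # Small random starting values; the factors have no preset meaning.
    P = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)]
         for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)]
         for _ in range(n_movies)]
    for _ in range(epochs):
        for u, m, r in ratings:
            err = r - predict(mean, P, Q, u, m)
            for k in range(n_factors):
                pu, qm = P[u][k], Q[m][k]
                # Gradient step on squared error with L2 regularization.
                P[u][k] += lr * (err * qm - reg * pu)
                Q[m][k] += lr * (err * pu - reg * qm)
    return mean, P, Q

def rmse(ratings, mean, P, Q):
    """Root mean squared error over a list of (user, movie, rating)."""
    total = sum((r - predict(mean, P, Q, u, m)) ** 2 for u, m, r in ratings)
    return (total / len(ratings)) ** 0.5

# Made-up toy data: users 0-1 love movies 0-1 and dislike movies 2-3;
# users 2-3 are the opposite. Two ratings, (0, 3) and (2, 2), are held out.
ratings = [
    (0, 0, 5), (0, 1, 5), (0, 2, 1),
    (1, 0, 5), (1, 1, 5), (1, 2, 1), (1, 3, 1),
    (2, 0, 1), (2, 1, 1), (2, 3, 5),
    (3, 0, 1), (3, 1, 1), (3, 2, 5), (3, 3, 5),
]

mean, P, Q = train_factors(ratings, n_users=4, n_movies=4)
print("training RMSE:", rmse(ratings, mean, P, Q))
print("held-out (user 0, movie 3):", predict(mean, P, Q, 0, 3))
print("held-out (user 2, movie 2):", predict(mean, P, Q, 2, 2))
```

Nobody tells the model which movies belong together. The grouping emerges in the learned numbers, and because those numbers are continuous, a movie can sit anywhere between two "genres" instead of being filed under exactly one.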

My explanation is a bit simplistic, but honestly, to anyone out there who still believes that extra movie data may be useful to predict user ratings, I say that you have to have faith in the machine. It's just smarter than we are.