Sunday, August 3, 2008

You want the truth, you can't HANDLE the truth!

Movie data. Most of the top teams competing in the netflix challenge must have had to answer a lot of questions about movie data. Here is an actual conversation I had with a friend a couple of weeks ago (Note that I had had a few beers at the time, so don't go to court with any of these quotes).

"[Lay Person - Not the actual name of the person]: Hey, so I heard you're competing in that netflix challenge thing. Pretty cool.
[Pragmatic Theory]: Yeah.
[LP]: So what kind of data do you get?
[PT]: Movie titles and years, user identification number, rating and date of rating.
[LP]: That's it? No information on movie genre or anything?
[PT]: Nope.
[LP]: That's strange... isn't it really important to predict user ratings?
[PT]: (*having a sip, knowing where this is going*)
[LP]: Hey! I have an idea! Did you guys think of mining this information on IMDB or something?
[PT]: Well, actually, external movie data is not useful. The algorithms find the proper classifications automatically.
[LP]: (*pause*) Huh?
[PT]: For example, the movie genre on a site like Netflix, Amazon or IMDB is the opinion of one person on how movies should be categorized. The algorithms actually find categories that indicate how movies influence all users.
[LP]: (*dumb look* - LP has also had a few drinks) But wouldn't your algorithms just be better with more data?
[PT]: Believe me, movie data is really not useful.
[LP]: (*looking unconvinced*) OK... you're sure?... huh.... Really?
[PT]: (*having a bigger sip*)
[LP]: OK then... What about user information? Do you have any of that? I'm sure if you knew users' sex, age group and such, that would help to make predictions... and I'm sure netflix asks that when you register... men and women don't have the same tastes in movies, that's for sure (*uncomfortable laugh*)
[PT]: Nope, no user information either. And that wouldn't be useful anyway. The algorithms actually find these types of user classifications automatically too...
[LP]: (*dumbstruck*) Wha?
[PT]: Movie or user data is just not helpful because the different algorithms are just too good at capturing the details and nuances that influence user ratings... Believe me, we tried!
[LP]: (*stares in disbelief and walks away thinking that I don't understand this problem and that he would do better...*)
[PT]: (*chugging the rest of my beer*)"

A couple of months ago, that could have been me arguing with someone about the usefulness of external movie data. Team PragmaticTheory was actually founded with the belief that we could do better than other teams because we did not have this pre-conceived notion that movie data was useless. We would implement all the machine learning algorithms, then add some data from various sources... and we would surely beat out the top teams and win the million... Boy, were we wrong!

One of the first things I did on this project was to mine a couple of sites (talk to my lawyers to find out how and which ones) to see if we could get good coverage on the movies in the dataset. I actually did pretty well and we got a good set of movie data to play with. This data was actually useful in the first few weeks. The models using it did better than some of our early pure machine learning algorithms. Unfortunately, as soon as we started implementing some of the more common, documented algorithms, the movie-data-based models got pruned out of the mix. We tried to get a bit fancier and build some more complex algorithms around the movie data. Still, the pure machine learning ones are systematically better.

Why? Well, my interpretation is that movie data is just too black and white. User tastes are infinite shades of grey (think floating point shades of grey). It's not true that someone likes all sci-fi movies. And no one can enjoy all the Tom Hanks movies equally. But the algorithms can figure out the subtle nuances that define user rating patterns. They can figure out that you really like sci-fi comedies that have a happy ending, but that you enjoy the sci-fi/horror genre, where one of the main characters dies, a bit less. They can also figure out that you're a huge fan of Tom Hanks, but that you hate sappy girly flicks... so even if your favorite man is there, there's no saving Sleepless In Seattle and You've Got Mail from being sent to the junk pile.
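
To make the "floating point shades of grey" idea a bit more concrete, here is a minimal sketch of a latent factor model trained by stochastic gradient descent. It is not the team's actual code: the toy ratings, the number of factors and the learning parameters are all invented for illustration. The point is only that each user and each movie ends up described by a few real-valued numbers learned from the ratings alone, with no genre label anywhere in the input.

```python
import math
import random

# Toy data, invented for illustration: (user, movie, rating on a 1-5 scale).
# Users 0-1 lean toward movies 0-1, users 2-3 toward movies 2-3; movie 4 is mixed.
RATINGS = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1), (0, 4, 3),
    (1, 0, 4), (1, 1, 5), (1, 3, 2), (1, 4, 4),
    (2, 0, 1), (2, 2, 5), (2, 3, 4), (2, 4, 3),
    (3, 1, 2), (3, 2, 4), (3, 3, 5), (3, 4, 2),
]
N_USERS, N_MOVIES, N_FACTORS = 4, 5, 2  # assumed sizes for this toy example


def predict(mu, P, Q, u, m):
    """Global mean plus the dot product of user and movie factor vectors."""
    return mu + sum(pu * qm for pu, qm in zip(P[u], Q[m]))


def rmse(mu, P, Q, ratings):
    """Root mean squared error of the model on the given ratings."""
    se = sum((r - predict(mu, P, Q, u, m)) ** 2 for u, m, r in ratings)
    return math.sqrt(se / len(ratings))


def train(ratings, epochs=200, lr=0.05, reg=0.02, seed=0):
    """Fit the factors by SGD; hyperparameters are arbitrary choices."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    # Small random starting values; the factors carry no predefined meaning.
    P = [[rng.uniform(-0.1, 0.1) for _ in range(N_FACTORS)] for _ in range(N_USERS)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(N_FACTORS)] for _ in range(N_MOVIES)]
    for _ in range(epochs):
        for u, m, r in ratings:
            err = r - predict(mu, P, Q, u, m)
            for k in range(N_FACTORS):
                pu, qm = P[u][k], Q[m][k]
                P[u][k] += lr * (err * qm - reg * pu)
                Q[m][k] += lr * (err * pu - reg * qm)
    return mu, P, Q


# Error before any training (predictions are roughly the global mean).
initial_rmse = rmse(*train(RATINGS, epochs=0), RATINGS)

mu, P, Q = train(RATINGS)
final_rmse = rmse(mu, P, Q, RATINGS)

print(f"training RMSE: {initial_rmse:.3f} -> {final_rmse:.3f}")
print("learned movie factors:")
for m, factors in enumerate(Q):
    print(f"  movie {m}: " + ", ".join(f"{f:+.2f}" for f in factors))
```

The learned movie factors are plain floating point numbers rather than yes/no genre flags, so a movie can sit anywhere between two "categories", and a user's affinity for each one is just as gradual. On sixteen toy ratings this says nothing about real accuracy; it only illustrates the mechanism.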

My explanation is a bit simplistic, but honestly, to anyone out there that still has any doubts that extra movie data may be useful to predict user ratings, I say that you have to have faith in the machine. It's just smarter than we are.

19 comments:

Yehuda Koren said...

Hi,

I really liked your post!

We also tried incorporating an extensive set of movie attributes, but this was completely useless.

It is very likely that user-demographics will be more useful, as there are far fewer ratings per user, and also the test set is sampled uniformly across users, but dominated by the popular movies. (However, users are anonymized...).

Best,
Yehuda

My Brain and His Chain said...

Hah!

I've had several conversations with LP, including the one in your post. LP has insights like that a lot. Just last week he was pointing out to me how different people have different rating habits, like how some people give tons of 5 star ratings, whereas other people will give 3 stars even for movies they liked. LP says I should find a way to incorporate that into my algorithms.

LP's a good guy, and well-meaning... I also thought I understood the competition a lot better before I actually entered...

Grandpa said...

You've got mail is a firecracker of a motion picture and if your fancy transistors and computin machines can't figure that out, they ain't worth the vacuum tubes they're made of!

JS said...

Hello PT,

Does this mean that Anand Rajaraman doesn't know what he's talking about?

http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

Regards,
John S.

Eric said...

@JS

The first commenter on this post is currently #1 on the netflix prize leader board.

I suspect that he does know what he's talking about.

m_eiman said...

Well, I suppose that it shows that letting someone choose five or ten movies they like will tell you a great deal more about their movie taste than looking at their age, sex etc will ever do.

With a large sample set, looking at similarities in selection sets will be more useful than trying to attach metadata to the items in the sets. It's more interesting to know that there IS a strong correlation between items A, B and C than to know exactly WHY there is a correlation.

Since human psychology is as fickle and interdependent as it is, you'll need to add so much metadata for each movie (lots more than a list of actors and genre, e.g. a recession will probably affect the taste of movie watchers) that processing the extra data will take too much time for it to be useful.

Paul Dovydaitis said...

@JS, I remember reading that post a while back - but recall the bit:

"Of course, you have to be judicious in your choice of the data to add to your data set."

What PT is saying makes sense to me, since using a set of movie attributes from an external site assumes that every user agrees with the attributes applied to the movies they've rated. It sounds like this must be false in sufficiently many cases to make this a poor assumption.

Nathan Kurz said...

Hey Pragmatic ---

I'm basically a lay person, but one who's been following the contest very closely. Unfortunately, despite your warnings, I'm still tempted by the siren song of metadata. Could you offer some more specifics on what you tried, and how it failed? Or alternatively, could you offer some more musing on why a KNN/SVD/RBM combo already covers the same territory that the movie metadata would?

For instance, my instinct is that this trio is never going to be able to capture certain viewer idiosyncrasies that go counter to the mainstream. Is the problem that the RMSE is mainly due to the users with very low numbers of movies rated, and hence there is also no hope for a feature based system?

Trying to keep the faith,

Nathan Kurz
nate@verse.com

TR Teller said...

"honestly, to anyone out there that still has any doubts that extra movie data may be useful to predict user ratings, I say that you have to have faith in the machine. It's just smarter than we are."

Um, no.

You just have no idea what the useful "extra movie data" might be.

To be blunt, forget about "extra" -- you don't even know what the useful BASIC movie data might be.

If you did, you wouldn't be a coder. You'd be a studio exec, director, or an EP or something. An entertainment professional, at any rate.

I know it's a difficult feat of mental yoga to try to imagine the dimensions of all you don't know, peeps, but trust me, if you haven't owned executive responsibility for a career's worth of entertainment product, you don't know.

You can't know. Know what? The vast set of counter-intuitive entertainment rules: all the things no audience member would guess, understand, or believe, but every career entertainer is forced to understand.

These rules are not pretty, they're rarely politically correct, and they often go unspoken and certainly unpublished.

IOW, it's an art.

I'm not sure there's a way for a programmer to read themselves out of their fundamental production ignorance w/r/t entertainment, but I suppose consuming books on the arts and histories of film writing, development, casting, directing, shooting, editing, production management, post-production, marketing, and releasing couldn't hurt.

In short, it's an endless slog and probably way outside your bailiwick.

Sorry to be the bearer of bad news, but I'm betting the Netflix prize goes unclaimed for a LOOOOONG time.

Greg Weber, StreamSend Engineer said...

Where did you mine your data from, and how did you reconcile it with the netflix data?

[ dut ] said...

hmm. i didn't realize the algorithms were that tight. that explains why they're so interested in such small improvements. i might have to look into some netflix resources...

i'm assuming most people are already weighting the date - such that if i was briefly into kung fu movies when dating the thai boxer last summer, i won't keep getting the same kung fu recommendations forever. or to take into account that the first time a user signs in they engage in mass-rating of movies they haven't seen in a while (and therefore might rate differently) whereas the movies they get and rate over more spread out periods likely indicate that they're rating as they see new movies.

what about people rating movies they haven't seen? like, i've never seen titanic and never will, but unless i rate the movie a "1" it's likely to show up again and again as i watch kate winslet movies.

the biggest gain i can see coming from netflix recommendation engine can't even be computed from the data set you're given. think about it this way - you're shown a recommendations page with 10 entries on it. you rate one of them as having seen before, add 5 of them to the queue, and mark 3 as "not interested". does the system un-recommend related movies that i'm probably also not interested in?

<brain_dump>

Todd Hoff said...

Psychologically people are very poor at predicting which decision paths will make them happier. So it's probable most people have little insight as to what movies they'll like or why they like them. It's all buried in the subconscious and the conscious only provides post hoc rationalizations of how they ended up feeling about a movie. Adding more and more metadata is unlikely to make explicable what is hidden deep in the mind.

PragmaticTheory said...

Aaaaah that's more like it!! I was wondering why this post didn't yield more comments...

Let me try to answer some of the questions and comments directed my way...

JS: I read that article a few months back and it was actually very encouraging to me because I was working on this data mining stuff at the time... Now, after my experiments, I think that Mr. Rajaraman is suffering a bit from the teacher dogma syndrome (i.e. it wouldn't be very good for his data mining class if the students that did some data mining were doing worse than the others)... if not, then, hey, I'm willing to be convinced: where is this team that is "close to the best results on the Netflix leaderboard" using IMDb genres???

Nathan: All three algorithms that you mention can capture some parts of what meta data could and to a much more precise degree... As m_eiman and todd hoff so eloquently put it, meta data is a way for us to try to explain and put words on the ratings, but that is not the goal of this contest. If the algorithms can find actual strong correlations within the population, that is more useful to predict other users' ratings than force feeding hard coded correlations. I am working on a blog post which has a slightly different feel than this one, but which gives some additional insight on SVDs and how powerful they can be.

Greg: I'm not sure that the sites that I mined like it too much when people do that sort of thing, so I would rather not name names here. I simply used the movie names and years to correlate. It wasn't perfect, there was some sparseness, and I had to check some entries by hand, but it was good enough to run enough tests to convince myself that spending additional time on this approach was not a good idea.

tr teller: shoot... you got me... out of all the people who read my blog and those that left comments, you're the only one who figured out that it wasn't about observations I made while participating in a silly computing contest... you saw right through that and unmasked my master plan to take over the movie, entertainment industry and the world with my evil machines mouahahaha MOUAHAHAHAH mou ha ha. Well, I'm glad someone was paying attention. I guess I'll just have to go back to my cave and write some lines of code. After all, that's all I'm good for.

Anonymous said...

PT,

Have you considered the possibility that all the predictability has already been extracted by the current algorithms? Meaning that the rest is just noise and sheer randomness. I don't want to sound pessimistic at all, but so many factors are just random (for example, the viewer was sick the day he saw the movie)...

Good luck!

Anonymous said...

It's interesting, on one hand I totally agree with you that it makes sense that the "machine is smarter than IMDB genres". However, in the very non-scientific LP kind of way, I always found that the previews in a movie are a great predictor of the movie quality in my taste. That means that the movie studios are somehow able to have a mental image of my preferences in a way that the netflix recommendation system still lacks.

Perhaps there is something about human ability in pattern recognition after all.

Nathan Kurz said...

Pragmatic writes:
> I am working on a blog post which has a
> slightly different feel than this one,
> but which gives some additional insight
> on SVDs and how powerful they can be.

Thanks for the response. I'll look forward to your SVD post. The fact that SVDs of an extremely sparse matrix are as useful as they are at prediction is somewhat of a wonder to me. Instinctively, I'd think that overfitting would be more of a problem than it turns out to be.

Or perhaps it is a problem, and that's why the multi-model blending helps as much as it does. I have some ruminations on that here:
http://www.netflixprize.com/community/viewtopic.php?id=1021
The lack of response has me fearing that what I said is either patently obvious or obviously nonsense.

Thanks!

Nathan Kurz
nate@verse.com

Anonymous said...

The studio execs fund movies which are PG type because that will attract the most audience in the theaters.

Thus movies preferred by movie buffs are not produced with the same amount of money.

I just hope they take notice of The Dark Knight, and make more movies like the Machinist, American Psycho, Memento, Falling Down...

Ok now to examine a movie which was like the above with the same quality (sound + story goes fast + very interesting):

Last Exit (2006)

http://www.imdb.com/title/tt0488887/

This made for TV movie did not even get a DVD release. No Theater time. You can't even watch the movie online unless you download from a Warez Site.

And whatever happened to some TV shows that were canceled because the studio execs use the Nielsen system, which doesn't record every TV that is viewing the series?

eg. Peter Benchley's Amazon

http://www.imdb.com/title/tt0205737/

(Also note NO dvd release)

What they did was take that series and made a pathetic remake series called Lost.

Ben said...

I thought about this for a bit, but the more I think about it the more I realize the machine-generated factors overcome limitations you just couldn't get past with other sources. Even a site that tries to collect the professional reviews of multiple sources suffers from depressed reviews for movies that aren't "fashionable." Meanwhile, a computer analyzing the Netflix data can come up with a factor for social pressures of various kinds without even knowing what the word "fashionable" means. It doesn't even matter if I tend to lie about my ratings or put them in truthfully, as there will be patterns among even the people who want to lie to make their favorites look better.

Though from my own nosy viewpoint I'd love to have a service that shows me where I score on various factors, but I suppose that would probably end up offending someone, or even worse guiding them to stuff they only THOUGHT they wanted and didn't honestly like! Youch.

Anonymous said...

If NetFlix isn't going to assign the man-hours needed to watch every movie from beginning to end and fill out all the possible meta-data for it then using an algorithm that relies on meta-data will be of no use to NetFlix.

This project is not the video equivalent of the Music Genome Project.