At the WSJ, Carl Bialik writes a very interesting column (and blog) called The Numbers Guy, both of which talk about the use and misuse of numbers to convey information (or pseudo-information).
This morning’s column (inside the paywall) talks about the controversy within the movie critic profession over the use of “star” ratings, and he elaborates on it (outside the paywall) in his blog. In addition to its implications for user-generated content (more later), the column caught my attention because I am a former Arts Dept. editor for The Tech, the MIT student paper.
Both address the issue of reducing everything to one number. At The Tech, we used turkeys instead of stars, reverse coded from 0-5 (no turkeys is best). Other systems use 1-4 or 1-5 stars. So if a movie wins 3/5 stars, is it an average movie across the board, or is it a movie that’s great on some aspects (e.g. performances) and terrible on others (script)?
So this problem is one of “good” being a multi-dimensional measure. Hotels.com and other consumer rating sites seem to be able to compile these multiple dimensions on their UGC and allow buyers to see ratings on the dimensions that matter to them.
Hotels.com, Consumer Reports and others also have the issue of norming for price. Does 4 stars mean the same for a $50 motel and a $500 hotel (or for a $15,000 and a $50,000 car)? For the AAA diamond system, it’s an absolute scale, but Hotels.com seems to want the rating normed based on value.
As Bialik notes in his column (and an earlier column), the problem recurs at the next level of analysis: aggregating the ratings of multiple reviewers, as is done by Rotten Tomatoes and Metacritic. Does a 3/5 mean everyone gave it a 3, a bell-curve distribution around 3, or a bimodal distribution split between 1s and 5s?
This is nicely (& easily) handled by Amazon, which gives both a mean and a histogram. This works particularly well for polarizing authors like Al Franken or Ann Coulter. However, it doesn’t fix the sampling bias issue: the people who self-select to write a review are not representative of the reading public overall (an issue Bialik notes in his paid article).
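To make the point concrete, here is a minimal Python sketch of the mean-plus-histogram idea. The two rating lists are invented for illustration, and `summarize` is a hypothetical helper name, not anything Amazon publishes. Both lists average exactly 3, but the histograms tell very different stories.

```python
from collections import Counter
from statistics import mean


def summarize(ratings, scale=(1, 2, 3, 4, 5)):
    """Return the mean and a per-star count for a list of integer ratings."""
    counts = Counter(ratings)
    # Include every point on the scale so empty buckets show up as zero.
    histogram = {star: counts.get(star, 0) for star in scale}
    return mean(ratings), histogram


# Invented data: one movie everyone finds average, one that splits the room.
consensus = [3, 3, 3, 3, 3, 3]
polarized = [1, 1, 1, 5, 5, 5]

for name, ratings in [("consensus", consensus), ("polarized", polarized)]:
    avg, hist = summarize(ratings)
    print(f"{name}: mean={avg}")
    for star in sorted(hist, reverse=True):
        print(f"  {star} stars | {'#' * hist[star]}")
```

The mean alone cannot tell these two cases apart; the histogram can. Neither does anything about who chose to submit a rating in the first place.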
One issue that Bialik doesn’t address is weighting the average. For movies, critics who see dozens or hundreds of movies a year usually reward things that “push the envelope,” which normally means some sort of aberrant production values (Blair Witch), script (Curious Case of Benjamin Button) or characters (Boys Don’t Cry). Others are excited by the craft — the production values, acting performance, directing — of the sort that win Oscar® awards.
However, I am plunking down $10 two to four times a year to be entertained. I want something that I enjoy watching, not something that pushes the envelope. A good example was “You’ve Got Mail,” which is a very nicely done romantic comedy (the ideal date movie) but only got a 62% critic rating. Sure, the plot was predictable, and the ending was pre-ordained, but the character quirks and plot twists were believable and at times funny: not as timeless as Tracy-Hepburn, but as close to a modern-day rendition as Hollywood offers nowadays.
So even if we have an accurate, low-variance attribute rating (everyone agrees this is a predictable plot), it doesn’t solve the problem that buyers differ on how important that is. The problem was solved 30+ years ago by Fishbein & Ajzen, who described a subjective utility model with different attribute importance ratings (i.e. weights).
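A rough sketch of what importance weights do, in Python. This is a plain weighted average, a simplification in the spirit of such a model rather than Fishbein & Ajzen’s exact formulation. The attribute names, ratings and weights are all made up for illustration.

```python
def personal_score(attribute_ratings, importance):
    """Weighted average of attribute ratings, using one viewer's importance weights."""
    total_weight = sum(importance.values())
    if total_weight == 0:
        raise ValueError("at least one attribute must have non-zero importance")
    weighted_sum = sum(importance[attr] * attribute_ratings[attr] for attr in importance)
    return weighted_sum / total_weight


# Hypothetical attribute ratings (1-5) that everyone agrees on for one movie.
movie = {"originality": 2, "craft": 4, "fun": 5}

# Hypothetical importance weights for two kinds of viewer.
critic = {"originality": 3, "craft": 2, "fun": 1}
casual = {"originality": 0, "craft": 1, "fun": 3}

print("critic:", round(personal_score(movie, critic), 2))
print("casual:", round(personal_score(movie, casual), 2))
```

Same movie, same agreed-upon attribute ratings, yet the casual viewer’s score comes out well above the critic’s, purely because of the weights.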
To discern what’s important to each buyer, this seems like a job for a neural network or other self-training system. Again, the Amazon recommendation engine handles this: if I like Al Franken, it shows me Bill Maher and Stephen Colbert; if I like Ann Coulter, it shows me Rush Limbaugh and Bill O’Reilly.
Interestingly, Amazon has very different incentives than the movie sites in aggregating ratings. Rotten Tomatoes or Metacritic want me to linger on the site looking at ads. Amazon wants me to buy something that I like, so I’ll buy more. Whether it’s different incentives or a different scale of revenues, Amazon seems to be much more serious about giving recommendations that are accurate for my tastes.
Perhaps what we need is an open source score aggregation system, one that
- handles multiple dimensions of rating
- conveys the distribution of ratings, not just the average
- matches my own opinions to those of reviewers whose opinions most closely match mine, to give me feedback consistent with my tastes, interests and values.
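
As a sketch of that last requirement, here is a very small nearest-reviewer predictor in Python. It is not how Amazon, Rotten Tomatoes or Metacritic work (I don’t know their internals); it is just one simple way to weight reviewers by how closely they have agreed with me in the past. The reviewer names, titles and ratings are invented, and `similarity` and `predict` are hypothetical helper names.

```python
def similarity(mine, theirs):
    """Score from 0 to 1: higher means our ratings on shared titles are closer."""
    shared = set(mine) & set(theirs)
    if not shared:
        return 0.0
    mean_gap = sum(abs(mine[t] - theirs[t]) for t in shared) / len(shared)
    return 1.0 / (1.0 + mean_gap)


def predict(mine, reviewers, title, k=2):
    """Similarity-weighted average of the k most similar reviewers' ratings for title."""
    candidates = [
        (similarity(mine, ratings), ratings[title])
        for ratings in reviewers.values()
        if title in ratings
    ]
    # Ignore reviewers with whom I share no rated titles.
    candidates = [c for c in candidates if c[0] > 0]
    top = sorted(candidates, reverse=True)[:k]
    if not top:
        return None
    return sum(sim * rating for sim, rating in top) / sum(sim for sim, _ in top)


# Invented data: my ratings and two reviewers' ratings on a 1-5 scale.
my_ratings = {"Movie A": 5, "Movie B": 2}
reviewers = {
    "kindred": {"Movie A": 5, "Movie B": 2, "Movie C": 4},
    "opposite": {"Movie A": 1, "Movie B": 5, "Movie C": 1},
}

print(predict(my_ratings, reviewers, "Movie C", k=1))
print(predict(my_ratings, reviewers, "Movie C", k=2))
```

With `k=1` the prediction simply follows the reviewer who has agreed with me most; with `k=2` the dissenting reviewer pulls it down, but only in proportion to how little we have agreed before. A real system would need far more data and would have to cope with reviewers who use the scale differently.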