David Morrison’s recent posts on agreement between wine critics (amateur and professional) got me thinking about what it would be reasonable to expect from wine evaluations that use scores. Wine critics are like everyone else. We have different backgrounds and histories, different educational experiences, different preferences, different sensitivity thresholds for flavor compounds, good days and bad days, different susceptibilities to distractions, etc. The discipline of wine tasting over many years would surely mitigate some of those differences to a degree but only to a degree.
This is compounded by the inherent problems with wine scores that I’ve discussed here.
So how much agreement is it reasonable to expect? Many people claim, myself included, that while critics might disagree about the precise quality of a wine, there should be broad agreement about which quality level a wine belongs in. No experienced critic should put Two Buck Chuck or Crane Lake in the category of a classic, great wine, or put Lafite in the category with bottom-shelf Merlot from Walmart. In other words, critics should agree about basic quality level while disagreeing about gradations within a quality level. This is because broad criteria such as complexity, intensity, balance, flaws, or typicity are reasonably objective standards for people who know what to look for.
How does that hypothesis fare with regard to Morrison’s data? It’s hard to find data where wine critics go head to head on the same bottles. But David found one—the 1996 Cabernet challenge in which James Suckling and James Laube tasted 10 California Cabs and 10 Bordeaux wines from each of two vintages. Forty bottles is still a small sample, but it is what we have to work with. The results according to David:
To me, there is a wide spread of points in the graph — the scores differ by up to 9 points, with 5 of the bottles differing by more than 6 points. Furthermore, the mathematical correlation indicates only 29% agreement between the two sets of scores.
29% is dreadful. Even when agreement is defined as plus or minus 2 points, so that close scores count as agreement, the figure rises to only 58%.
But this is predictable. Given the individual differences noted above, we would expect disagreements in the range of plus or minus 2 points.
Happily, David adds a further graph that measures the degree of agreement in placing wine in quality categories. He uses the Wine Spectator’s categories:
95 – 100 Classic: a great wine
90 – 94 Outstanding: a wine of superior character and style
85 – 89 Very good: a wine with special qualities
80 – 84 Good: a solid, well-made wine
75 – 79 Mediocre: a drinkable wine that may have minor flaws
50 – 74 Not recommended
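The Wine Spectator scale above is just a step function from scores to quality bands. As a minimal sketch of that mapping (the band floors and names come from the list above; the function name is my own):

```python
def category(score):
    """Return the Wine Spectator quality band for a score on the 50-100 scale."""
    bands = [
        (95, "Classic"),
        (90, "Outstanding"),
        (85, "Very good"),
        (80, "Good"),
        (75, "Mediocre"),
        (50, "Not recommended"),
    ]
    # Walk down the band floors and return the first one the score reaches.
    for floor, name in bands:
        if score >= floor:
            return name
    raise ValueError("score is below the 50-point floor of the scale")

print(category(92))  # Outstanding
print(category(88))  # Very good
```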
Here is the graph:
At first blush this doesn’t look good for my thesis that we should find substantial agreement on basic quality. Bottles within the boxes are wines on which the two critics agree about basic quality level. But only 25 of the 40 wines appear in the boxes, 63% agreement. Not so good.
But I think we need to take a second look at this. If we cannot expect agreement closer than plus or minus 2 points for wines within a quality level, we cannot expect agreement closer than plus or minus 2 points for wines at the border between quality levels either. In other words, if Laube assigns 90 points to a wine and Suckling assigns 88 points to the same wine, this chart treats them as disagreeing about basic quality even though the difference is only 2 points. The cutoff points between quality levels are rather arbitrary.
It seems to me the right way to look at this is to allow some imprecision on the borders of the quality levels as well. I spot 3 wines just outside the boxes where the discrepancy is 2 points or less. That moves the agreement to 28 out of 40 wines, or 70%. If we widen the margin to plus or minus 3 points, an additional 4 wines fall within the boxes. That’s 32 of 40, or 80% agreement.
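The adjustment above amounts to a simple rule: two scores agree on basic quality if they land in the same band, or if they straddle a border but differ by no more than the allowed margin. A sketch of that rule, using made-up score pairs rather than the actual Suckling/Laube data:

```python
from bisect import bisect_right

# Lower bounds of the Wine Spectator quality bands, in ascending order.
FLOORS = [50, 75, 80, 85, 90, 95]

def band(score):
    # Index of the quality band a score falls into.
    return bisect_right(FLOORS, score)

def agree_on_quality(a, b, tolerance=0):
    # Same band counts as agreement; with a tolerance, scores that cross
    # a border but differ by no more than `tolerance` points also count.
    return band(a) == band(b) or abs(a - b) <= tolerance

# Hypothetical score pairs for illustration only:
pairs = [(90, 88), (95, 91), (86, 85)]
for t in (0, 2, 3):
    n = sum(agree_on_quality(a, b, t) for a, b in pairs)
    print(f"tolerance {t}: {n} of {len(pairs)} pairs agree")
```

With a tolerance of 0, the (90, 88) pair counts as a disagreement because the scores sit on opposite sides of the 90-point border; widening the margin to 2 points flips it to an agreement, which is exactly the move made in the paragraph above.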
Which brings me back to my original question. How much agreement should we expect? It seems to me, given the inherent lack of precision in the assignment of scores, a disagreement of 3 points on basic quality is a reasonable expectation. On that standard, the 80% level of agreement on basic quality is about what we should expect. After all, the criteria we use to determine basic quality—complexity, intensity, balance, typicity, obvious flaws, etc.—are subject to some of the same individual differences that other measures of wine quality are subject to. Disciplined tasting over many years should mitigate them to some extent but can’t eliminate them.
But an 80% rate of agreement within a range of 3 points should put to rest the notion that it’s all subjective and there is no such thing as wine quality. If that were true, the 80% agreement would be hard to explain.