Bothered by the idea of judges’ subjective opinions determining the best performances in Olympic events, a Minitab statistician analyzes the results of two events to evaluate whether the judges were consistent and fair.
by Joel Smith, senior business development representative
At about 3 p.m. on July 29, two female divers will step to the edge of side-by-side three-meter springboards and, in synchronized formation, flip, twist, and turn through the air before entering the pool at the London Aquatics Center with as little splash as possible. A panel of judges will independently evaluate how well they executed the dive, and their scores will be reported to all watching. Thus will begin the diving portion of the 2012 Olympic Games, one of several events (others include gymnastics and synchronized swimming) in which winners are determined by judges rather than stopwatches, measuring tape, or a scoreboard. But have you ever wondered how accurate those judges are?
Most statisticians, Six Sigma practitioners, and quality professionals know to evaluate their measurement system before making any decisions based on their data. This is, of course, because we want to trust the data and the information it gives us. Long bothered by the idea of judges’ subjective opinions determining the best performances in Olympic events, I decided to analyze—at length (now would be a good time to grab a cup of coffee before proceeding)—the results of two events to evaluate whether the judges were consistent and fair:
You can read the details about how this event is organized and the results here, but for our purposes, all you need to know is that each diver in the finals performs six dives, with each dive having a degree of difficulty assigned to it. The degree of difficulty is based on the dive itself and is not judged; for example, a triple-flailing belly flop has a degree of difficulty of 0.1 regardless of who performs it or how well they execute it. Seven judges evaluate how well each dive was executed. Although it will not affect our analysis, the highest and lowest scores for each dive are dropped and the remaining scores are multiplied by the degree of difficulty to give the total score for that dive. For example, here is how one diver’s first dive is recorded in the worksheet:
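That scoring rule can be sketched in a few lines of Python. This is only a sketch: the text says the single highest and lowest marks are dropped, and I am assuming the remaining marks are summed before being multiplied by the degree of difficulty.

```python
def dive_score(judge_scores, difficulty):
    """Total score for one dive: drop the single highest and lowest
    marks, sum what remains, and multiply by the degree of difficulty.
    (Summing the remaining marks is my assumption.)"""
    if len(judge_scores) < 3:
        raise ValueError("need at least three judges' marks")
    kept = sorted(judge_scores)[1:-1]  # drop one low and one high mark
    return round(sum(kept) * difficulty, 2)

# Hypothetical marks from seven judges for a dive of difficulty 3.1:
print(dive_score([8.0, 8.5, 7.5, 8.0, 9.0, 8.5, 8.0], 3.1))  # 127.1
```

Dropping the extremes makes the total robust to a single judge scoring far above or below the rest, which is exactly the kind of bias we are about to test for.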
I am going to analyze the data with a General Linear Model. My response is Score, and here are the terms I will begin with:
Here is my ANOVA table before any terms are removed:
If you are looking to bolster your faith that judges do a good job, this is exactly the kind of result you want to see. All terms involving a judge are insignificant (in fact, none are even close to significant), so we see no evidence that certain judges favor certain divers (the Diver*Judge interaction), that certain judges are better or worse at judging easier or harder dives (the Judge*Difficulty interaction), or that certain judges grade higher or lower relative to others (the Judge term).
After removing those terms and rerunning the analysis, we get the following ANOVA table:
All terms are significant and judges have no effect, but the R-Sq(adj) of 50.17% doesn’t exactly give me a warm and fuzzy feeling inside. However, the initial analysis did not include the Diver*Dive interaction because it is confounded with Difficulty (for example, Thomas Daley’s first dive always has the same degree of difficulty). But we can incorporate that same Difficulty information by including a Diver*Dive interaction term, which additionally accounts for each diver’s variability between dives. So after removing the Difficulty term and the Diver*Difficulty interaction and instead including the Diver*Dive interaction, we get the following ANOVA table:
Now we are left with three very significant terms—none of which are an effect from judges—and a very satisfying R-Sq(adj) of 89.63%. This is exactly what you want from a judged competition in terms of judges not exhibiting biases, and great news for those two synchronized divers kicking things off in London!
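The confounding mentioned above (each diver’s nth dive always carries the same degree of difficulty) can be verified directly from a worksheet: if every Diver*Dive cell contains exactly one difficulty value, the interaction term can absorb everything the Difficulty column contributes. A quick check, using hypothetical rows rather than the actual London data:

```python
from collections import defaultdict

# Hypothetical worksheet rows: (diver, dive number, degree of difficulty).
# Names and values are illustrative only.
rows = [
    ("Daley", 1, 3.4), ("Daley", 1, 3.4), ("Daley", 2, 3.6),
    ("Boudia", 1, 3.2), ("Boudia", 1, 3.2), ("Boudia", 2, 3.8),
]

difficulties = defaultdict(set)
for diver, dive, dd in rows:
    difficulties[(diver, dive)].add(dd)

# If every Diver*Dive cell holds exactly one difficulty value, the
# Difficulty column adds no information beyond the Diver*Dive interaction.
confounded = all(len(dds) == 1 for dds in difficulties.values())
print(confounded)  # True
```

When this check passes, a model cannot estimate separate effects for Difficulty and Diver*Dive, which is why one had to be dropped in favor of the other.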
You can read a more complete account of the judging scandal for this event here, but I’ll provide a brief summary using the contestants’ countries instead of their lengthy-to-type and difficult-to-pronounce actual names. The Russian team had a small lead over the Canadians after the first of two programs. The Canadians chose an easier routine than the Russians for the second program but executed it flawlessly, while the Russian team made an error during theirs. The crowd onsite as well as live television commentators overwhelmingly believed the Canadians had just won gold—but when results were announced they still needed an additional 0.1 points (on a scale from 1 to 6) from any single judge to catch the Russians.
The Canadians won silver instead of gold, but it was reported soon after that the French judge had admitted—twice—that she had been pressured to vote for the Russians over the Canadians regardless of performance (she later recanted). So could there be evidence of judge bias in this event?
In figure skating, pairs are judged on “Technical Merit” (TM) and “Artistic Impression” (AI) and there is no difficulty rating, so here is how our data is structured:
For example, Judge #1 gave the Russian skaters Berezhnaya and Sikharulidze a 5.8 for Artistic Impression.
As we did in the diving analysis, we’ll use General Linear Model to analyze the data. We’ll include the following terms in the initial model:
Here is the resulting ANOVA table:
If the first skaters on the ice in the 2014 Winter Olympics in Sochi feel a little less confident than our synchronized divers, it would appear they have a good reason! Before diving a little deeper, here are a couple of takeaways we can immediately see in the ANOVA table:
It is worth noting that based on the Sum of Squares, the majority of variation in the data is from the difference between skating pairs and not from the judges, with the true difference between pairs (the equivalent of part-to-part variation in Gage R&R) accounting for 96.8% of the variation in scores. But even so, it is troubling to find judging bias, as those biases represent a lack of fairness and true competition.
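For illustration, the pair-to-pair share of variation is simply the pair term’s sum of squares divided by the total. The numbers below are hypothetical, chosen only so that the share matches the 96.8% reported above; they are not the actual ANOVA values.

```python
# Hypothetical sums of squares from an ANOVA table (illustrative only).
# The pairs' share of variation is the analogue of part-to-part
# variation in a Gage R&R study.
ss = {"Pair": 24.2, "Judge": 0.3, "Judge*Pair": 0.4, "Error": 0.1}

total = sum(ss.values())
pair_share = ss["Pair"] / total
print(f"{pair_share:.1%}")  # 96.8%
```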
Remember that one judge admitted being pressured on her ratings of two particular teams. Assuming that is true, it should not be surprising that at least the Judge*Name interaction was significant. To evaluate unusual ratings, I stored the coefficients for each combination of judge and skating pair, and plotted them. Values close to 0 indicate the judge does not exhibit bias towards that pair. More positive values indicate the judge is biased towards scoring them unusually high, while more negative values indicate the judge is biased towards scoring them low. Highlighted are the questionable judge’s (Judge #4) coefficients for the Russian and Canadian pairs:
It appears that the French judge likely exhibited no bias towards the Russian team but some negative bias towards the Canadian team—in fact, if the bias towards the Canadians was eliminated, the Canadians likely would have tied the Russians.
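In a balanced layout, coefficients like the ones plotted can be estimated by hand: take each judge-by-pair cell mean and subtract the grand mean, the judge’s overall tendency, and the pair’s overall quality. This pure-Python sketch uses hypothetical scores rather than actual competition marks, and it shows the textbook two-way decomposition, not Minitab’s exact GLM output:

```python
from statistics import mean

# Hypothetical mean score each judge gave each pair (illustrative only).
scores = {
    "J1": {"Russia": 5.8, "Canada": 5.9},
    "J2": {"Russia": 5.9, "Canada": 5.7},
}

judges = list(scores)
pairs = list(scores["J1"])

grand = mean(scores[j][p] for j in judges for p in pairs)
judge_eff = {j: mean(scores[j][p] for p in pairs) - grand for j in judges}
pair_eff = {p: mean(scores[j][p] for j in judges) - grand for p in pairs}

# Interaction coefficient: what remains after removing the overall level,
# the judge's own tendency, and the pair's own quality. Values near 0
# suggest no judge-specific bias toward that pair.
bias = {
    (j, p): round(scores[j][p] - grand - judge_eff[j] - pair_eff[p], 3)
    for j in judges for p in pairs
}
print(bias)
```

In this toy data, J1’s coefficient for Russia comes out negative and for Canada positive (with J2 mirrored), so the coefficients sum to zero across judges and across pairs, just as the stored GLM coefficients do.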
But the bigger picture is how biased judges are generally. Remember that the highlighted points are for a judge who admitted being pressured on her ratings of those two teams, and consider how little they differ from one another relative to judges in general. Several points fall at roughly 0.2 points above or below 0, which, given that each judge scores two criteria across two performances, can give a pair an enormous advantage or disadvantage. Why? The coefficient is 0.2, but that must be multiplied by the number of scores the judge gives that pair (2 categories times 2 performances in the finals = 4 scores). So the 0.2 coefficient translates to a 0.8-point swing in that pair’s final score. Remember, the Russian team captured gold over the Canadian team by a mere 0.1 points!
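The arithmetic behind that 0.8-point swing, spelled out:

```python
coefficient = 0.2    # judge-by-pair bias per individual score
criteria = 2         # Technical Merit and Artistic Impression
performances = 2     # two programs in the finals

# The bias applies to every score the judge gives the pair:
swing = coefficient * criteria * performances
print(swing)  # 0.8
```

A 0.8-point swing from a single judge dwarfs the 0.1-point margin that decided gold.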
So, what have we learned?
At about 3 p.m. on July 29, two female divers will step to the edge of side-by-side three-meter springboards and, in synchronized formation, flip, twist, and turn through the air before entering the pool at the London Aquatics Center with as little splash as possible. Based on what we’ve learned in this analysis, a panel of judges will independently—and fairly—evaluate how well they executed the dive, and their scores will be reported to all watching.
Eighteen months later, a pair of figure skaters will step onto the ice in Sochi and execute a series of jumps, spins, and choreographed moves in a display of athleticism and precision. A panel of judges will also independently evaluate how well they executed the routine and scores will be reported to all watching…but will they be fair? Will we be able to trust the data?
________________________________
Joel Smith, statistician and senior business development representative at Minitab Inc., works with Six Sigma and quality improvement consultants and partners to develop new opportunities for the use of Minitab software products. He has worked with numerous companies on process improvement projects and initiative deployments. Smith enjoys sharing his knowledge of data analysis and process improvement, and has become known for delivering enlightening and entertaining talks at national and regional quality conferences. Smith joined Minitab in 2004, and since then has worked as a statistician and Six Sigma specialist in the company’s technical support, commercial sales, and business development departments. He earned a bachelor’s degree in chemical engineering from Rose-Hulman Institute of Technology and a master’s degree in statistics from Virginia Tech. Smith is a certified Lean Six Sigma Master Black Belt.
Diving image used under Creative Commons Attribution ShareAlike 3.0 license.