Olympic medals vs GDP & population since 1896
As a follow-up to my previous post about the 2016 Olympics, I've collected data about all countries that have participated in the Olympics since the first one was held in its modern form in 1896 and put it all in a nice interactive graph.
R² is a measure of how well the linear regression model explains the variance in the data. In other words: how well does the chosen variable (population, GDP, etc.) predict each country's medal performance? It ranges from zero (not at all) to one (a perfect fit).
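For reference, the standard definition can be written out as follows. The symbols are introduced here only for illustration and are not taken from the graph: y_i is the observed value for country i, ŷ_i is the value the fitted line predicts for it, and ȳ is the mean of the observed values.

```latex
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
```

For a straight-line least-squares fit with a single predictor, this equals the square of the Pearson correlation between the two variables.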
My initial suspicion that population predicts Olympic performance was already debunked, and this shows even more clearly that the correlation is very weak. GDP, however, is a much better predictor, while GDP per capita isn't one at all! The only difference between the two is that GDP per capita divides out population, so the effect of population shouldn't be underestimated after all. The R² values vary from year to year, but the overall trend appears to be solid.
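One way to see why the two measures can behave so differently is to write GDP as a product. This is only an identity, not a result from the data, and it uses the log scale discussed in the note on log-log regression:

```latex
\mathrm{GDP} = \mathrm{GDP\ per\ capita} \times \mathrm{population}
\quad\Longrightarrow\quad
\log_{10}(\mathrm{GDP}) = \log_{10}(\mathrm{GDP\ per\ capita}) + \log_{10}(\mathrm{population})
```

On log axes, then, GDP carries exactly the population term that GDP per capita leaves out.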
Since we have data across all the Games now, we can compare a country's performance in a given year with its performance in other years. I've dubbed this metric 'weighted medals across all Games' in the graph, and it was calculated as follows:
- For every Olympic Games, calculate each country's percentage of the (weighted) medals won (note that this means every Olympics contributes equally, regardless of how big the medal pool was)
- Take the mean across Games for each country
- Normalize to percentage of total medal pool
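The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code behind the graph: the medal weights (gold 3, silver 2, bronze 1), the input layout, and the function name are all assumptions, and the mean is taken only over the Games in which a country appears in the data, which is one possible reading of the second step.

```python
from collections import defaultdict

# Hypothetical weights: the weighting scheme is not spelled out in this post.
WEIGHTS = {"gold": 3, "silver": 2, "bronze": 1}


def weighted_share_across_games(medals):
    """medals maps games -> country -> {"gold": n, "silver": n, "bronze": n}.

    Returns country -> share of the all-time pool, in percent.
    """
    shares = defaultdict(list)
    for table in medals.values():
        # Step 1: each country's percentage of the weighted medals at these Games.
        scores = {
            country: sum(w * counts.get(kind, 0) for kind, w in WEIGHTS.items())
            for country, counts in table.items()
        }
        total = sum(scores.values())
        if total == 0:
            continue  # no medals recorded for these Games
        for country, score in scores.items():
            shares[country].append(100.0 * score / total)

    # Step 2: mean across the Games in which the country appears (assumption).
    means = {country: sum(v) / len(v) for country, v in shares.items()}

    # Step 3: normalize so that all countries together add up to 100%.
    grand_total = sum(means.values())
    return {country: 100.0 * m / grand_total for country, m in means.items()}


# Made-up example data, only to show the input shape.
example = {
    "Games 1": {"X": {"gold": 1}, "Y": {"bronze": 3}},
    "Games 2": {"X": {"gold": 2}, "Z": {"silver": 1}},
}
print(weighted_share_across_games(example))
```

Averaging only over the Games a country took part in is also what makes the final normalization step necessary: the per-country means no longer sum to 100% on their own.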
Going through all Olympic Games in the graph above, it seems this is the best predictor of the three. It makes sense intuitively: we know certain countries tend to be in the top, say, 5% of medal earners. What's interesting is that it seems to hold for the lower 95% as well.
A note on log-log linear regression
Ordinary least-squares linear regression was done only on the log-log transformed data. This is because the original, untransformed data doesn't satisfy the assumption of homoscedasticity for most Olympic Games. In mathematical terms, the variance of the errors around the fitted line isn't constant. Take a look at population for the 2016 Rio Olympics, for example, and turn off the log-log check-box. You'll see most of the countries bunched up together in the lower left corner, with only a few on the right. If I were to fit a line through that, it would appear to have a very low error because most countries would sit very close to the line. But that would be unfair! They only appear close to the line because the few countries on the right are so much farther away that they stretch the scale. If we transform the data by taking the base-10 logarithm of each axis, the data points spread out nicely. And because we're taking the log of both axes, a power-law relationship becomes a straight line, so a linear fit still works.
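As a rough sketch of what such a fit involves, the snippet below log-transforms both variables and runs an ordinary least-squares fit by hand. The function names and the sample numbers are made up for illustration. Dropping countries with zero medals (whose logarithm is undefined) is an assumption, since the post doesn't say how those were handled.

```python
import math


def ols_fit(xs, ys):
    """Least-squares line y = intercept + slope * x, plus its R²."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


def loglog_fit(xs, ys):
    """Fit a line to (log10 x, log10 y), skipping non-positive values."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x > 0 and y > 0]
    log_x = [math.log10(x) for x, _ in pairs]
    log_y = [math.log10(y) for _, y in pairs]
    return ols_fit(log_x, log_y)


# Synthetic numbers, not real Olympic data: population vs. weighted medals.
population = [1e6, 5e6, 2e7, 8e7, 3e8, 1.3e9]
medals = [1, 4, 3, 40, 120, 70]
slope, intercept, r2 = loglog_fit(population, medals)
print(f"slope={slope:.2f} intercept={intercept:.2f} R2={r2:.2f}")
```

A slope of b on the log-log axes corresponds to medals growing roughly like the predictor raised to the power b on the original axes.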
Ranking countries' all-time performance
Here's a ranked list of all the countries with their average share of weighted medals per year (as a percentage of the total medal pool) from 1896 to 2016. This includes both Winter and Summer Games.
| Rank | Country | Mean medal share |
|------|---------|------------------|
| 72 | Trinidad & Tobago | 0.028% |
Some notes about the data:
GDP inconsistencies across Games: If you look closely, there is a big change in GDP and GDP per capita for some countries when you go from 2008 to 2010. This is because I used two different sources to gather this data and, apparently, they measure GDP somewhat differently. I'm not too bothered by this because the goal is to compare countries' GDP within a single Olympic Games, not across Olympic Games.
Didn't separate Winter and Summer Games: It's clear some countries perform better at either the Winter or the Summer Olympics, yet I didn't separate them when calculating weighted medals across all Games, which gives a somewhat unfair view of each country's average performance. I'd like to separate them, but through 1992 the Winter and Summer Games were held in the same year and both were included in that year's total medal count, so most of the Games have mixed Summer/Winter data anyway and there's little I can do about it. I could split them up from 1994 onward, but that leaves only six Games of each kind, which is on the low side for generalizing.
Notes about the cleaning of the data:
- Where GDP for a particular year was unavailable, linear interpolation between the nearest available values was used
- Luxembourg, Bohemia, the Bahamas, Tonga, Barbados, Iceland, the United Arab Emirates, Chinese Taipei, and Liechtenstein are not included for some years because no GDP data was available
- East and West Germany are not included because no separate GDP data was available
- Estonia is not included before 1973 because GDP data before that year was unavailable
- Cuba is not included before 1929 because no GDP data before that year was available
- GDP data for Korea after 2008 was not available
- Mixed/International teams are not included
- No GDP data was available for North Korea or Egypt after 2008, so these are displayed only when population is the selected metric
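The interpolation mentioned in the first cleaning note can be sketched like this. It is a minimal illustration with a hypothetical function name and made-up numbers. Returning nothing when the year falls outside the known range is an assumption, chosen to match the exclusions listed above.

```python
def interpolate_gdp(known, year):
    """known maps year -> GDP. Returns an estimate for `year`, or None.

    Uses a straight line between the nearest known year before and the
    nearest known year after the requested one.
    """
    if year in known:
        return known[year]
    earlier = [y for y in known if y < year]
    later = [y for y in known if y > year]
    if not earlier or not later:
        return None  # no value on one side: nothing to interpolate between
    y0, y1 = max(earlier), min(later)
    fraction = (year - y0) / (y1 - y0)
    return known[y0] + fraction * (known[y1] - known[y0])


# Made-up GDP figures, only to show the call.
gdp = {1900: 100.0, 1910: 200.0}
print(interpolate_gdp(gdp, 1904))
print(interpolate_gdp(gdp, 1920))
```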
Sources:
- Medals 1896 to 2008: http://www.databaseolympics.com
- GDP and population 1896 to 2008: Maddison dataset
- Medals 2010 to 2016: Wikipedia medal count pages (e.g. the 2014 Winter Olympics in Sochi)
- GDP 2008 to 2014: World Bank
- Population 2008 to 2014: World Bank
- GDP estimate for 2016: International Monetary Fund
- Population estimate for 2016: Wikipedia's list of countries by population (accessed August 29th, 2016)