2016 Olympic medal count as a function of population and GDP
While watching the Olympics this year I started wondering whether countries such as the USA and China are among the top medal earners simply because of their sheer population. Assuming all other factors are negligible (which they probably aren't), it stands to reason that great athletes have a higher chance of being born in a highly populous country. A quick glance at the medal table shows that this cannot be the whole story, or Great Britain wouldn't have more medals than China. But let's see how far from the truth it is.
Of course, I wasn't the first to ask this question: medalspercapita.com has a nice overview of all the numbers showing that, per capita, countries such as Jamaica and New Zealand are at the top of the list. Let's visualize this and do some ordinary least squares regression, to boot.
Well, there goes the theory. Even when you consider China and India to be outliers, population doesn't explain medal count very well.
FYI: R² is a measure of how well the line fits the data. It is defined as the fraction of the variance in medal count that's explained by population. It ranges from zero to one, with one meaning 100% of the variance is explained (which only happens when all the points lie exactly on the line).
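For anyone who wants to reproduce this kind of fit, here is a minimal sketch in Python with NumPy. It is not the code behind the plots in this post: the function names `fit_line` and `r_squared` are made up for illustration, and the numbers at the bottom are toy values, not real Olympic data.

```python
import numpy as np

def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept

def r_squared(x, y, slope, intercept):
    """Fraction of the variance in y explained by the fitted line."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    predicted = slope * x + intercept
    ss_res = np.sum((y - predicted) ** 2)  # variation the line leaves unexplained
    ss_tot = np.sum((y - y.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Made-up toy numbers that lie exactly on a line, NOT real medal data.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])
slope, intercept = fit_line(x, y)
print(r_squared(x, y, slope, intercept))  # very close to 1 for a perfect line
```

With real data you would pass population (or GDP) as `x` and medal count as `y`.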
Another thought I (and probably many others before me) had was that richer countries might get more medals. As it happens, medalspercapita.com also shows GDP (gross domestic product) per country. Let's plot that.
That's a much better fit! Plus, there are no obvious outliers.
Of course, GDP itself depends on a country's population. You might think of plotting medals vs. GDP per capita instead. However, you get GDP per capita by dividing GDP (which we've looked at) by population (which we've also looked at), so we can roughly predict what the result will be. We know population only has a small effect, while GDP has a large effect. So if we remove population from the mix, it probably won't change much.
There you are: a little fun fact for armchair statisticians discussing the Olympics.
EDIT August 22nd, 2016:
As suggested by /u/mu_Bru, here's the GDP plot with log-log axes (apologies for not including it straight away, it's such a natural thing to do with this data):
EDIT August 23rd, 2016:
Well, it turns out I completely forgot about, and thus violated, one of the assumptions of least squares regression: homoscedasticity. It's a fancy word stating that the variance of the errors (i.e. the vertical distances from the real values to the fitted line, or how "wrong" the line is) must be constant across the whole range of the data. If it isn't, you have a nasty case of heteroscedasticity. When the data forms a sort of 'cone', with low values bunched up together and high values more spread out (as the data above clearly does), you can't do least squares regression reliably. Whoops!
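A rough way to spot that 'cone' without staring at a plot is to compare how spread out the residuals are for small and large x values. The sketch below does that. The helper `residual_spread_ratio` is a hypothetical name, the data is randomly generated rather than real, and this is an eyeball check, not a formal statistical test.

```python
import numpy as np

def residual_spread_ratio(x, y):
    """Residual spread in the upper half of x divided by that in the lower half.

    A ratio far from 1 hints at heteroscedasticity.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    order = np.argsort(x)                # indices that sort the points by x
    half = len(x) // 2
    low = residuals[order[:half]]        # residuals for the smaller x values
    high = residuals[order[half:]]       # residuals for the larger x values
    return np.std(high) / np.std(low)

# Synthetic data, NOT real medal counts: the noise grows with x, giving a cone.
rng = np.random.default_rng(0)
x = rng.uniform(1, 100, 500)
y_cone = 2 * x + rng.normal(0, 1, 500) * x
y_flat = 2 * x + rng.normal(0, 5, 500)   # constant noise for comparison

print(residual_spread_ratio(x, y_cone))  # clearly above 1
print(residual_spread_ratio(x, y_flat))  # roughly 1
```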
One way to get around this is to transform the data so that the error variance becomes (close to) constant. So I've added a new best-fit line to the log-log plot above, calculated from the log10 of the data on both axes. As you can see, it doesn't fit nearly as well, but it's a much fairer analysis.
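In code, the log-log fit amounts to taking log10 of both variables before fitting the line. Here is a small sketch, again with NumPy. The names `fit_loglog` and `predict` are invented for this example, and the sketch assumes every value is strictly positive, since the logarithm of zero is undefined.

```python
import numpy as np

def fit_loglog(x, y):
    """Least squares fit of log10(y) = slope * log10(x) + intercept.

    Assumes x and y are strictly positive.
    """
    log_x = np.log10(np.asarray(x, dtype=float))
    log_y = np.log10(np.asarray(y, dtype=float))
    slope, intercept = np.polyfit(log_x, log_y, deg=1)
    return slope, intercept

def predict(x, slope, intercept):
    """Map the fitted line back to the original scale: y = 10**intercept * x**slope."""
    return 10.0 ** intercept * np.asarray(x, dtype=float) ** slope

# Made-up toy numbers following y = 3 * sqrt(x), NOT real GDP or medal data.
x = np.array([1.0, 10.0, 100.0, 1000.0])
y = 3.0 * np.sqrt(x)
slope, intercept = fit_loglog(x, y)
print(slope, 10.0 ** intercept)  # about 0.5 and about 3
```

A straight line in log-log space corresponds to a power law on the original axes, which is why `predict` raises x to the fitted slope.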
Oh well, perhaps it's a good thing Olympic medals are hard to predict!