Reddit's attention inequality

Posts on Reddit can be upvoted or downvoted, and people can comment on them. But how is this attention distributed? Does each post get its fair share, or does the top 1% get 99% of all the upvotes and comments?

Calculating what percentage of Reddit submissions gets ignored is easy: just download all Reddit submissions from 2006 to August 2015 and count the number of submissions that have a vote score of +1 (the original poster's (OP) own automatic upvote) and zero comments. On my machine, I've put the data into an SQLite database for easy querying:

sqlite> SELECT count(*) FROM submissions WHERE num_comments=0 AND score=1;

Now divide by total submission count:

sqlite> SELECT count(*) FROM submissions;

46403581 / 196500000 = 23.615053944%
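If you'd rather get the percentage in one go, here's a minimal sketch of the same calculation in Python. It runs against a tiny in-memory table with made-up rows, not the real dump; only the table and column names (`submissions`, `score`, `num_comments`) come from the queries above.

```python
import sqlite3

# Toy stand-in for the real database: made-up (score, num_comments) pairs.
rows = [(1, 0), (1, 0), (1, 3), (5, 0), (120, 14), (1, 0), (2, 1), (40, 9)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE submissions (score INTEGER, num_comments INTEGER)")
conn.executemany("INSERT INTO submissions VALUES (?, ?)", rows)

# SQLite comparisons evaluate to 0 or 1, so SUM() over the condition
# counts the ignored submissions in the same pass as COUNT(*).
ignored, total = conn.execute(
    "SELECT SUM(num_comments = 0 AND score = 1), COUNT(*) FROM submissions"
).fetchone()

ignored_pct = 100.0 * ignored / total
print(f"{ignored} / {total} = {ignored_pct:.1f}%")  # 3 / 8 = 37.5% for the toy rows
```

Pointing `sqlite3.connect` at the real database file instead of `:memory:` (and skipping the CREATE/INSERT lines) should give the figure above.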

So about 24% of Reddit submissions are completely ignored. The fact that some posts get more attention than others shouldn't come as a shock; that's just how (and why) Reddit works as a news aggregator and social platform. But how is attention distributed amongst the other 76%? What does the curve from obscurity to virality look like?

A while ago I posted a frequency distribution of upvotes/score for all Reddit submissions from 2006 to August 2015:

35% of Reddit submissions have 1 upvote

I know the log-log scale isn't very intuitive for the uninitiated, but I was pleased to find that the guys at /r/dataisbeautiful liked it regardless. I considered plotting it with the standard linear axes, but decided against it because, well, here it is:

Awfully uninformative linearly scaled plot

Awfully uninformative.

All the datapoints hug the x and y axes. That's because the data roughly follows a power law, and a pretty strong one, too. Most points have values below 1,000, but there are a few really high values. If we zoom out to fit them all in the plot, all the lower value points (where most of the juicy information is) get smushed together.

That's where the logarithmic scale shines. It pulls apart the lower values and squishes together the higher ones so that, in the case of a power law, it redistributes space on the plot more evenly, allowing us to see detailed differences across orders of magnitude.

I'm a big fan of log-log plots, but there's another reason that plot wasn't very intuitive. "Frequency" is a confusing term here, and that axis is probably better expressed as a percentage anyway. For added shoutability at political rallies, I've made it a cumulative percentage, meaning each value on the x-axis represents the top x% of submissions. I've also added comment count to the mix.

Edit December 9th, 2016: These numbers were gathered before Reddit retroactively changed how voting works, so today's situation will probably look different. Most notably, soft-capping very high scores is no longer a thing.

Distribution of score and comments of all Reddit submissions

Neat! Comment count and score follow a similar curve (as you might expect, since comment count and score are correlated, as we'll see in a bit).

If you're confused, here's one way to read this graph: if you look at where the score line reaches 1,000 on the y-axis, and then check where that is on the x-axis, you'll see it's about 0.6% (remember this is a logarithmic scale). What this means is that only about 0.6% of all submissions have a score of 1,000 or higher. Getting at least 1,000 comments is harder: only 0.06% of submissions get there — ten times as few.
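For the code-minded, here's a rough sketch of what sits behind a curve like this: "top x%" is simply the share of submissions at or above a given value. The scores below are made up for illustration, and `percent_at_least` is a hypothetical helper name, not something from the original analysis.

```python
import numpy as np

def percent_at_least(values, threshold):
    """Percentage of entries in `values` that are >= threshold."""
    values = np.asarray(values)
    return 100.0 * np.count_nonzero(values >= threshold) / values.size

# Made-up scores, just to show the mechanics.
scores = np.array([1, 1, 1, 2, 3, 5, 8, 40, 120, 1500])

print(percent_at_least(scores, 1000))  # 10.0: one submission in ten reaches 1,000

# The full curve: sort from highest to lowest, then pair each value
# with the percentage of submissions that rank at or above it.
sorted_scores = np.sort(scores)[::-1]
top_percent = 100.0 * np.arange(1, scores.size + 1) / scores.size
```

Plotting `top_percent` against `sorted_scores` on log-log axes would give a curve of the same kind as the one in the graph.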

By the time the score line reaches 10,000 on the y-axis, it belongs to the top 0.0001% of Reddit. This means only 1 in every 1,000,000 submissions gets a score of 10,000 or more. However, getting 10,000 comments is actually easier: 0.002% of submissions (1 in 50,000) get there. The chances of getting a higher score or higher comment count cross over at around 6,000 on the y-axis. It seems like score maxes out at this point and it's exceedingly difficult to get into the ten-thousands, while comment count keeps on truckin'. This is probably Reddit's soft score cap in action. It checks out if you take a look at /r/all: the scores of top submissions tend to hover between 4,000 and 6,000.

Two things stand out in the smaller inset:

It seems a lot of Reddit submissions go unnoticed. They start with a score of 1 (the OP's automatic upvote) and no comments and stay that way as they quickly fade into oblivion.

If you remember what you learned in high school about probabilities, you might be tempted to multiply the two shares (35% of submissions have a score of 1, 42% have 0 comments) and conclude that any given Reddit submission has a 0.35*0.42*100 = 14.7% chance of having 0 comments and a score of 1 (i.e. being completely ignored). Wait, that's not even close to the 24% we got earlier?!

Hold on! That kinda math only works if the probabilities involved are completely independent, i.e. uncorrelated (meaning one doesn't influence the other, the same way winning the lottery today doesn't change your chance of winning tomorrow). Have you ever seen a front page post with 0 comments? No way upvotes and comments aren't correlated. Let's check by plotting each submission's score versus its comment count:
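A quick numeric aside before the plot. For independent events, P(A and B) equals P(A) * P(B); when the observed joint share is bigger than that product, the two tend to occur together. That's what we see here: 24% observed versus 14.7% predicted. The sketch below shows the same comparison on a handful of made-up submissions (illustrative values only, not Reddit data).

```python
# Made-up (score, num_comments) pairs, chosen so the arithmetic is easy to follow.
submissions = [(1, 0), (1, 0), (1, 0), (1, 2), (3, 0), (7, 4), (25, 11), (900, 150)]
n = len(submissions)

p_score_1 = sum(1 for s, c in submissions if s == 1) / n            # share with score 1
p_no_comments = sum(1 for s, c in submissions if c == 0) / n        # share with 0 comments
p_both = sum(1 for s, c in submissions if s == 1 and c == 0) / n    # observed joint share

# What the joint share would be if the two events were independent.
p_if_independent = p_score_1 * p_no_comments

print(p_if_independent, p_both)  # 0.25 vs 0.375: ignored posts are over-represented
```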

Amorphous blobs are SO INFORMATIVE!

I love looking at scatterplots that look like amorphous blobs. Maybe it's because I spent so much time squinting at amorphous blobs as a molecular biology student. Or perhaps it's because you can get so much information from them.

We already know 24% of all submissions have a score of 1 and 0 comments, but this plot shows that by far the majority of the remaining 76% don't fare much better. Those that rise above a score of 3,000 and/or a comment count of 500 appear only as a thin haze. If you think about it, this graph also serves as a 2D histogram. If you were to collapse all points into the x-axis, you'd get a standard score histogram. You can get the same info by looking at the density of the point cloud, with the added bonus of being able to look at both comment count and score at the same time. This allows you to see the shape of the curve from obscurity in the lower left to virality in the upper right, and how the relation between comment count and score comes into play.

For example: submissions with over 500 comments are quite common, but only if they have a score of under 500. Also, there's a surprising number of submissions with a score of up to 2,500, yet only a handful of comments.

You can find some interesting finer details, too. There's a vertical line near the left edge showing lots of submissions with only about a dozen upvotes have a comment thread going into the thousands. Perhaps these are very controversial posts that spark huge discussions with a very divided audience (leading to a balanced upvote/downvote count). That splotch at 1,000 comments along this line is probably /r/counting. There's another smaller splotch below it at around 700 comments — anyone know what causes that? Edit May 4th: Turns out that's /r/counting's Letters Thread. Thanks, /u/Mooraell!

Note: the graph actually goes on for much farther than is displayed — the most upvoted submission in the dataset has a score of 56,263!

Anyway, it may not be immediately obvious, but number of comments and score are definitely correlated (Pearson correlation coefficient of 0.36). Most of that correlation is in the lower values, which are all squished together into a big blob. Let's pull those apart using logarithmic axes:

Sweet nectar from the log-log tree

Aaah, that's the stuff!

The correlation is much more apparent here. And again, you can see /r/counting making its mark at 10^3 = 1000 comments per submission in the 20 to 30 upvotes range. Also, this graph contains all 196.5 million submissions — there's nothing beyond the edges.
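To get a feel for why the log-log view makes the relationship easier to see, here's a small sketch using `np.corrcoef`. The data is a made-up toy where comment count follows an exact power law of score, which is an assumption for illustration and far cleaner than the real thing. Real data also contains zero comment counts, which have no logarithm and would need to be filtered or shifted first; the toy data sidesteps that by using positive values only.

```python
import numpy as np

# Toy data: comments = 3 * sqrt(score), an exact power law (assumed, not measured).
scores = np.array([1, 2, 5, 10, 100, 1000, 50000], dtype=float)
comments = 3.0 * np.sqrt(scores)

# Pearson correlation on the raw values: dominated by the few huge points.
r_raw = np.corrcoef(scores, comments)[0, 1]

# Pearson correlation on the log-transformed values: a power law becomes
# a straight line in log-log space, so the correlation is perfect here.
r_log = np.corrcoef(np.log10(scores), np.log10(comments))[0, 1]

print(round(r_raw, 3), round(r_log, 3))
```

On this toy the raw coefficient comes out below 1 while the log-log one is 1, even though the relationship is perfectly deterministic. Pearson only measures linear association, so a modest coefficient on raw heavy-tailed data doesn't rule out a strong relationship.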

If you're wondering why you see those lines at the bottom left, that's because scores and comment counts can only be whole numbers (integer values). The logarithmic scale has pulled apart the lower numbers, showing the empty space between them (10^0 = 1, and 10^1 = 10, for reference).
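You can see the same effect in a few lines of numpy: the gap between neighbouring integers on a log axis shrinks as the numbers grow. This is just arithmetic on the integers 1 through 10, no Reddit data involved.

```python
import numpy as np

values = np.arange(1, 11)       # the integers 1..10
positions = np.log10(values)    # where each integer lands on a log axis
gaps = np.diff(positions)       # distance between neighbouring integers

# The gap between 1 and 2 is log10(2), roughly 0.30 of a decade;
# the gap between 9 and 10 is log10(10/9), under 0.05 of a decade.
print(np.round(gaps, 3))
```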

So now we have a visual representation of how much attention (measured in comments and upvotes) a submission gets. But unless you're one of those people who like to carry around charts, you probably want some impressive-sounding numbers to chant at political rallies.

If we define "attention" as score + comment count, we can calculate what share of Reddit's total "attention" the top 1% of Reddit submissions gets:

import sqlite3
import numpy as np

conn = sqlite3.connect('reddit_submissions.sqlite')
c = conn.cursor()
c.execute('SELECT score, num_comments FROM submissions;')
OH_THE_REDDANITY = np.array(c.fetchall())
attention = np.sum(OH_THE_REDDANITY, axis=1)  # sum across columns (score + comment count)
attention = np.sort(attention)[::-1]  # sort from highest to lowest
top_n = int(len(attention) * 0.01)  # size of the top 1% (slice indices must be integers)
the_one_percent = np.sum(attention[:top_n])  # sum attention of top 1%
print(100.0 * the_one_percent / np.sum(attention))  # divide by total attention to get percentage

So the top 1% of Reddit submissions gets 48% of the attention. Not quite as impressive as Bernie's numbers, but it might turn some heads.

The question is: is this a bad thing, or is Reddit working as intended? The traditional spouting of strongly worded opinions on the internet to establish dominance is left as an exercise to the reader.

Edit May 4th, 2016: /u/minimaxir wanted to look into the over-saturated part of the amorphous blob, so here's a 2D histogram where color represents density. Note that even though I used a logarithmic color scale, I still had to cap it at 10^4 to prevent the bottom left bin from drowning out the rest:

And here's the same thing with log-log axes:
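For anyone who wants to try something similar, here's a rough sketch of how a capped, log-colored 2D histogram could be put together with numpy. It is not the exact code behind these plots: the data is synthetic (heavy-tailed random numbers standing in for score and comment count), and the bin count and axis range are arbitrary assumptions. Only the cap of 10^4 comes from the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic heavy-tailed stand-ins for score and comment count (not Reddit data).
score = np.floor(rng.pareto(1.0, 100_000)) + 1
comments = np.floor(rng.pareto(1.0, 100_000))

# Count submissions per (score, comments) bin; bins and range are arbitrary choices.
counts, xedges, yedges = np.histogram2d(
    score, comments, bins=50, range=[[0, 500], [0, 500]]
)

# Cap the counts so the crowded bottom-left bin doesn't drown out the rest.
cap = 1e4
capped = np.minimum(counts, cap)

# Logarithmic color value per bin; empty bins become NaN so they can be
# drawn as background instead of as log10(0).
with np.errstate(divide="ignore"):
    color = np.where(capped > 0, np.log10(capped), np.nan)

print(counts.max(), np.nanmax(color))
```

The `color` array can be handed to any plotting library that draws a grid of values. For the log-log version, one option is to pass log-spaced bin edges (for example from `np.logspace`) as `bins` instead of a fixed count.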
