The Big Bang of the Reddiverse: growth in posts per day, broken down by subreddit
Ever since stumbling across this awesome dataset of all Reddit submissions from about 2006 to August 2015, I've been trying to find neat ways of visualizing such a vast amount of data.
I wanted to look at Reddit's growth over time. I already made a simpler plot of posts per day across all of Reddit, and I thought it would be cool to break this down by subreddit. Not just a few of my favorite subreddits, no: all 430 thousand of them.
The dataset stretches from about 2008 (data before that is incomplete) to August 2015. That's 430,434 subreddits over 3,507 days, otherwise known as a shit-ton (metric, mind you) of data. Calculating the daily number of submissions per subreddit is easy enough, but that's still 430,434 * 3,507 ≈ 1.5 billion datapoints. That's not going to fit into a single plot without downsampling away 99.9% of the detail. That's not a figure of speech: if I gave each datapoint only 1 pixel in a 1920x1080 image, only 1920 * 1080 / 1.5 billion ≈ 0.1% of them would fit (using a 4K monitor wouldn't change much: 3840 * 2160 / 1.5 billion ≈ 0.5%). My solution? Use lots of images in a sequence, more colloquially known as a 'video'.
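For anyone who wants to check that arithmetic, here is a minimal sketch in Python. The subreddit and day counts come straight from the numbers above; the function name is just something I made up for illustration.

```python
# Figures quoted in the text above.
N_SUBREDDITS = 430_434
N_DAYS = 3_507


def fraction_shown(width, height, n_points):
    """Fraction of datapoints that fit if each one gets exactly one pixel."""
    return min(1.0, width * height / n_points)


n_points = N_SUBREDDITS * N_DAYS
full_hd = fraction_shown(1920, 1080, n_points)
uhd_4k = fraction_shown(3840, 2160, n_points)

print(f"{n_points:,} datapoints")
print(f"1920x1080 shows {full_hd:.2%} of them")
print(f"3840x2160 shows {uhd_4k:.2%} of them")
```

Either way, well under 1% of the data fits into one frame.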
In the video below, each pixel represents a subreddit, and brightness represents its daily number of posts. I've sorted the subreddits by age, with the oldest ones sitting in the center and the youngest ones at the edges.
Note that I set brightness to max out at 10 submissions per day so that ludicrously popular subreddits like /r/pics don't drown out the rest. It does turn it into kind of an all-or-nothing affair for the innermost subs, but it shows a lot more interesting detail than setting the visual maximum to the real maximum of the data (I've tried).
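To make the mapping concrete, here is a rough sketch of how a single frame could be built. Only the cap of 10 submissions per day and the "oldest in the center" ordering come from the description above. Ordering pixels by straight-line distance from the image center, the 8-bit grey scale, and all function names are assumptions for illustration, not necessarily how the video was actually rendered.

```python
import math

MAX_POSTS = 10  # visual maximum mentioned in the text


def brightness(posts_per_day, cap=MAX_POSTS):
    """Map a daily post count to an 8-bit grey level that saturates at `cap`."""
    clipped = max(0, min(posts_per_day, cap))
    return round(255 * clipped / cap)


def center_out_layout(width, height):
    """Return every (x, y) pixel, ordered from the image center outward."""
    cx, cy = (width - 1) / 2, (height - 1) / 2
    pixels = [(x, y) for y in range(height) for x in range(width)]
    # Assumption: "center to edge" means increasing Euclidean distance.
    pixels.sort(key=lambda p: math.hypot(p[0] - cx, p[1] - cy))
    return pixels


def render_frame(posts_by_age, width, height):
    """Build one frame; `posts_by_age` lists one day's counts, oldest subreddit first."""
    frame = [[0] * width for _ in range(height)]
    for (x, y), posts in zip(center_out_layout(width, height), posts_by_age):
        frame[y][x] = brightness(posts)
    return frame
```

Calling render_frame once per day and stitching the frames together would give the video. The clipping in brightness is what keeps a subreddit with thousands of daily posts from looking any brighter than one with ten.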
Looking at the video, most subreddits are barren: part of the vast black void separating the few and far between that shine. They spawn with a flash of activity (as seen from the bright ring of newly born subreddits snaking its way around), but they don't last long. Some die after holding on for a few days (like most timeline subreddits), some flicker on and off like a broken light bulb, and almost none are ever revived once dead. Only every so often is a subreddit born that establishes a community and becomes a permanently shining star in the Reddit universe.
It's clear older subreddits tend to be more active: there's a bright core at the center of the Reddiverse. Every now and then there are also weird flashes where whole groups of subreddits become very active for a short while, most notably near the end of the video. I've dug into the data and identified some of the things that caught my eye:
2012-05-06 through 2012-09-17 (top left)
Frontbot subreddits. The bot posted the front page posts for several popular subreddits (e.g. /r/pics) every few hours. It was shut down after a week and a half, but I couldn't find out why. Perhaps people felt it was messing with Reddit's search feature?
2013-02-16 (to the left)
Wave of spam subreddits. All of them got exactly 61 submissions on the 16th and exactly 81 on the 17th made by the same few accounts (some of which have since been banned), so it looks like the work of a spambot.
2014-02-17 through 2014-05-15
More spambot subreddits in the top right (and some in the lower right). These remain active for several months.
2014-12-20 through 2014-12-23
Another burst of spambot activity, but in older 'sleeper cell' subreddits dating from around February 22nd, 2014. Lasts 3 days.
2015-05-30 through 2015-06-13
Huge spambot wave lasting two weeks in two older groups of 'sleeper cell' subreddits: one originally created in March 2014, the other in November 2014 (their creation had already been caught in this /r/dataisbeautiful post).
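A telltale sign in several of these cases is many subreddits getting exactly the same number of submissions on the same day. The sketch below shows one way to look for that pattern. It is not the procedure I actually followed. The input format, the thresholds, and the subreddit names in the example are all hypothetical. Only the count of 61 on 2013-02-16 comes from the list above.

```python
from collections import defaultdict


def identical_count_groups(daily_counts, min_group=3, min_posts=20):
    """Find groups of subreddits sharing the exact same post count on the same day.

    daily_counts: iterable of (subreddit, date, n_posts) tuples.
    Returns {(date, n_posts): sorted list of subreddits}.
    """
    groups = defaultdict(set)
    for subreddit, date, n_posts in daily_counts:
        # Small counts match by chance all the time, so skip them.
        if n_posts >= min_posts:
            groups[(date, n_posts)].add(subreddit)
    return {
        key: sorted(subs)
        for key, subs in groups.items()
        if len(subs) >= min_group
    }


# Made-up example rows; the subreddit names are placeholders.
example = [
    ("spam_a", "2013-02-16", 61),
    ("spam_b", "2013-02-16", 61),
    ("spam_c", "2013-02-16", 61),
    ("busy_sub", "2013-02-16", 950),
    ("tiny_sub", "2013-02-16", 1),
]
suspicious = identical_count_groups(example)
print(suspicious)
```

A hit from something like this is only a lead. Checking which accounts made the submissions is what actually points to a bot.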
Interestingly, when you do plot this data in a single image (heavily downsampled), it naturally shows how the number of subreddits grows over time, with daily posts as a sort of bonus in the colormap.
Leaving out the reference scale for the colormap is intentional: since the data is heavily downsampled across the y-axis (subreddits), each value is the average of a few hundred neighboring (in terms of age) subreddits, so the absolute values have no real meaning anymore. The purpose of the color in this plot is mainly to show large trends between older and newer subreddits.
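The kind of downsampling described here can be sketched as plain block averaging. This is an illustration under assumptions: the data is taken to be a list of rows (one per subreddit, sorted by age), each holding that subreddit's daily counts, and the function name and block size are placeholders rather than the actual plotting code.

```python
def downsample_rows(matrix, block):
    """Average each run of `block` consecutive rows into a single row.

    matrix: list of equal-length rows, one per subreddit, sorted by age.
    The last group may contain fewer than `block` rows.
    """
    out = []
    for start in range(0, len(matrix), block):
        chunk = matrix[start:start + block]
        # zip(*chunk) walks the chunk day by day (column by column).
        out.append([sum(day) / len(chunk) for day in zip(*chunk)])
    return out


# Four subreddits, two days each, averaged in pairs of neighbors.
small = [[0, 10], [2, 0], [4, 2], [6, 4]]
print(downsample_rows(small, 2))
```

In the first pair, one subreddit with 10 posts and one with 0 both end up as a 5, which is exactly why the absolute values stop meaning much.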
I'm not sure if this is the best way to show this data. Probably not, depending on what question you're trying to answer. But at the very least, it's a good reminder that line plots, bar charts, and networked node graphs aren't the only visualization options out there. Get creative!