Introduction to computational/data-driven/digital methods of researching Reddit: 11 ways to explore the data

My friend Jason Baumgartner (stuck_in_the_matrix on Reddit) performs an extremely valuable service in pulling every single comment and post from Reddit and storing them on a site that anyone can access (https://files.pushshift.io/reddit/). In contrast with Facebook (which just blocked Netvizz, the main tool academics used to scrape its data) and Instagram (which is similarly restrictive about the data you can use), this makes Reddit an extremely open platform for research and analysis.

Given that Reddit is one of the most popular sites in the world (particularly in the US), it’s a shame that so little research has been done on its social dynamics. A quick shout-out to existing work:

Adrienne Massanari has written a book, as well as multiple papers on Reddit, with a particular focus on the way it facilitates the creation of “toxic techno cultures” like GamerGate and The Fappening.

Alex Halavais has a paper in First Monday which looks at a number of different subreddits and thinks about how reddit users draw upon evidence.

My own work has been published on Quartz, the New Statesman, and the LSE Impact Blog, using computational methods based on Jason’s data to analyse the linguistic practices of the alt-right and show how the various communities are connected. I do some temporal analysis of the word “cuck”, and also look at the “dictionary” of the alt-right. I’ve also made arguments about why Reddit refuses to ban The_Donald, a notorious source of hate speech. And then there’s my Reddit fanboy article.

If there’s more, let me know and I’ll happily add it.


How you can use computational methods and pushshift.io to analyse Reddit

Anyway, here are some of the things you can do with Reddit, illustrated through the medium of white supremacist communities.

1. Basic word cloud analysis

You can generate the most commonly used words from a subreddit or author. Here’s The_Donald for the last 90 days:

Word cloud for The_Donald

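As a rough sketch of how this works (my own toy code, not the actual pushshift tooling): fetch comment bodies, count the words, and hand the frequencies to a renderer. Only the pushshift URL below comes from Jason’s documentation; the stopword list and sample data are placeholders.

```python
import re
from collections import Counter

# A tiny stopword list for illustration; a real analysis needs a fuller one.
STOPWORDS = {"the", "a", "an", "and", "to", "of", "in", "is", "it", "that", "for"}

def top_words(bodies, n=10):
    """Count the most common words across comment bodies (the 'body'
    field of objects returned by the pushshift comment search API)."""
    counts = Counter()
    for body in bodies:
        for word in re.findall(r"[a-z']+", body.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts.most_common(n)

# Toy data; in practice, fetch e.g.
# https://api.pushshift.io/reddit/search/comment/?subreddit=The_Donald&size=500
comments = ["MAGA MAGA energy", "high energy post", "such energy"]
print(top_words(comments, 3))  # → [('energy', 3), ('maga', 2), ('high', 1)]
```

To render an actual cloud, a library like `wordcloud` can take these frequencies, e.g. `WordCloud().generate_from_frequencies(dict(top_words(comments)))`.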

2. Word clouds based on phrases

You can create word clouds based on which terms are most likely to turn up in the same comment as a word or phrase, like these for “jew” in The_Donald and then the subreddit ChadRight:

Word cloud for "jew" in The_Donald


Word cloud for "jew" in r/chadright

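The co-occurrence version is a small twist on the same idea: keep only the comments that contain the target term, then count everything else in them. Again, this is my own illustrative sketch rather than the actual implementation.

```python
import re
from collections import Counter

def cooccurring_words(bodies, target, n=10):
    """Among comments that contain `target`, count the other words
    appearing alongside it."""
    counts = Counter()
    for body in bodies:
        words = re.findall(r"[a-z']+", body.lower())
        if target in words:
            counts.update(w for w in words if w != target)
    return counts.most_common(n)

comments = ["the jew question", "jew jokes again", "nothing relevant here"]
print(cooccurring_words(comments, "jew", 5))
```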

3. Subreddit word frequency analysis

You can look at which subreddits are using a word or phrase the most, like this graph of the word “cuck” over the last 90 days:

Subreddits most commonly using "cuck" over most recent 90 days

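A minimal sketch of the grouping step, assuming pushshift-shaped comment dicts (the sample data is invented):

```python
from collections import Counter

def subreddits_using_term(comments, term):
    """Count, per subreddit, how many comments mention `term`.
    `comments` are dicts with 'subreddit' and 'body' fields, the shape
    the pushshift comment search API returns."""
    counts = Counter()
    for c in comments:
        if term in c["body"].lower():
            counts[c["subreddit"]] += 1
    return counts.most_common()

sample = [
    {"subreddit": "The_Donald", "body": "what a cuck"},
    {"subreddit": "The_Donald", "body": "total cuck move"},
    {"subreddit": "politics", "body": "Cuck is a silly insult"},
]
print(subreddits_using_term(sample, "cuck"))  # → [('The_Donald', 2), ('politics', 1)]
```

The pushshift API can also do this server-side via its `aggs=subreddit` parameter, which saves downloading the comments at all.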

4. Taking account of subreddit size

You can adjust the parameters of these functions, for example normalising the function so that it takes account of the proportion of comments containing the phrase, rather than the absolute number:

Subreddit frequency analysis of "cuck", normalised for subreddit size

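Normalisation just means dividing each subreddit’s hit count by its total comment count. A toy sketch (my own function names and data):

```python
from collections import Counter

def normalised_term_rate(comments, term):
    """Proportion of each subreddit's comments mentioning `term`,
    rather than the raw count."""
    totals, hits = Counter(), Counter()
    for c in comments:
        totals[c["subreddit"]] += 1
        if term in c["body"].lower():
            hits[c["subreddit"]] += 1
    return {sub: hits[sub] / totals[sub] for sub in totals}

sample = [
    {"subreddit": "big_sub", "body": "cuck"},
    {"subreddit": "big_sub", "body": "hello"},
    {"subreddit": "big_sub", "body": "hello again"},
    {"subreddit": "small_sub", "body": "cuck"},
]
print(normalised_term_rate(sample, "cuck"))  # small_sub: 1.0, big_sub: ~0.33
```

This is why normalisation matters: the big subreddit has more raw hits overall in real data, but the small one can be far more saturated with the term.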

5. Detecting speech patterns in subreddits (e.g. hate speech & slurs)

Using a combination of terms, you can, for example, compile a list of the subreddits which use racial slurs the most. Here’s a normalised graph of the subreddits which most use the following terms: bitch|cunt|nigger|niggers|fucker|libtard|libtards|cucks

Normalised frequency of slurs by subreddit

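That pipe-separated list is essentially a regular-expression alternation. A sketch of how the matching could work (abbreviated term list, hypothetical function name), with word boundaries so partial matches inside longer words don’t count:

```python
import re

# The post's term list joined with '|', abbreviated here; \b word
# boundaries stop terms matching inside longer words.
SLUR_PATTERN = re.compile(r"\b(bitch|libtard|libtards|cuck|cucks)\b", re.IGNORECASE)

def contains_slur(body):
    return bool(SLUR_PATTERN.search(body))

print(contains_slur("what a Libtard"))  # → True
print(contains_slur("libtarded"))       # → False: boundaries block partial matches
```

Feeding `contains_slur` into the normalised-rate function from the previous section gives the per-subreddit graph above.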

6. Analysis of authors’ comments & activity

You can take an individual user and look at the subreddits in which they are most active. Let’s use reddit CEO Steve Huffman’s account as an example here:

Steve Huffman (spez) subreddit activity

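The per-author view is the same counting trick keyed the other way round. A minimal sketch, assuming comments already fetched for one author (e.g. via pushshift’s `author` parameter; the sample data is invented):

```python
from collections import Counter

def author_activity(comments):
    """Given one author's comments, count how many fall in each
    subreddit, most active first."""
    return Counter(c["subreddit"] for c in comments).most_common()

sample = [
    {"subreddit": "announcements"},
    {"subreddit": "announcements"},
    {"subreddit": "modnews"},
]
print(author_activity(sample))  # → [('announcements', 2), ('modnews', 1)]
```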

7. Which users are saying words the most?

You can see which users are using a particular phrase the most. For example, here’s the list of users whose comments most commonly include a racial slur over the last 90 days:

Users whose comments most commonly include racial slurs
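This combines the previous two ideas: filter comments by a pattern, then group by author instead of subreddit. A toy sketch (names and data are mine):

```python
import re
from collections import Counter

def top_authors(comments, pattern):
    """Rank authors by how many of their comments match a compiled
    regex (e.g. the slur pattern from section 5)."""
    return Counter(
        c["author"] for c in comments if pattern.search(c["body"])
    ).most_common()

slur = re.compile(r"\bcucks?\b", re.IGNORECASE)
sample = [
    {"author": "user_a", "body": "what a cuck"},
    {"author": "user_a", "body": "cucks everywhere"},
    {"author": "user_b", "body": "hello"},
]
print(top_authors(sample, slur))  # → [('user_a', 2)]
```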

8. Which places do subreddits link to?

You can find which sites are most commonly linked to by a given author or subreddit, seen with The_Donald here:

The_Donald outward links by domain frequency

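A sketch of the domain extraction, using the standard library’s URL parser (the regex, function name, and sample data are my own illustrations):

```python
import re
from collections import Counter
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://\S+")

def linked_domains(bodies):
    """Count the domains of URLs appearing in comment bodies,
    collapsing 'www.' so variants of a site count together."""
    counts = Counter()
    for body in bodies:
        for url in URL_RE.findall(body):
            domain = urlparse(url).netloc.lower()
            if domain.startswith("www."):
                domain = domain[4:]
            counts[domain] += 1
    return counts.most_common()

sample = [
    "see https://www.breitbart.com/article and http://youtube.com/watch?v=x",
    "more at https://breitbart.com/other",
]
print(linked_domains(sample))  # → [('breitbart.com', 2), ('youtube.com', 1)]
```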

9. Time period analysis of word frequency

You can look at how frequently a word has been used over time, as with “cuck” here:

Timeline analysis of the frequency of "cuck"

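Time-series analysis hinges on the `created_utc` epoch timestamp that every pushshift comment carries. A sketch that buckets mentions by calendar month (toy data; a real run would feed in months of comments):

```python
from collections import Counter
from datetime import datetime, timezone

def monthly_counts(comments, term):
    """Count comments mentioning `term` per calendar month, using the
    'created_utc' epoch timestamp on each comment."""
    counts = Counter()
    for c in comments:
        if term in c["body"].lower():
            dt = datetime.fromtimestamp(c["created_utc"], tz=timezone.utc)
            counts[dt.strftime("%Y-%m")] += 1
    return dict(sorted(counts.items()))

sample = [
    {"body": "cuck", "created_utc": 1493600000},    # May 2017
    {"body": "cuck", "created_utc": 1496300000},    # June 2017
    {"body": "nothing", "created_utc": 1496300000},
]
print(monthly_counts(sample, "cuck"))  # → {'2017-05': 1, '2017-06': 1}
```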

10. Time-of-day analysis

You can look at when a user posts the most, or when a subreddit is most active, and from that potentially deduce which time zone they (or a plurality of their users) are likely to be in. Here’s The_Donald vs r/politics:

r/politics time of day post frequency

The_Donald time of day post frequency
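The hour-of-day histogram falls out of the same timestamps. A minimal sketch (my own code; a peak around 01:00 UTC, say, would hint at an evening-time US audience):

```python
from collections import Counter
from datetime import datetime, timezone

def activity_by_hour(comments):
    """Histogram of comment activity by UTC hour of day, from the
    'created_utc' epoch timestamps."""
    hours = Counter(
        datetime.fromtimestamp(c["created_utc"], tz=timezone.utc).hour
        for c in comments
    )
    return [hours[h] for h in range(24)]

base = 1493596800  # 2017-05-01 00:00 UTC
sample = [
    {"created_utc": base},           # hour 0
    {"created_utc": base + 3600},    # hour 1
    {"created_utc": base + 3600},    # hour 1
]
hist = activity_by_hour(sample)
print(hist[0], hist[1])  # → 1 2
```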

11. Word association based on Natural Language Processing

Finally (for now), there’s a function called “describe”, which uses Natural Language Processing to find the words and sentences in which a word or phrase has been situated. That might sound confusing, but it’s reasonably intuitive when you see the result. Let’s look at “jews” in The_Donald:

white | not white | descended from swine and apes | massacred in morocco | a treacherous | a bunch of paedophiles | hated by allah to the extent that they are destined for eternal doom as a result of their beliefs | liberals | done on purpose thousands of years ago | simply telling their own to save the nastiness for gentile kids | a nation of liars | their generals | bad | supposed to be back when christ returns at the rapture so he can go back to dealing with them | really touchy when it comes to group criticism | on their way to hell | oppressing us” vs “white people oppressed black people” | inconsequential | white or not | the new fad by the left now | liberal | nazis | the color white | often without a seat when the music stops | no problem

Let’s try “jews” in r/chadright:

in fact | eveil monsters | vastly over-represented in ivy league and are classified as white – that make it harder for non-jewish whites to get into ivy league | satanists and control all the evil in the west | overrepresented in position of power just as feminists are concerned that men are overrepresented in positions of power | just bigoted nonsense | not semite so you can’t be anti-semitic to asheknazi | jewing each other and in the end it may be even worse to them than it is to us | cursed and covered with malediction | the best | arabs | bigoted | so low iq it is incredible | evil monsters who want to rule the worldwe never talked about it | fucking kids as your universal message | oppressed and need to stand up for themselves | not anti-semitic | certainly important but one would have to show they were motivated by their perception of jewish interests | a monolith out to destroy white people as there are nonreligious jews | very  successful in this country | god’s chosen and we don’t deserve the same privilege and wealth they enjoy
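The real “describe” function uses proper NLP parsing; a crude regex stand-in conveys the idea of extracting what a community predicates of a subject. Everything below (function, pattern, sample comments) is my own illustration:

```python
import re

def describe(bodies, subject, max_len=60):
    """Crude approximation of a 'describe' function: pull out what
    follows '<subject> are ...' in each comment, truncated at the
    first sentence boundary."""
    pattern = re.compile(rf"\b{re.escape(subject)}\s+are\s+(.+)", re.IGNORECASE)
    found = []
    for body in bodies:
        m = pattern.search(body)
        if m:
            clause = re.split(r"[.!?]", m.group(1))[0].strip()
            found.append(clause[:max_len])
    return found

sample = [
    "I think jews are a nation of liars. Honestly.",
    "the jews are really touchy when it comes to group criticism",
    "no mention of the subject here",
]
print(describe(sample, "jews"))
```

A dependency parser (e.g. spaCy) would do much better, catching passives and other constructions this regex misses.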


Computational methods for analysing reddit: how to get involved

So there you have it: eleven different functions for analysing data from Reddit, from 2006 to the present. There’s enormous scope for combining these functions, for example by altering time periods, subreddits, words, or authors. It would be really great to see more people using these methods to analyse Reddit. If you’re interested in learning more, ping me or Jason. If you want to help support what is an extremely time- and resource-intensive (but worthwhile) project, Jason has a Patreon: https://www.patreon.com/pushshift.

 

Digital Methods: when data could be dangerous

For the last week, I’ve been attending the Digital Methods Initiative Summer School at the University of Amsterdam. It’s a conceptually interesting project that attempts to use the digital to study the digital. The theory behind it is that there is a lot of work which uses traditional methods to study the internet or the digital (e.g. interviews, ethnographies, surveys, and so on); likewise, there’s quite a bit that uses the digital to study the “real world” (digital archives, interviews conducted over Skype or email, OCR programmes etc). However, there’s comparatively little academic research that uses the digital to study the digital, and that’s the gap in the market that DMI attempts to fill.

This looks like using digital tools and Snowden’s data leaks to study mass surveillance, using content analysis to study climate change, or studying Wikipedia as a site of cultural heritage. In the case of the group I’ve been working with, we’ve been collaborating with the British Home Office and using various tools to study the Alt-Right. Others in the group have used tools that scrape data from YouTube, Twitter and Facebook and allow them to map the networks that result from this.

As my own PhD primarily uses Reddit as its site of analysis, and I wanted to get some methodological skills out of this summer school, I decided to join the project and help to map the origin, spread and use of the language of the Alt Right across Reddit. To that end, I’ve been using Google’s BigQuery API, along with good old Microsoft Excel, to look at the words “cuck” and “kek”, and use them as a window onto the social dynamics of Alt Right communities.
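For the curious, a query along these lines is the basic unit of this kind of work. The table name below is an assumption (a well-known public BigQuery mirror of the pushshift data; check the current dataset name before relying on it), and the helper function is my own:

```python
def monthly_term_query(term, table="fh-bigquery.reddit_comments.2017_05"):
    """Build a BigQuery SQL query counting, per subreddit, comments
    that contain `term`. The default table name is an assumption:
    verify the dataset before running it."""
    return (
        "SELECT subreddit, COUNT(*) AS n "
        f"FROM `{table}` "
        f"WHERE LOWER(body) LIKE '%{term}%' "
        "GROUP BY subreddit ORDER BY n DESC LIMIT 20"
    )

print(monthly_term_query("cuck"))
# To run it (needs Google Cloud credentials):
#   from google.cloud import bigquery
#   rows = bigquery.Client().query(monthly_term_query("cuck")).result()
```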

Methodological Reflections

I’ve already written quite a lot on my empirical findings, which I’ll publish in due course, but here I just wanted to take some time to reflect on the things I’ve learned about methodology so far. For my own research, I primarily use qualitative methods: I look at all of the new posts published on the Paleo subreddit, and then I categorise them by topics of conversation and use that categorisation and my own readings of the posts and comments to try to draw inferences about the nature of discourse in that community, focussing on how authority is negotiated and contested. In September, I plan to start seeking interviews with members of that community and the wider paleo and nutrition communities in order to better understand what it is that drives people towards believing and practising one set of nutritional precepts over another. These are all reasonably tried-and-tested qualitative methods that provide rich understandings of interactions between people and the dynamics involved in the communities one studies. 

But there’s always been this nagging sensation that I could be doing something different and better. The people who do quantitative methods produce these firm conclusions that they can make inferences from, right? And they always have these cool, pretty graphics that let them display their research in a way that just invites people to click and read. They get to write things that make headlines. Their research has impact. They get to make GIFs showing the most used words in /r/The_Donald in every month of its existence:

Animated GIF: the most used words in /r/The_Donald, month by month

So I wanted to dip my toe in this other academic pool and see if it could resolve some of the anxieties about my research that I’ve been feeling. Perhaps what I’m doing right now is just intrinsically less worthwhile than a different methodology that would allow me to say more things, process more data, be more interesting. 

If you’re reading this in the tone that I’m trying to give off, you’ve likely already concluded that the above is not an accurate reflection of what this research has been like. In reality, just because I’m typing something into a fancy programming interface and using SQL syntax I’d never heard of before Tuesday doesn’t mean that what I’ve been doing has produced any more significant results, or been any better, than the traditional methods I’d been using up until last week.

When looking at words, and trying to understand where they’ve come from and how they’ve spread, being able to query a big database and get a picture of when they have and haven’t appeared is useful. But it doesn’t tell you all that much: you don’t know anything about the context in which the word was used. It doesn’t give you the complete picture, in the same way that getting the metadata about someone’s phone calls gives you an incomplete picture of who they are and what they’ve been doing. That picture can be filled in with more knowledge of that person, in the same way that my picture could be filled in with background knowledge about the events that have driven the rise and spread of the Alt Right. But without that knowledge, and without the ability to theorise, it’s pretty useless. Worse than that, it’s dangerous, because it gives you the false impression that you’re finding interesting things when you might not be.

The dangers of data, even with theory

To pin this down with an example: the other day, I was trying to understand who gets labelled a “cuck” and who gets labelled “based” in The_Donald, the biggest Alt-Right community. A cuck is someone who is emasculated and inferior and so on; someone who is “based” has accepted the dogma of the Alt Right and is generally considered a good egg. You can see my findings below:


Words associated with “cuck” and “based” in The_Donald

You can see that the words that tend to come up with “cuck” are those you might expect: globalists, Macron (this data was from May 2017, when they were quite upset at France’s failure to elect Le Pen), liberals, “betas”, and so on. Likewise, Sean Hannity and Poland are based, as are patriots.

But then I fell afoul of not having the complete picture, because I started to look at the other associations with “based” and saw words associated with rationalism: logic, science, data and so on. That made me assume they were talking about “based science”, “based logic”, etc, and that they just thought that these things were really cool and were fetishising a certain conception of knowledge.

Now, they do fetishise science and data and a certain conception of logic, but they don’t call those things “based”. They were obviously saying things like “making decisions based on the science”, but because I had no context to how the word was being used, my own understanding of what the word was being used for in this context was blinding me to its much more commonplace usage.

This is the danger of powerful tools: they give you false confidence in your results. I could explain the data in front of me with the theory I had, and I could make up an incredibly convincing argument about it. But because I didn’t have the complete picture, I was actually just making things up without even knowing I was making them up.

We’re often warned about the dangers of data without theory: “p-hacking” and other dodgy statistical methods have become commonplace terms in the scientific and academic communities recently. We rarely talk about the dangers of data with theory when that data is incomplete.

I’ve really enjoyed getting to grips with digital tools. I like the way that they can guide me towards new places to explore, and point me in directions I didn’t previously realise were relevant. They certainly produce some really cool pictures. But when it comes to research significance and impact, I no longer feel inferior because I don’t use numbers and software as much as some people. What I do might not be perfect, but I think I’ll stick with it.