I write about my explorations in AI and other quaintitative areas.
For more about me and my other interests, visit playgrd, quaintitative or socials below
There are many ways to access the rich data available in Reddit. You could scrape, or you could use the data that has been kindly made available by Pushshift. The paper here will provide more details on the background of this data.
To get the data from Pushshift, you will first need to install two libraries, praw and pmaw.
pip install praw
pip install pmaw
The steps thereafter are simple, but you need to note that each subreddit in reddit has submissions, as well as the associated comments. You cannot fetch both at the same time, but you can get all the submissions first in a subreddit for a date, and then use the comment ids in the submissions to fetch the comments. So 2 steps. We show the first step here first.
So we first initialize the api object, and then set the dates we want to fetch the submissions for.
api = PushshiftAPI()
subreddit = 'wallstreetbets'
limit = 1000000
comment_threshold = 10
start_date = dt.datetime(2020,1,1) # year, month, day
end_date = dt.datetime(2020,1,5) # year, month, day
# Use this if you want to count from today and get say the last 30 days of submissions
# end_date = dt.datetime.today()
# timespan = dt.timedelta(days=30)
# start_date = end_date - timespan
# print(start_date, '|', end_date)
after = int(start_date.timestamp()) # subs after this date, i.e. start
before = int(end_date.timestamp()) # subs before this date, i.e. end
# print(after, before)
Then we can call the search_submissions
function in the api to get the submissions.
submissions = api.search_submissions(subreddit=subreddit, limit=limit, before=before, after=after) # get subs
The second set of steps is a bit more tricky. To get the comment ids for a particular submission id, you will need to use the endpoint instead.
api_endpt = f'https://api.pushshift.io/reddit/submission/comment_ids/{submission_id}'
response = requests.get(api_endpt)
comment_ids = list(response.json()['data'])
Then you can use these ids to fetch the comments with search_comments
function in the api.
comments = api.search_comments(ids=comment_ids)
The notebook here has the full code.
And that’s it. Happy data explorations!