Rio McMahon

yolo-scraper

TL;DR

I made a Python Reddit scraper that crawls r/WallStreetBets posts using praw and the pushshift.io API (via psaw). The GitHub repo is located here.


r/WallStreetBets is going totally nuts over $GME. Fundamentals have been thrown out the window and the masses are out for Gabriel Plotkin’s blood. The financial media are losing their minds. Not one to want to miss out on the feeding frenzy but also not one to (totally) blindly hop on the gambling bandwagon, I wanted to collect some data about r/WSB first. Enter yolo-scraper.

The main idea is that we totally ignore market fundamentals and acknowledge that r/WSB is a big rocket-shipyard where lots of stocks are waiting dormant to take off to the moon. Not all rockets will launch and even fewer will reach escape velocity. Instead of being powered through some logical means (profitability, growth potential, etc.), these stock valuations are based purely on hype. The more r/WSB denizens hype up a stock, the more it is fueled, and the more likely a moonshot becomes. $GME is an excellent example of this:

Houston, we have a problem (with efficient-market hypothesis)

If only it were possible to develop some quantitative way to analyze this “hype” fuel and predict whether or not a “stock-rocket” will be a true moonshot. If it were, a savvy gambler “investor” could buy early, wait for the rocket to launch, then reap the rewards. This is the idea behind yolo-scraper. The general thought process is:

  1. Scrape r/WSB to capture time series data about number of mentions of a specific stock symbol.
  2. Analyze this data for previous trends of “moonshot” stocks
  3. Attempt to identify “still fueling” stocks before they launch
  4. ???
  5. Profit

Scraper Architecture

Turns out Reddit is super easy to scrape and has a really well-documented Python API wrapper called praw. Unfortunately, due to a 2017 change it is really difficult to search posts by date, and due to Reddit API constraints only a few (~1000) results come back from a bulk search. This means scraping historical data takes a little bit of legwork, since you can't specify date ranges in the API request.

Luckily a really nice guy put together pushshift.io, a robust third-party database of lots of Reddit activity that is searchable by date. Unfortunately for us, it collects posts immediately after submission, so it is difficult to get up-to-date information on number of upvotes, etc.
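For instance, grabbing submission IDs for a particular date window with psaw alone might look something like this (a quick sketch; the dates, limit, and field filter are arbitrary):

    import datetime as dt
    from psaw import PushshiftAPI

    # Pushshift by itself lets us query by date range, which the Reddit API can't do directly.
    ps_api = PushshiftAPI()

    start = int(dt.datetime(2021, 1, 25).timestamp())
    end = int(dt.datetime(2021, 1, 26).timestamp())

    submissions = ps_api.search_submissions(after=start,
                                            before=end,
                                            subreddit='wallstreetbets',
                                            filter=['id', 'created_utc'],
                                            limit=100)
    ids = [s.id for s in submissions]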

Used in conjunction, though, the two work nicely: the Pushshift API returns post IDs in a specified date range, and then a Reddit API call based on those IDs gets us current statistics on upvotes and the like.

    # logging and pandas are imported here because they're used further down in the class
    import logging

    import pandas as pd
    import praw
    from psaw import PushshiftAPI

    # ...

    # setup API stuff
    client_id, client_secret, user_agent = credentials
    self.r_api = praw.Reddit(client_id=client_id,
                             client_secret=client_secret,
                             user_agent=user_agent)
    self.ps_api = PushshiftAPI(self.r_api)

You’ll notice that we need some credentials. The way that Reddit handles web scraping (and other features such as bots) is that you create an application associated with your user account and then access that app/bot via credentials.

Building the Reddit Bot

This is a very straightforward process that only requires a Reddit account. A lot of these instructions were adapted from/inspired by the post linked in footnote 1.

  1. Create or log into your Reddit account.
  2. Navigate to reddit.com/prefs/apps
  3. Click “create another app…” and select the “script” radio button
Create an app like this

Get your credentials from the app

  4. Now that we have credentials for our Reddit scraper, we can input them into the config/config.ini file that yolo-scraper reads from. Instructions on how to do this are in the README of the yolo-scraper repository.
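For reference, reading those credentials back out in Python might look roughly like the sketch below. The exact section and key names here are my guess; the repo's README documents the real format.

    import configparser

    # Rough sketch: read Reddit credentials out of config/config.ini.
    # The file might contain something like:
    #   [reddit]
    #   client_id = YOUR_CLIENT_ID
    #   client_secret = YOUR_CLIENT_SECRET
    #   user_agent = yolo-scraper by u/your_username
    config = configparser.ConfigParser()
    config.read('config/config.ini')

    credentials = (config['reddit']['client_id'],
                   config['reddit']['client_secret'],
                   config['reddit']['user_agent'])

These credentials are what get unpacked in the setup snippet earlier.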

Making an API Call

Now that the credentials are set up, we can access Reddit:

    def API_Call(self):
        # get ids within date range from Pushshift API, we do this first because we can sort by date
        results = list(self.ps_api.search_submissions(before=self.time_curr,
                                                      subreddit='wallstreetbets',
                                                      limit=100))

        # next, we take the ids we parsed and make a reddit api call so we have updated info on scores, etc
        # this is because Pushshift gets posts at time of posting but doesn't track them afterwards (I think?)
        
        self.time_curr = int(results[-1].created_utc)  # update timestamp with oldest call for next time
        result_dict = {
                       'id':[],
                       'created_utc':[],
                       'score':[],
                       'upvoteratio':[],
                       'author':[],
                       'title':[],
                       'selftext':[],
                      }

        # speed this up with batch ID stuff via
        #   https://www.reddit.com/r/redditdev/comments/eisdgs/praw_faster_way_to_fetch_posts_by_id/
        # note there is some FU formatting in the name via
        #   https://www.reddit.com/r/redditdev/comments/gvlg6q/any_way_to_batch_fetch_commentsposts_by_id_in_praw/fspk3at?utm_source=share&utm_medium=web2x&context=3
        ids = [result.id for result in results] 
        creation_utc_list = [result.created_utc for result in results]
        results = [idx if idx.startswith('t3_') else f't3_{idx}' for idx in ids]
        print("calling ", results[:3], "...")
    
        for results_id, submission in enumerate(self.r_api.info(results)):

            # unpack information into result_dict
            print("processing ", submission.id)
            result_dict['id'].append(submission.id)
            result_dict['created_utc'].append(creation_utc_list[results_id])  # pushshift has this info, not praw
            result_dict['score'].append(submission.score)
            result_dict['upvoteratio'].append(submission.upvote_ratio)
            result_dict['author'].append(submission.author)
            result_dict['title'].append(submission.title)
            result_dict['selftext'].append(submission.selftext)
                
        result_df = pd.DataFrame(data=result_dict)
        print("Saving dataframe...\n")
        logging.info("Saving dataframe...\n")
        self.SaveDF(result_df)
        self.CheckDone()

I did this all in an API class that I built, which you can check out in the GitHub repo if you care. The big picture is that:

  1. An API call is made to the Pushshift.io API (via ps_api.search_submissions(...)). This returns a list of post IDs based on the search parameters.
  2. This list of post IDs is then sent over in a batch to the Reddit API via praw, and we get up-to-date information about the posts, still sorted by date.
  3. Once this current call to Reddit is done, the results are converted into a pandas DataFrame and appended to a CSV file.

This process is repeated recursively until the time frame specified in the config/config.ini file has been covered.
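Condensed outside of the class, the loop looks roughly like this. It is a sketch of the same pattern rather than the repo's actual code; the credentials, CSV path, and date cutoff are placeholders.

    import datetime as dt
    import os
    import time

    import pandas as pd
    import praw
    from psaw import PushshiftAPI

    # Placeholder credentials -- fill in from config/config.ini
    reddit = praw.Reddit(client_id='...', client_secret='...', user_agent='yolo-scraper sketch')
    ps_api = PushshiftAPI(reddit)

    stop_utc = int(dt.datetime(2020, 12, 1).timestamp())  # how far back to scrape
    time_curr = int(time.time())                          # cursor walks backwards from "now"
    csv_path = 'wsb_posts.csv'

    while time_curr > stop_utc:
        # 1. Pushshift: the 100 newest submissions older than the current cursor
        results = list(ps_api.search_submissions(before=time_curr,
                                                 subreddit='wallstreetbets',
                                                 limit=100))
        if not results:
            break
        time_curr = int(results[-1].created_utc)  # oldest post in the batch becomes the new cursor

        # 2. Reddit: batch-fetch current stats via fullnames (t3_ prefix for submissions)
        fullnames = []
        created_by_id = {}
        for r in results:
            base_id = r.id[3:] if r.id.startswith('t3_') else r.id
            fullnames.append(f't3_{base_id}')
            created_by_id[base_id] = int(r.created_utc)

        rows = [{'id': s.id,
                 'created_utc': created_by_id[s.id],
                 'score': s.score,
                 'upvoteratio': s.upvote_ratio,
                 'author': str(s.author),
                 'title': s.title,
                 'selftext': s.selftext}
                for s in reddit.info(fullnames)]

        # 3. Append this batch to the running CSV
        pd.DataFrame(rows).to_csv(csv_path, mode='a',
                                  header=not os.path.exists(csv_path), index=False)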

Running the Scraper

Because we need the date-search functionality, we have to use two APIs. This slows the process down, and on top of that Reddit API requests are limited to sizes of ~1000 and something like 60-120 requests per minute (I've seen different numbers online; praw handles this internally I think, and I haven't run into any limits).

As a matter of convenience I am running this script on my NVIDIA Jetson Nano, since it is easy to SSH into, set, and forget. There is a ton of post history (r/WSB has over 2 million members and is growing extremely quickly), so scraping is a pretty slow process and letting the Nano chug through it is very convenient.

Processing the Data

Used in conjunction with ryssdal.jl, the scraped data makes it possible to compare stock price trends with the number of mentions on r/WSB; this is how the plot at the top of the page was created.
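For the Python side of that comparison, a rough sketch of counting daily mentions of a ticker in the scraped CSV might look like the following (the CSV path and the $GME pattern are just examples; the column names match the scraper output above):

    import pandas as pd

    # Count how many times $GME comes up in r/WSB titles and bodies each day.
    df = pd.read_csv('wsb_posts.csv')
    df['date'] = pd.to_datetime(df['created_utc'], unit='s').dt.date

    text = df['title'].fillna('') + ' ' + df['selftext'].fillna('')
    df['gme_mentions'] = text.str.count(r'\$?GME\b')

    daily_mentions = df.groupby('date')['gme_mentions'].sum()
    print(daily_mentions.tail())

A daily series like this can then be lined up against the price history pulled via ryssdal.jl.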