Tips for Web Scraping in R and Python

This week I’m sharing some tips for web scraping. The internet is overflowing with information. But it usually takes some work to get the data into a usable form. Here are some tricks for making that happen.

Scraping starts with downloading

Every scrape job starts with a download. When you view a website on your computer, your browser is just displaying data that it downloaded from the website’s server. The first step of scraping is getting this data. Here are a few ways to do it.

Download source code using your browser

In your browser, you can see a site’s source code by right-clicking on the page and selecting ‘View Page Source’. To save the code, copy and paste it into a text file. I often start scrape jobs with this manual approach because it is useful for getting a feel for the site’s code.

Download with wget

As the size of your scrape job increases, you’re going to want to automate things. For automated downloading, I use the command-line tool wget. True to the Unix philosophy, wget does one thing: download stuff. To download the file or page at the address my_url, you’d enter:

wget my_url

As an example, I’m currently writing a post on inflation, and have been frequently downloading price data from the Bureau of Labor Statistics. Their data comes in many forms, but one way to get it is from their plain-text database. Here’s code to download current price data using wget:

wget https://download.bls.gov/pub/time.series/cu/cu.data.0.Current

Download with R

I typically use wget for downloading big databases. For scraping smaller amounts of data, I tend to use R (because that’s where I’ll be working with the data).

In R, the closest equivalent to wget is the function getURL from the RCurl package, which fetches the raw contents of a web address. I tend not to use this function, however, as I find it less reliable than wget. Instead, when I’m in R, I like to download data directly into memory using one of R’s (many) data-reading functions.

The correct tool depends on the format of the data you are downloading. For scraping text data that I’ll need to clean, I use the readLines function. To download data from my_url and save it as the variable page, I’d enter:

page = readLines('my_url')

The nice thing about the readLines function is that it doesn’t care how the data is formatted. It reads every line of a file (or website) into a character vector, which you can then manipulate. That’s really useful for scraping HTML, where you’ll need to clean the data after you download it. But if the data is already in a usable form, there are better approaches.
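If you work in Python rather than R, a rough counterpart to readLines can be sketched with the standard library. The function name read_lines and the UTF-8 default are my own assumptions here, not something from R or from any particular site.

```python
from urllib.request import urlopen


def read_lines(url, encoding="utf-8"):
    """Download `url` and return its contents as a list of lines.

    Hypothetical helper: assumes the page is (roughly) UTF-8 text.
    Undecodable bytes are replaced rather than raising an error.
    """
    with urlopen(url) as response:
        text = response.read().decode(encoding, errors="replace")
    # splitlines() drops the line endings, much like readLines does in R
    return text.splitlines()


# Example call (needs a network connection, so it is left commented out):
# page = read_lines("https://example.com/")
```

Like readLines, this makes no attempt to interpret the content. It just hands back the lines for you to clean.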

My favorite tool for reading already-usable data into R is the fread function from the data.table package. Here, for instance, is code that downloads price data from the Bureau of Labor Statistics:

library(data.table)

prices = fread('https://download.bls.gov/pub/time.series/cu/cu.data.0.Current')

Unlike the readLines function, fread will return the data in a usable format (as a data table). What I like about fread is that it is extremely fast, and also smart. It senses how the data is formatted (comma separated, tab separated, etc.) and reads it accordingly. Base R functions like read.csv don’t do that. (And they are also far slower.)
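To give a feel for what ‘sensing the format’ involves, here is a small Python sketch that guesses the delimiter of a block of text with the standard library’s csv.Sniffer. This is only an illustration of the idea, not how fread works internally, and the sample rows are made up rather than real Bureau of Labor Statistics values.

```python
import csv
import io


def read_delimited(text):
    """Guess the delimiter of `text`, then parse it into a list of dicts."""
    # Let the sniffer choose among a few common separators
    dialect = csv.Sniffer().sniff(text[:2048], delimiters=",\t;|")
    reader = csv.DictReader(io.StringIO(text), dialect=dialect)
    return dialect.delimiter, list(reader)


# Made-up, tab-separated sample data (not real price data)
sample = (
    "series_id\tyear\tperiod\tvalue\n"
    "EXAMPLE01\t2021\tM01\t100.0\n"
    "EXAMPLE01\t2021\tM02\t100.5\n"
)

delimiter, rows = read_delimited(sample)
print(repr(delimiter))   # the detected separator
print(rows[0]["value"])  # first value, still a string at this point
```

Unlike fread, this sketch leaves every field as a string, so you would still need to convert numbers yourself.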

Finding the needle in the haystack

When web data is in a usable format, scraping amounts to little more than downloading. If only web scraping were always this simple.

The problem is that only a tiny fraction of the web’s data is designed with analysis in mind. The majority of the web is designed to be rendered in a browser. The result is that the data you want is a needle in a haystack: a tiny bit of text surrounded by a mass of HTML code. The biggest part of scraping is figuring out how to find the needle in the HTML haystack. For that reason, learning to scrape requires understanding the basics of HTML.

HTML is a markup language that tells your browser how to render a website. When you scrape HTML, you want to reverse engineer this markup. Instead of rendering it, you use the markup to find the data you want.

HTML uses tags to format text. If you’re a web designer, you care what these tags do. (Here’s a list of tags and their associated action.) But as a web scraper, what you care about is that these tags are associated with the data you want.

Suppose, for instance, that we want the text from a website’s top-level heading. In HTML, that’s tagged with h1:

<h1>I am a heading</h1>

The HTML tags <h1> … </h1> tell your browser how to render the text inside. As a web scraper, these tags tell us how to find headings. We search for the text <h1> … </h1>.

Inside of the tag angle brackets, < … >, you will often see various formatting instructions:

<h1 formatting_instructions>I am a heading</h1>

Again, these instructions tell your browser how to render the text. As a scraper, you can use these formatting instructions to home in on certain types of data.

In HTML, web designers can define a ‘class’, and then tell your browser how to render text that is marked with this class. For instance, I could define a heading as an ‘author’ class:

<h1 class="author">I am a heading</h1>

Classes point to certain types of data — here to ‘authors’. That helps us find the needle in the haystack.

The other HTML feature that is helpful for scraping is the ‘id’ attribute, which is typically used for cross-referencing within a document. For instance, the main title on a website might have this id:

<h1 id="main_title">I am a heading</h1>

The key to scraping is figuring out which combination of tags, classes and ids is associated with the data that you want.
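To make the tag/class/id idea concrete, here is a minimal Python sketch built on the standard library’s html.parser. The helper find_text is a name I made up for illustration, and the sample headings are invented. It compares attribute values exactly, so an element marked class="author big" would not match a search for "author".

```python
from html.parser import HTMLParser


class _TextFinder(HTMLParser):
    """Collect the text inside elements with a given tag and attributes."""

    def __init__(self, target_tag, wanted):
        super().__init__()
        self.target_tag = target_tag
        self.wanted = wanted      # attribute name -> required value
        self.depth = 0            # > 0 while inside a matching element
        self.buffer = []
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != self.target_tag:
            return
        if self.depth:
            self.depth += 1       # same tag nested inside a match
        elif all(dict(attrs).get(k) == v for k, v in self.wanted.items()):
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == self.target_tag and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.matches.append("".join(self.buffer).strip())
                self.buffer = []

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def find_text(html, tag, attrs=None):
    """Return the text of every `tag` element whose attributes match `attrs`."""
    finder = _TextFinder(tag, attrs or {})
    finder.feed(html)
    finder.close()
    return finder.matches


# Invented sample markup
html_sample = """
<h1 class="author">I am an author heading</h1>
<h1 id="main_title">I am the main title</h1>
<p class="author">I am not a heading</p>
"""

print(find_text(html_sample, "h1", {"class": "author"}))
print(find_text(html_sample, "h1", {"id": "main_title"}))
```

Searching by tag alone, with find_text(html_sample, "h1"), returns both headings. Adding a class or id narrows the search to the one you want.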

A scraping example using R

As an example, let’s use R to scrape data from my post Redistributing Income Through Hierarchy.

Suppose we are interested in the title of the blog post (‘Redistributing Income Through Hierarchy’). How do we get this text?

Step 1: Inspect the code

We start by making friends with the ‘Inspect’ function on our browser. This function tells you the HTML code behind any element on your screen. To access it, right click on the element of interest (here the title of the post) and select ‘Inspect’. In Firefox, the result looks like this:

The ‘Inspect’ function reveals a wealth of information, most of which we don’t care about. We’re interested in the highlighted code shown below:

Our browser tells us that the blog title is nested between the HTML tags <h1 class="entry-title"> and </h1>. With that information in hand, we’re ready to scrape.

Step 2: Download the page’s source code

Next, we read the page’s source code into R. It doesn’t matter how you do this, but I tend to use the readLines function. We’ll download the code and dump it into the variable page:

page = readLines("https://economicsfromthetopdown.com/2021/10/24/redistributing-income-through-hierarchy/")

The variable page will be a character vector. Each element contains a line of website code. In this case, we don’t care about line breaks, so we’ll collapse the vector into a single string:

page = paste(page, collapse = "")

Now the variable page is one long string of text. It is a giant haystack with a needle — the title of the blog post — hidden somewhere inside. How do we find this needle?

When it comes to web scraping, string search functions are your friend. We know that the blog title is surrounded by the text <h1 class="entry-title"> and </h1>. So we just need to find this text, and pick out what lies inside.

R has many tools for working with strings. The most accessible is probably the stringr package. It contains a function called str_match that matches strings … just what we need.

We’re going to tell str_match to find the text between our two HTML tags. Here’s the code:

library(stringr)
title =  str_match(page, '<h1 class="entry-title">\\s*(.*?)\\s*</h1>')

The code \\s*(.*?)\\s* is a ‘regular expression’ that tells the str_match function to extract and trim the text between the string <h1 class="entry-title"> and the string </h1>.

To be honest, the syntax for regular expressions still baffles me. Fortunately, the internet can answer most questions. (This particular solution comes from Stack Overflow.)

We now have a variable called title that should contain the title of the post. The str_match function returns the full match first, followed by the captured text, so the cleaned title is in the second element, title[2]. Let’s see what lies within:

> title[2]
[1] "Redistributing Income Through Hierarchy"

It worked! We have successfully scraped the title of a blog post!

Admittedly, we could have done the same job (more quickly) by copying the title from our web browser. But the beauty of doing the job with code is that we can apply it repeatedly in a fraction of the time it would take to do by hand.

For reference, here’s the working code:

library(stringr)

page = readLines("https://economicsfromthetopdown.com/2021/10/24/redistributing-income-through-hierarchy/")

page = paste(page, collapse = "")

title =  str_match(page, '<h1 class="entry-title">\\s*(.*?)\\s*</h1>')
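For comparison, here is a sketch of the same search in Python, with the regular expression broken into commented pieces. The sample string is a stand-in that I wrote by hand, not the real page source. Because the pattern uses re.DOTALL, there is no need to collapse the page into one line first.

```python
import re

# Same pattern as the R code. In a raw Python string, a single
# backslash is enough (R needs the doubled \\s).
TITLE_PATTERN = re.compile(
    r'<h1 class="entry-title">'  # the literal opening tag
    r'\s*'                       # skip any whitespace after the tag
    r'(.*?)'                     # capture as little text as possible
    r'\s*'                       # skip any whitespace before the closing tag
    r'</h1>',                    # the literal closing tag
    re.DOTALL,                   # let '.' match across line breaks
)


def extract_title(page):
    """Return the text of the first entry-title heading, or None."""
    match = TITLE_PATTERN.search(page)
    return match.group(1) if match else None


# Hand-written stand-in for the downloaded page
sample_page = (
    '<header>\n'
    '<h1 class="entry-title">\n'
    '  Redistributing Income Through Hierarchy\n'
    '</h1>\n'
    '</header>'
)

print(extract_title(sample_page))
# prints: Redistributing Income Through Hierarchy
```

The parentheses are what make this work: match.group(1) is the captured text, which plays the same role as title[2] in the R version.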

Use sitemaps

In the toy example above, we scraped data from a single page. Usually, though, we want data from multiple pages on a site. To do this more complex job, sitemaps are your friend.

Sitemaps are designed mostly for search engines, but they’re a great resource for finding every page on a site. The sitemap typically lives in a file called sitemap.xml. For my blog, the sitemap lives here:

https://economicsfromthetopdown.com/sitemap.xml

When we visit this page, it tells us that there are actually two sitemaps — one for pages and one for images:

Let’s go to the sitemap for pages, located at https://economicsfromthetopdown.com/sitemap-1.xml. It looks like this:

Here we have a list of URLs for every page on Economics from the Top Down. No human would want to browse this page. But it is a goldmine for text scraping.

Suppose we wanted to scrape data from every page on this blog. We’d start by scraping the URLs from the sitemap. We can do that using the steps from our previous example.

First, we inspect the URLs to see what tags identify them. I find that each URL is surrounded by the <loc> </loc> tags. With that information, we can reuse our code to scrape the URLs from the sitemap.

library(data.table)
library(stringr)

page = readLines("https://economicsfromthetopdown.com/sitemap-1.xml")

page = paste(page, collapse = "")

links = str_match_all(page, '<loc>\\s*(.*?)\\s*</loc>')

links = as.data.table(links)$V2

The only difference here (from the previous example) is that I use str_match_all (rather than str_match) because I expect to find multiple matches. And I extract the final data in a slightly different way by converting to a data.table.

The result is a list of URLs for every page on this blog. If you want to scrape this site, that’s exactly what you need.
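Since a sitemap is XML, another option is to parse it rather than search it as text. Here is a minimal Python sketch using the standard library’s xml.etree.ElementTree. The function name sitemap_urls is hypothetical, and the sample sitemap uses placeholder example.com addresses rather than this blog’s real URLs.

```python
import xml.etree.ElementTree as ET


def sitemap_urls(xml_text):
    """Return the text of every <loc> element in a sitemap."""
    root = ET.fromstring(xml_text)
    urls = []
    for element in root.iter():
        # Parsed tags look like '{namespace}loc', so compare only
        # the part after the closing brace.
        if element.tag.rsplit("}", 1)[-1] == "loc" and element.text:
            urls.append(element.text.strip())
    return urls


# Placeholder sitemap with made-up addresses
sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/first-post/</loc></url>
  <url><loc> https://example.com/second-post/ </loc></url>
</urlset>"""

print(sitemap_urls(sample_sitemap))
```

One caveat: this keeps any element named loc, whatever its namespace. On a sitemap that also lists images, you would get the image addresses too and might need to filter them out.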

Scraping JavaScript sites

So far, we’ve dealt with sites that serve HTML. In this scenario, scraping amounts to downloading the source code and then figuring out how to find the data you want.

Increasingly, however, web scraping requires an extra step. That’s because many sites don’t serve HTML … they send your computer JavaScript, which your browser then renders into HTML. If you try to scrape this type of site with the tools above, you’ll end up with gibberish. (Fortunately, there’s an easy workaround.)

How do you know if a site is serving JavaScript? Click ‘View Page Source’ and look for the <script> tag. If the site serves JavaScript, you’ll find something like this:

<script type="text/javascript">

The other clue that a site serves JavaScript is that none of the site’s content (that you see in your browser) shows up in the source code. That’s because the HTML content is generated by the JavaScript engine in your browser.
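That second clue can be turned into a rough automated check. The Python sketch below takes the raw page source plus a phrase that you can see in your browser, and reports whether the phrase appears in the source outside of script and style blocks. The function name is my own invention and the check is only a heuristic. It will miss text that the source splits across tags in unusual ways.

```python
import html
import re


def visible_text_in_source(source, phrase):
    """Heuristic: does `phrase` appear in the non-script text of `source`?"""
    # Drop <script> and <style> blocks, then strip the remaining tags
    stripped = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", source)
    stripped = re.sub(r"(?s)<[^>]+>", " ", stripped)
    # Decode entities such as &amp; and normalize whitespace
    text = " ".join(html.unescape(stripped).split())
    wanted = " ".join(phrase.split())
    return wanted.lower() in text.lower()


# Two hand-written stand-ins for page source
static_page = "<html><body><h1>Redistributing Income</h1></body></html>"
javascript_page = (
    '<html><body><div id="root"></div>'
    '<script type="text/javascript">render("Redistributing Income")</script>'
    "</body></html>"
)

print(visible_text_in_source(static_page, "Redistributing Income"))      # True
print(visible_text_in_source(javascript_page, "Redistributing Income"))  # False
```

If the check comes back False for text you can plainly see in your browser, that is a hint the content is being generated by JavaScript.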

The problem with JavaScript sites is that the scraping tools described above do not work. They’ll scrape the JavaScript itself, not the site content. Fortunately, there’s an easy solution: get your browser to do the work. The way to scrape JavaScript sites is to automate your browser. You first get your browser to render the JavaScript into HTML. Then you scrape the HTML using the tools described above.

Automate your browser with Selenium

The go-to method for automating your browser is software called Selenium. You can call Selenium using a variety of different programming languages, but I prefer to use Python.

First, the prerequisites. Selenium is going to automate your browser, so you need to have a browser installed. I prefer to use Firefox (see reasons below), but you can also use Chrome.

Second, you need to install a ‘driver’ for your browser. The Firefox driver is called geckodriver. The Chrome driver is called chromedriver. Both consist of a single executable file that you ‘install’ by putting it in a directory on your PATH. On Linux/Unix, /usr/local/bin is a common choice, and /usr/bin also works. (Here are Windows instructions.)

Once the appropriate driver is installed, you can call Selenium from Python. For example, the Python code below uses Firefox to load the New York Times homepage:

# Python code 

from selenium import webdriver

driver = webdriver.Firefox()

driver.get("https://www.nytimes.com/")

When you run this code, Firefox will open and load the New York Times homepage. That’s cool … but you could easily do the same thing without code. However, when you automate your browser, you can tell it to send the HTML it rendered (from JavaScript) back to you. In Python, you do that with the following code:

from selenium.webdriver.common.by import By  # Selenium 4 locator syntax

page = driver.find_element(By.TAG_NAME, 'html').get_attribute('innerHTML')

The page variable now contains the HTML content for the New York Times homepage. It’s just a string of text that you can scrape using the techniques discussed above. (Note that even on JavaScript sites, the browser ‘Inspect’ function still works, so you can easily find the HTML tags that locate your data.)

When you’re done, close your browser with this command:

driver.quit()

If your scrape job is fairly data intensive, consider running your browser in headless mode (without a user interface). That will make the job faster and less resource intensive.
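Here is a sketch of what headless mode might look like with Firefox. It assumes Selenium 4 and a working geckodriver, and the function name render_html is hypothetical. It uses driver.page_source, which is Selenium’s built-in way to get the current page’s HTML.

```python
def render_html(url):
    """Load `url` in headless Firefox and return the rendered HTML."""
    # Imported inside the function so that this file still loads on
    # machines where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")  # run Firefox without a window

    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails
        driver.quit()


# Example call (needs Firefox, geckodriver and a network connection):
# page = render_html("https://www.nytimes.com/")
```

Wrapping the work in try/finally means a failed page load won’t leave orphaned browser processes behind, which matters when you’re looping over many pages.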

Which browser should I automate?

I’ve run Selenium with both Chrome and Firefox. For small jobs, both work fine. However, I’ve recently encountered a scenario where Firefox is far superior: when you’re scraping a site that is ‘beshitted’ with adware.

My colleague DT Cochrane and I have recently started scraping a site that uses (admittedly important) financial data as an excuse to bombard the reader with ads. To scrape the site, you need to waste a huge amount of resources just rendering the ads, after which you can scrape a few kilobytes of data.

For ‘beshitted’ sites like this, I think Firefox is clearly superior to Chrome. Although I’m no browser expert, I suspect the reason is that Chrome is made by an ad-tech giant (Google). Chrome basically exists for Google to suck your data, and for you to see ads that Google serves. Possibly for this reason, I find that Chrome crashes when scraping ad-tech beshitted sites. Firefox, on the other hand, handles the ad-tech with ease.

Start easy

If you’re new to web scraping, I recommend starting on easy sites — sites like Wikipedia that exist purely to convey information to users. Their code is usually wonderfully simple. As a rule, the more commercial the site, the more treacherous the code.

Also, remember that scraping puts a burden on a website’s server. Yes, you want their data. But you don’t want to crash their server. (Here’s the arXiv’s humorous warning to robots.) Also remember that if you scrape a site too heavily, its maintainers may put up a CAPTCHA to thwart your efforts. (I’ve had that happen to me.)
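One way to go easy on a server is to respect its robots.txt rules and pause between requests. The Python sketch below shows the idea using the standard library. The function names, the five-second default and the sample rules are all assumptions of mine. The fetch argument stands for whatever download function you are using.

```python
import time
from urllib.robotparser import RobotFileParser


def allowed_urls(urls, robots_lines, user_agent="*"):
    """Keep only the URLs that the given robots.txt lines permit."""
    rules = RobotFileParser()
    rules.parse(robots_lines)
    return [url for url in urls if rules.can_fetch(user_agent, url)]


def scrape_politely(urls, fetch, delay_seconds=5.0):
    """Call fetch(url) for each URL, pausing between requests."""
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # give the server a break
        pages[url] = fetch(url)
    return pages


# Made-up rules and addresses for illustration
sample_robots = ["User-agent: *", "Disallow: /private/"]
sample_urls = [
    "https://example.com/post/",
    "https://example.com/private/data",
]

print(allowed_urls(sample_urls, sample_robots))
```

How long to wait is a judgment call. If a site’s robots.txt or terms of use say anything about request rates, that is the number to follow.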

Happy scraping!



This work is licensed under a Creative Commons Attribution 4.0 License. You can use/share it any way you want, provided you attribute it to me (Blair Fix) and link to Economics from the Top Down.


[Cover image: Markus Spiske]

One comment

  1. Hi Blair, I will do a little ‘muckraking’ myself and object to you using the phrase “HTML code” when you should say “HTML markup” or “HTML document”! Why am I a pedant on this subject? Well, because I was once a great fan of ‘XML Technologies’ and the work of the World Wide Web Consortium ‘W3C’ to develop them. I tried hard to build a career around them in fact.

    Why do I care about this? Well, the vision of the W3C was to make the web a giant database, and that vision of a ‘Semantic Web’ (of data and metadata) was never achieved. Instead, various commercial interests took control and real ‘code’, in the form of JavaScript, and more particularly data in the format of JavaScript Object Notation (JSON) came to dominate over XML for data exchange. Google made the web more searchable and became the gatekeeper, where the vision of W3C was much more democratic and free-wheeling (like the ‘good old days’).

    If anything HTML is data and there are document databases that use the W3C XML Technologies as their basis (MarkLogic, eXistDB, BaseX), particularly XQuery, whose development was an enormous project involving the leading database vendors, and which was considered a potential replacement for SQL (have a look at the XQuery FLWOR expression). Though nowadays these databases go under the strangely named category of ‘NoSQL’ (Not only SQL) and their significance is lost again. As a language for manipulation of data in any format XQuery is hard to beat. There is a JSON flavour of XQuery too called JSONiq (https://www.jsoniq.org/) and I recently learned of a big-data project using it called RumbleDB (https://www.rumbledb.org/).

    As a document format/grammar XML is actually a dumbed down version of SGML, which is worth knowing about even if just for historical reasons (invented by a lawyer at IBM) and still used.

    But R is good too, and I am not expecting you to switch to XQuery, but thinking of HTML as data would be good. There is a thing called data-driven programming; a W3C standard called XForms was essentially that: declarative logic in web forms. JavaScript and Google pushed it aside.
