This week I’m sharing some tips for web scraping. The internet is overflowing with information. But it usually takes some work to get the data into a usable form. Here are some tricks for making that happen.
Scraping starts with downloading
Every scrape job starts with a download. When you view a website on your computer, your browser is just displaying data that it downloaded from the website’s server. The first step of scraping is getting this data. Here are a few ways to do it.
Download source code using your browser
In your browser, you can see a site’s source code by right-clicking and selecting ‘View Page Source’. To save the code, copy and paste it into a text file. I often start scrape jobs with this manual approach because it is useful for getting a feel for the site’s code.
Download with wget
As the size of your scrape job increases, you’re going to want to automate things. For automated downloading, I use the Linux command-line tool wget. True to the Unix philosophy, wget does one thing: download stuff. To download all the data from the site my_url, you’d enter:

wget -r my_url

(The -r flag tells wget to download recursively, following the site’s links.)
As an example, I’m currently writing a post on inflation, and have been frequently downloading price data from the Bureau of Labor Statistics. Their data comes in many forms, but one way to get it is from their plain-text database. Here’s code to download current price data using wget:

wget https://download.bls.gov/pub/time.series/cu/cu.data.0.Current
Download with R
I typically use wget for downloading big databases. For scraping smaller amounts of data, I tend to use R (because that’s where I’ll be working with the data).

In R, the closest equivalent to wget is the function getURL (from the RCurl package), which fetches web data. I tend not to use this function, however, as I find it less reliable than wget. Instead, when I’m in R, I like to download data directly into memory. You can do that using one of R’s (many) data-reading functions.
The correct tool depends on the format of the data you are downloading. For scraping text data that I’ll need to clean, I use the readLines function. To download data from my_url and save it as the variable page, I’d enter:
page = readLines('my_url')
The nice thing about the readLines function is that it doesn’t care how the data is formatted. It reads every line of a file (or website) into a character vector, which you can then manipulate. That’s really useful for scraping HTML, where you’ll need to clean the data after you download it. But if the data is already in a usable form, there are better approaches.
My favorite tool for reading already-usable data into R is the fread function from the data.table package. Here, for instance, is code that downloads price data from the Bureau of Labor Statistics:
library(data.table)
prices = fread('https://download.bls.gov/pub/time.series/cu/cu.data.0.Current')
fread will return the data in a usable format (as a data table). What I like about fread is that it is extremely fast, and also smart. It senses how the data is formatted (comma separated, tab separated, etc.) and reads it accordingly. Base R functions like read.csv don’t do that. (And they are also far slower.)
Finding the needle in the haystack
When web data is in a usable format, scraping amounts to little more than downloading. If only web scraping were always this simple.
The problem is that only a tiny fraction of the web’s data is designed with analysis in mind. The majority of the web is designed to be rendered in a browser. The result is that the data you want is a needle in a haystack — a tiny bit of text surrounded by a mass of HTML code. The biggest part of scraping is figuring out how to find the needle in the HTML haystack. For that reason, learning to scrape requires understanding the basics of HTML.
HTML is a markup language that tells your browser how to render a website. When you scrape HTML, you want to reverse engineer this markup. Instead of rendering it, you use the markup to find the data you want.
HTML uses tags to format text. If you’re a web designer, you care what these tags do. (Here’s a list of tags and their associated actions.) But as a web scraper, what you care about is that these tags are associated with the data you want.
Suppose, for instance, that we want the text from a website’s top-level heading. In HTML, that’s tagged with <h1>:
<h1>I am a heading</h1>
The HTML tags <h1> … </h1> tell your browser how to render the text inside. As web scrapers, these same tags tell us how to find headings: we search for the text <h1> … </h1>.
Inside of the tag angle brackets, < … >, you will often see various formatting instructions:
<h1 formatting_instructions>I am a heading</h1>
Again, these instructions tell your browser how to render the text. As a scraper, these formatting instructions let you home in on certain types of data.
In HTML, web designers can define a ‘class’, and then tell your browser how to render text that is marked with this class. For instance, I could define a heading as an ‘author’ class:
<h1 class="author">I am a heading</h1>
Classes point to certain types of data — here to ‘authors’. That helps us find the needle in the haystack.
The other HTML element that is helpful for scraping is the ‘id’ attribute, which is typically used for cross-referencing within a document. For instance, the main title on a website might have this id:
<h1 id="main_title">I am a heading</h1>
The key to scraping is figuring out which combination of tags/classes/ids are associated with the data that you want.
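To make this concrete, here’s a minimal sketch in Python using the standard-library re module. The HTML snippet and the ‘author’ class heading are invented for illustration:

```python
import re

# A made-up snippet of HTML with three headings.
html = '''
<h1 id="main_title">I am a heading</h1>
<h1 class="author">Blair Fix</h1>
<h1>Another heading</h1>
'''

# Match only the heading marked with the 'author' class.
authors = re.findall(r'<h1 class="author">\s*(.*?)\s*</h1>', html)
print(authors)  # ['Blair Fix']
```

Searching for the <h1> tag alone would match all three headings. Adding the class narrows the search to the data we care about.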
A scraping example using R
As an example, let’s use R to scrape data from my post Redistributing Income Through Hierarchy.
Suppose we are interested in the title of the blog post (‘Redistributing Income Through Hierarchy’). How do we get this text?
Step 1: Inspect the code
We start by making friends with the ‘Inspect’ function in our browser. This function shows you the HTML code behind any element on your screen. To access it, right-click on the element of interest (here the title of the post) and select ‘Inspect’. In Firefox, the result looks like this:
The ‘Inspect’ function reveals a wealth of information, most of which we don’t care about. We’re interested in the highlighted code shown below:
Our browser tells us that the blog title is nested between the HTML tags <h1 class="entry-title"> and </h1>. With that information in hand, we’re ready to scrape.
Step 2: Download the page’s source code
Next, we read the page’s source code into R. It doesn’t matter how you do this, but I tend to use the readLines function. We’ll download the code and dump it into the variable page:
page = readLines("https://economicsfromthetopdown.com/2021/10/24/redistributing-income-through-hierarchy/")
page will be a character vector. Each element contains a line of website code. In this case, we don’t care about line breaks, so we’ll collapse the vector into a single string:
page = paste(page, collapse = "")
Now the variable page is one long string of text. It is a giant haystack with a needle — the title of the blog post — hidden somewhere inside. How do we find this needle?
Step 3: Find your data with a string search
When it comes to web scraping, string-search functions are your friend. We know that the blog title is surrounded by the text <h1 class="entry-title"> and </h1>. So we just need to find this text and pick out what lies inside.
R has many tools for working with strings. The most accessible is probably the stringr package. It contains a function called str_match that matches strings … just what we need.

We’re going to tell str_match to find the text between our two HTML tags. Here’s the code:
library(stringr)
title = str_match(page, '<h1 class="entry-title">\\s*(.*?)\\s*</h1>')
Here, \\s*(.*?)\\s* is a ‘regular expression’ that tells the str_match function to extract and trim the text between the string <h1 class="entry-title"> and the string </h1>.

To be honest, the syntax for regular expressions still baffles me. Fortunately, the internet can answer most questions. (This particular solution comes from Stack Overflow.)
We now have a variable called title that should contain the title of the post. Because str_match returns both the full match and the captured text, the cleaned title sits in the second element, title[2]. Let’s see what lies within:

> title[2]
[1] "Redistributing Income Through Hierarchy"
It worked! We have successfully scraped the title of a blog post!
Admittedly, we could have done the same job (more quickly) by copying the title from our web browser. But the beauty of doing the job with code is that we can apply it repeatedly in a fraction of the time it would take to do by hand.
For reference, here’s the working code:
library(stringr)

page = readLines("https://economicsfromthetopdown.com/2021/10/24/redistributing-income-through-hierarchy/")
page = paste(page, collapse = "")
title = str_match(page, '<h1 class="entry-title">\\s*(.*?)\\s*</h1>')
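If you work in Python rather than R, the same string search can be sketched with the standard-library re module. (The page variable below is a stand-in string, since downloading is beside the point here.)

```python
import re

# Stand-in for the downloaded source code, collapsed to one string.
page = '<html><h1 class="entry-title">Redistributing Income Through Hierarchy</h1></html>'

# Same regular expression idea: capture the text between the tags,
# trimming any surrounding whitespace.
match = re.search(r'<h1 class="entry-title">\s*(.*?)\s*</h1>', page)
print(match.group(1))  # Redistributing Income Through Hierarchy
```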
In the toy example above, we scraped data from a single page. Usually, though, we want data from multiple pages on a site. To do this more complex job, sitemaps are your friend.
Sitemaps are designed mostly for search engines, but they’re a great resource for finding every page on a site. The sitemap typically lives in a file called sitemap.xml. For my blog, the sitemap lives here:

https://economicsfromthetopdown.com/sitemap.xml
When we visit this page, it tells us that there are actually two sitemaps — one for pages and one for images:
Let’s go to the sitemap for pages, located at https://economicsfromthetopdown.com/sitemap-1.xml. It looks like this:
Here we have a list of URLs for every page on Economics from the Top Down. No human would want to browse this page. But it is a goldmine for text scraping.
Suppose we wanted to scrape data from every page on this blog. We’d start by scraping the URLs from the sitemap. We can do that using the steps from our previous example.
First, we inspect the URLs to see what tags identify them. I find that each URL is surrounded by the <loc> … </loc> tags. With that information, we can reuse our code to scrape the URLs from the sitemap:
library(data.table)
library(stringr)

page = readLines("https://economicsfromthetopdown.com/sitemap-1.xml")
page = paste(page, collapse = "")

links = str_match_all(page, '<loc>\\s*(.*?)\\s*</loc>')
links = as.data.table(links)$V2
The only difference here (from the previous example) is that I use str_match_all (rather than str_match) because I expect to find multiple matches. And I extract the final data in a slightly different way, by converting the matches to a data table and keeping the second column (V2), which holds the captured text.

The result is a list of URLs for every page on this blog. If you want to scrape this site, that’s exactly what you need.
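The same <loc> search is easy to sketch in Python, again with re. The sitemap fragment below is a made-up stand-in with the same shape as sitemap-1.xml:

```python
import re

# A made-up sitemap fragment in the same shape as sitemap-1.xml.
sitemap = '''
<urlset>
  <url><loc>https://economicsfromthetopdown.com/post-one/</loc></url>
  <url><loc>https://economicsfromthetopdown.com/post-two/</loc></url>
</urlset>
'''

# findall returns every captured URL, one per <loc> tag.
links = re.findall(r'<loc>\s*(.*?)\s*</loc>', sitemap)
print(links)
```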
So far, we’ve dealt with sites that serve HTML. In this scenario, scraping amounts to downloading the source code and then figuring out how to find the data you want. But some sites render their data with JavaScript, meaning the data you want never appears in the raw source code. To scrape these sites, you need to automate a browser.
Automate your browser with Selenium
The go-to method for automating your browser is software called Selenium. You can call Selenium using a variety of different programming languages, but I prefer to use Python.
First, the prerequisites. Selenium is going to automate your browser, so you need to have a browser installed. I prefer to use Firefox (see reasons below), but you can also use Chrome.
Second, you need to install a ‘driver’ for your browser. The Firefox driver is called geckodriver. The Chrome driver is called chromedriver. Both consist of a single executable file that you ‘install’ by putting it in your PATH directory. On Linux/Unix, that’s the directory /usr/bin. (Here are Windows instructions.)
Once the appropriate driver is installed, you can call Selenium from Python. For example, the Python code below uses Firefox to load the New York Times homepage:
# Python code
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://www.nytimes.com/")
Once the page has loaded, you can grab its source code:
from selenium.webdriver.common.by import By

page = driver.find_element(By.TAG_NAME, 'html').get_attribute('innerHTML')
When you’re done, close your browser with this command:

driver.quit()
If your scrape job is fairly data intensive, consider running your browser in headless mode (without a user interface). That will make the job faster and less resource intensive.
Which browser should I automate?
I’ve run Selenium with both Chrome and Firefox. For small jobs, both work fine. However, I’ve recently encountered a scenario where Firefox is far superior: when you’re scraping a site that is ‘beshitted’ with adware.
My colleague DT Cochrane and I have recently started scraping a site that uses (admittedly important) financial data as an excuse to bombard the reader with ads. To scrape the site, you need to waste a huge amount of resources just rendering the ads, after which you can scrape a few kilobytes of data.
For ‘beshitted’ sites like this, I think Firefox is clearly superior to Chrome. Although I’m no browser expert, I suspect the reason is that Chrome is made by an ad-tech giant (Google). Chrome basically exists for Google to suck your data, and for you to see ads that Google serves. Possibly for this reason, I find that Chrome crashes when scraping ad-tech beshitted sites. Firefox, on the other hand, handles the ad-tech with ease.
If you’re new to web scraping, I recommend starting on easy sites — sites like Wikipedia that exist purely to convey information to users. Their code is usually wonderfully simple. As a rule, the more commercial the site, the more treacherous the code.
Also, remember that scraping puts a burden on a website’s server. Yes, you want their data. But you don’t want to crash their server. (Here’s the arXiv’s humorous warning to robots.) Also remember that if you scrape a site too heavily, its maintainers may put up a CAPTCHA to thwart your efforts. (I’ve had that happen to me.)
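One simple courtesy is to pause between requests. Here’s a rough sketch of that idea in Python. The fetch argument is whatever downloader you’re using (it’s passed in here so the pacing logic stands alone); scrape_politely is a made-up helper name:

```python
import time

def scrape_politely(urls, fetch, delay=1.0):
    """Fetch each URL in turn, sleeping between requests
    so we don't hammer the site's server."""
    pages = []
    for url in urls:
        pages.append(fetch(url))
        time.sleep(delay)  # be kind to the server
    return pages
```

For instance, fetch could be urllib.request.urlopen, and a one-second delay is a reasonable default for small jobs.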
Support this blog
Economics from the Top Down is where I share my ideas for how to create a better economics. If you liked this post, consider becoming a patron. You’ll help me continue my research, and continue to share it with readers like you.
This work is licensed under a Creative Commons Attribution 4.0 License. You can use/share it anyway you want, provided you attribute it to me (Blair Fix) and link to Economics from the Top Down.