Last week I ran a Twitter survey to see what software my fellow researchers use. It turns out they like R:
As an avid R user myself, this result didn’t surprise me. But it did make me think about my own approach to coding. In this post I’m going to share the tools that I use for my research.
To get a sense for my coding habits, I wrote a script to scan all of my code and break it down by language. Here’s the result:
As you can see, I use R a lot. It’s my go-to language for most analysis. But sprinkled in the mix are C++, Linux shell, and the odd Python script. Here’s how and why I use each language.
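For readers curious about the language breakdown mentioned above, here is a minimal sketch of how such a scan could work. It is not the script used for this post: the extension-to-language mapping and the function names `lines_by_language` and `shares` are assumptions made for illustration, and it counts non-blank lines only.

```python
from collections import Counter
from pathlib import Path

# Hypothetical mapping from file extension to language; adjust to taste.
EXTENSIONS = {
    ".R": "R", ".r": "R",
    ".cpp": "C++", ".h": "C++",
    ".sh": "Shell",
    ".py": "Python",
}

def lines_by_language(root):
    """Count non-blank lines per language for all files under `root`."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        language = EXTENSIONS.get(path.suffix)
        if language is None or not path.is_file():
            continue  # skip directories and unrecognized file types
        text = path.read_text(encoding="utf-8", errors="ignore")
        counts[language] += sum(1 for line in text.splitlines() if line.strip())
    return counts

def shares(counts):
    """Convert raw line counts into fractions of the total."""
    total = sum(counts.values())
    return {language: n / total for language, n in counts.items()} if total else {}
```

Calling `shares(lines_by_language("path/to/code"))` on a folder of projects would return each language's share of the total, which is the kind of number a breakdown chart needs.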
R for data analysis and plotting
R is my workhorse language — the one I turn to first when I need to run analysis. R has a special place in my heart because it’s the language that got me excited about coding.
I was first introduced to computer programming way back in 2001 when I was an engineering student at the University of Alberta. I hated the programming course … so much so that after the course finished I vowed never to write code again. (I made good on that promise for more than a decade.)
In hindsight, I understand what went wrong. First, the course was taught using Turbo Pascal, which is (and was at the time) a dying language not worth knowing. Second, I was not prepared for the nightmare that is code debugging. We had to write, debug, and submit code within a one-hour lab. At the end of the hour, I frequently found myself swearing because my code wouldn’t run. Today, I still find debugging frustrating. But with plenty of time to do it, the process is tolerable. Third (and most importantly), I had no goal. I realize now that I’m what you might call a ‘goal-oriented’ programmer. I don’t like coding for its own sake. I like using it to answer questions that puzzle me.
Fast forward to 2015. I was a grad student at York University who was learning (at Jonathan Nitzan’s urging) how to work with the Compustat database. I wanted to know how the employment concentration of the largest US firms had changed with time. (Here’s the answer.)
At the time, I was a devoted spreadsheet user and was trying to analyze the Compustat data using LibreOffice Calc (an open-source spreadsheet program). It was a nightmare. Compustat had hundreds of thousands of lines of data, and my spreadsheet kept crashing every time I ran the analysis. Frustrated, I begrudgingly decided to learn a more heavy-duty statistical language. I chose R.
I fully expected a return to the teeth gritting of my Turbo Pascal days. But to my surprise, the frustration never came. I thoroughly enjoyed using R, and was blown away by how much easier it made my analysis. The biggest difference was probably just me: this time around I had a programming goal. That said, there are a host of things that make R more user friendly than the programming languages of old.
Let’s start with the basics. R is what’s called an ‘interpreted’ language. That means each line of code gets interpreted by the computer on a just-in-time basis. The benefit is that this lets you run code line by line. When you’re doing empirical work, that means you can inspect each stage of the analysis — something that’s essential for two reasons. First, you want to get a sense for the data and how it’s being transformed. Second, you’re inevitably going to make errors and want to catch them as you go. Seeing your results line by line helps you catch these mistakes.
At its root, R is a command line program that executes scripts written in plain text. So to write R code, any text editor will do. That said, the most convenient way to write R code is to use RStudio. Think of RStudio as the equivalent of a pilot’s flight console. If R is your airplane, RStudio is the console that tells you about everything that’s going on. It looks like this:
Let’s run through the RStudio layout. In the top left panel is your code: whatever script you’re working on. In the bottom left is the R console, which shows you the commands that R is executing. The top right panel shows you the variables that R has stored in its memory. (You can click on these variables and inspect their contents.) Finally, the bottom right panel is where your plots live.
With RStudio, every step of the analysis (from data to code to results) is visible on one screen. This development format is incredibly useful for doing scientific analysis. Luckily it’s not restricted to R. Most modern statistics software now comes with a reasonably good IDE. (When working with Python, for instance, my go-to environment is Spyder.)
Speaking of statistical software, there are many proprietary alternatives to R — software like SAS, SPSS, Stata, and Statistica. All of these tools will get the job done. That said, if you’re choosing to learn a programming language, I recommend opting for one that is open source. For empirical analysis, that likely means either R or Python.
Let’s run through the advantages of using open source software. First, it’s free. That’s obviously good for you, since you don’t have to buy software. But it’s also good for other scientists who are interested in your work. If you share your code (on GitHub or on the Open Science Framework), other scientists can run it for free. That’s a big plus.
Another advantage of open-source tools like R is that they have a vibrant online community. That makes it easy to find help. If you have a question about R, chances are that someone else has already asked it (and had it answered) on Stack Overflow. I’m an expert R user, yet have literally never read a manual. I learned everything I know from the online help community. For proprietary software, this community is far less vibrant, so getting help is harder.
Perhaps the biggest reason to use an open-source tool like R is the ecosystem that surrounds it. The core R language is actually quite minimal. But it is surrounded by a huge ecosystem of packages that can meet your needs. (The same is true of Python.) The R Foundation maintains CRAN (the Comprehensive R Archive Network), a repository of about 18,000 packages. No matter how niche your analysis, chances are there’s an R package that will make your life easier.
Below are the 4 R packages I use the most.
Let’s walk through these packages. The data.table package is my workhorse for empirical analysis. The package provides a variable type called a ‘data table’, which is an extension of the ‘data frame’ variable that is part of base R. (A data frame is basically the R equivalent of a spreadsheet: a mixture of text and number data contained in named columns.)
Here’s why I use data.table so much. First, it comes with an extremely fast function for reading in data (fread). Second, it has powerful tools for analyzing data in groups. It turns out that much of the analysis I do involves running statistics on groups of data, for instance calculating averages by year. The data.table package provides a compact syntax for doing this type of analysis. And the code is blazing fast. For that reason, I load data.table in almost every R script I write.
Next up is the here package, an unassuming library whose job is to do one thing: tell R where the top-level folder of your project lives. Knowing this location allows you to write code with relative file paths (paths specified relative to the project folder rather than to one particular machine). Using relative file paths is good practice because it makes your code portable. You can move the project folder without wrecking the scripts that live inside it. That makes your code easier to share.
Let’s move on to the magrittr package, another small library that does one thing: it provides a ‘pipe’ function. A ‘pipe’ allows you to feed the output of one function into another function. This is convenient when you are applying a chain of functions.
For instance, to calculate the geometric mean of a set of numbers, you first take the logarithm of each number, then you take the mean of the result, and finally you exponentiate the output. Let x be my set of numbers. To get the geometric mean using standard R code, you’d write:
geom_mean = exp(mean(log(x)))
You read this function from the inside out using the rules of BEDMAS. That’s fine. But sometimes it’s more convenient to write your code like the English language … from left to right. To do that, you use a pipe function. Here’s the code rewritten using the magrittr pipe function:
geom_mean = x %>% log() %>% mean() %>% exp()
The %>% operator tells R to go from left to right and feed the results of each function into the next one. For certain types of chained analysis, I find this pipe syntax to be useful. Hence I load magrittr a lot.
Finally, we have ggplot2, which is my go-to package for plotting data. In my opinion, ggplot2 is the best plotting software that exists. I’ve written about how I use it here:
Virtually every plot on this blog is made using ggplot. So if you like my charts and want to produce something similar, learn ggplot.
That’s it for R. Let’s move on to C++, my other go-to language.
C++ for speed
After extolling the virtues of a modern, interpreted language like R, it seems crazy that I’d jump to an old compiled language like C++, a language with a reputation for being difficult to use. But hear me out.
I use C++ not for general programming, but mostly as an extension of R. I write specialized functions in C++ and then port them into R where I can use them in my analysis. I still marvel that this is possible. It works via the magic of the Rcpp package, which, as the name suggests, is used to port C++ (cpp) code into R. Developed by Dirk Eddelbuettel in the mid-2000s, the package has since become one of the most popular in the R ecosystem.
But why on Earth would you want to bring C++ code into R? In a word, speed. Because it is a compiled language, C++ is far faster than R. When you need to crunch a lot of data quickly, it’s worth the time to write the underlying code in C++. Over the years, I’ve created a toolbox of custom C++ functions that speed up my analysis. (If you’re interested, you can browse my toolbox at GitHub.)
I’m far from an expert C++ coder. But I’m competent enough to write code that solves my particular problems. To do that, I make frequent use of two libraries: Armadillo and Boost.
Armadillo is a scientific computing library that greatly simplifies doing numerical work in C++. It allows you to write C++ code in a syntax similar to MATLAB. If there’s some sort of vector operation you need, chances are that Armadillo has you covered.
Boost is a multi-purpose C++ library that has tools for just about everything. Recently I’ve been doing a lot of text analysis, and found that Boost has essential tools for working with text data. Importantly, you can use both Armadillo and Boost in R, via the RcppArmadillo and BH packages, respectively.
My typical workflow for coding in C++ is to write my functions in the Code::Blocks IDE. Then I port them into R to test them. Once the function is finished, I can either use it in R, or use it in a dedicated C++ program.
Shell scripts for running my code
Let’s move on to shell scripts, which I use to piece together other scripts.
My rule of thumb is to code in small chunks. If a particular script becomes more than a few hundred lines long, I get uneasy. Long scripts confuse me. My solution is to keep scripts short and to the point.
If the analysis is long and complicated, I’ll divide it between sequential scripts that I number in the order that they should be executed. For instance:
1_clean.R
2_analyze.R
3_plot.R
When I’m finished, I typically write a short shell script called runall.sh that will execute this code in the correct order.
You can write a runall script in any programming language. But I prefer the (Linux) shell for a few reasons. First, you can easily turn a shell script into an executable file, meaning you can run it by double-clicking it. Second, my analysis tends to mix different languages (R, C++, Python). The shell provides a simple way to call these different languages without fuss.
That said, shell scripts are OS specific. So you can’t run my Linux Bash script in your Windows PowerShell. Luckily, Linux is rapidly becoming the back end for almost everything on the internet. As such, Microsoft now ships a Linux kernel inside Windows through the Windows Subsystem for Linux (WSL). That means you can now run Bash scripts almost anywhere.
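Since a runall script can be written in any language, here is a minimal sketch of the same idea in Python, which sidesteps the OS-specific issue. It is an illustration, not the runall.sh described above: the `STEPS` list and the `run_all` function are hypothetical names, and it assumes the `Rscript` command is installed and on the search path.

```python
import subprocess

# Hypothetical pipeline: the three numbered scripts from the example above,
# each paired with the program assumed to run it.
STEPS = [
    ["Rscript", "1_clean.R"],
    ["Rscript", "2_analyze.R"],
    ["Rscript", "3_plot.R"],
]

def run_all(steps):
    """Run each command in order; stop and return the exit code on failure."""
    for command in steps:
        print("Running:", " ".join(command))
        result = subprocess.run(command)
        if result.returncode != 0:
            return result.returncode  # later steps depend on earlier ones
    return 0
```

Calling `run_all(STEPS)` from the project folder runs the scripts in sequence. Mixing languages only means changing the first element of a step, for example `["python", "scrape.py"]` (a made-up file name).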
Python for web scraping
Last up on my coding language list is Python. You probably know that Python is an incredibly popular language. It’s easy to learn, has a vibrant online community, and a huge ecosystem of libraries. So why don’t I use Python more? Mostly because I’m just faster at coding in R. If R can get the job done, I use it.
There is one area, though, where I’ve found Python to be essential: web scraping. The internet is full of interesting information that I want to analyze. But to do that, I first have to scrape it. Python has some of the best tools for doing so.
Let’s start off with how the web works. The web is coded in HTML, a language that tells your browser how to render a page. Since HTML is just plain text, you can use almost anything to scrape it. In R, for instance, there’s a simple function called readLines that reads the lines of a file. You can use this function to read a file on your computer. But you can also use it to read an HTML page on a distant server. Here’s the code to scrape the Wikipedia entry for ‘shell script’:
shell = readLines("https://en.wikipedia.org/wiki/Shell_script")
When done, the variable shell will contain all of the HTML code on the Wikipedia page. You can then do with it what you like.
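Since this section is about Python, here is a rough Python counterpart to the readLines example, plus one thing you might do with the HTML once you have it: pull out the links. This is a sketch using only the standard library. The names `fetch_html`, `LinkCollector` and `extract_links` are made up for illustration, and the User-Agent string is a placeholder. Dedicated scraping libraries offer far more than this.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

def fetch_html(url):
    """Download a page and return its HTML as a single string."""
    # Some servers reject requests with no User-Agent; this one is a placeholder.
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    """Return all link targets found in an HTML string, in page order."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links
```

With a network connection, `extract_links(fetch_html("https://en.wikipedia.org/wiki/Shell_script"))` would list the link targets on that page.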
(One of my big projects to date has been to scrape Patreon to understand how income and patronage are distributed on the site. Stay tuned for a post about that.)
Find your coding mix
I never set out to have the coding mix shown in Figure 1. I just started writing code to get the job done, and that’s what came out. If you’re a scientist who’s new to coding, my advice is to learn R or Python (or both).
R has a reputation for being slightly quirky — largely because it’s written by statisticians with the express purpose of analyzing data. For that reason, people with a programming background sometimes find the language a bit odd. But perhaps because my goal was always to analyze data, I never had that experience. I found R intuitive from the start.
Python, in contrast, is a general purpose language. Its syntax is simple. But the caveat is that it wasn’t designed with data analysis in mind. So whereas in R most statistical functions come preloaded, in Python you need to load them. Of course, once you know the library you want (probably NumPy and pandas), that’s not a problem.
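To illustrate the point about loading, here is a small sketch of the geometric mean (the exponential of the mean of the logarithms) in Python. The function name `geom_mean` and the sample numbers are made up for illustration. It uses the standard library's `math` and `statistics` modules rather than NumPy or pandas, but the pattern is the same: nothing statistical is available until you import it.

```python
import math
import statistics

def geom_mean(x):
    """Geometric mean of positive numbers: exp(mean(log(x)))."""
    logs = [math.log(value) for value in x]
    return math.exp(statistics.fmean(logs))

x = [1, 10, 100]  # illustrative data
print(geom_mean(x))                  # the chain written out step by step
print(statistics.geometric_mean(x))  # ready-made version (Python 3.8+)
```

Both lines should print a value very close to 10, up to floating-point rounding.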
Whatever programming language you choose, my advice is to jump in and learn as you go. Sure, you’ll make mistakes. But that’s how you get better. Happy coding!
[Cover image: Luis Quintero, altered]