Making Beautiful Charts Using R ggplot

Over the past few years, I’ve spent a huge chunk of time learning how to make beautiful graphics using R ggplot. In this post, I’m going to share what I’ve learned. I’m going to show you how to produce the following chart:

fig_14_final

This chart is from The Growth of Hierarchy and the Death of the Free Market. It shows how the employment of managers grows with energy use. The colored points show the results of a simulation in which hierarchy grows with energy. The ‘span of control’ is the number of subordinates controlled by each superior in simulated hierarchies. The simulation suggests that the growth of managers is caused by the growth of hierarchy. To learn more about the hierarchy model, try this app.

Try the code yourself

The data and code used in this post are available here. Download the code and mess with it. That’s the best way to learn.

The Data

To create the chart above, we’ll work with two sets of data, each contained in a CSV file. The file managers_energy_data.csv contains empirical data for the management share of employment (expressed as a percentage) and energy use per capita (in GJ) for a variety of countries. The file structure looks like this:

country_code  year  country               managers_employment_share  energy_pc
AGO           1991  Angola                8.75                       23.15
ARE           1992  United Arab Emirates  9.01                       451.64
EST           1996  Estonia               8.73                       172.34
TJK           2002  Tajikistan            1.94                       16.87

The file simulation.csv contains results for a simulation in which hierarchy grows with energy use. The simulation returns data for the employment share of managers (as a percentage), energy use per capita (in GJ), and the span of control within modeled hierarchies. The file structure looks like this:

managers_employment_share  energy_pc  span_of_control
5.24                       446.46     3.84
3.00                       481.16     5.02
1.97                       377.63     5.96
0.23                       12.50      5.09

We’ll use this data to show how the growth of hierarchy can explain the growth of management.

R

R is an open-source language that specializes in statistical analysis. In this post, I assume that you have R running on your computer using RStudio. You can download R here and RStudio here.

For an introduction to R, check out r-tutor.com and the official introduction.

ggplot2

I do all of my plotting using ggplot2, an R package written by Hadley Wickham. The ‘gg’ refers to ‘grammar of graphics’ — a philosophy in which graphics are built up one layer at a time. To learn ggplot, I suggest you read Wickham’s book and then use this reference page to answer your questions.

To load the ggplot library, we run:


# load ggplot
library(ggplot2)

A basic plot

To make a basic plot, the first thing we're going to do is load our two data sets:


# load data
managers_energy = read.csv("managers_energy_data.csv")
simulation = read.csv("simulation.csv")
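Before plotting, it can help to confirm that the data loaded with the columns and types you expect. The sketch below is self-contained, so instead of reading the files it parses a few rows copied from the tables above. The comma-separated layout and the names sample_csv and managers_sample are assumptions for illustration only.

```r
# A few rows from the empirical table above, written as comma-separated text.
# (Assumption: the real file is comma-separated with a header row.)
sample_csv = "country_code,year,country,managers_employment_share,energy_pc
AGO,1991,Angola,8.75,23.15
ARE,1992,United Arab Emirates,9.01,451.64
EST,1996,Estonia,8.73,172.34
TJK,2002,Tajikistan,1.94,16.87"

# read.csv() can parse a text string in place of a file name
managers_sample = read.csv(text = sample_csv)

# show column names, column types, and the first values
str(managers_sample)

# the two variables we plot should both be numeric
is.numeric(managers_sample$energy_pc)
is.numeric(managers_sample$managers_employment_share)
```

Running str() on the real managers_energy and simulation data frames gives the same kind of summary.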

Next, we'll make a basic ggplot. Compared to other plotting languages, ggplot syntax might seem weird at first. In ggplot, we build the plot one layer at a time. The first thing we do is create a blank canvas by calling the ggplot() command:


# blank ggplot
manager_plot =  ggplot() 

This creates a blank ggplot called manager_plot. To this canvas, we'll add different 'geometric objects'. In ggplot notation, each of these geometric objects is called a geom. The geom tells ggplot how we want the data represented. To represent the data using points, we use geom_point. To represent the data using lines, we use geom_line, and so on. Here we'll use points:


# basic ggplot syntax
manager_plot =  ggplot() + geom_point()

This is the basic syntax of a ggplot chart. We first call ggplot(), and then add features to the plot using the + sign.

Next we need to add data. Inside geom_point, we tell ggplot to use managers_energy as the source data:


# add data
manager_plot =  ggplot() +
                geom_point(data = managers_energy)

We can also put the command data = managers_energy inside the ggplot() command, as in ggplot(data = managers_energy). Personally, I don't like to do this because my plots usually combine different datasets. Putting the data inside the ggplot() command makes it the default for every layer, so any layer that uses a different dataset has to override it. I find it clearer to name the data inside each geom.
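Strictly speaking, data placed inside ggplot() acts as a default that individual layers can still override. Here is a minimal sketch of that behavior, using a few rows from the tables above in place of the full files. The names empirical_sample, simulation_sample, and inherit_plot are made up for this example, and it uses aes(), which is explained in the next step.

```r
library(ggplot2)

# a few rows from the two tables shown earlier
empirical_sample = data.frame(
  energy_pc = c(23.15, 451.64, 172.34, 16.87),
  managers_employment_share = c(8.75, 9.01, 8.73, 1.94)
)
simulation_sample = data.frame(
  energy_pc = c(446.46, 481.16, 377.63, 12.50),
  managers_employment_share = c(5.24, 3.00, 1.97, 0.23)
)

# data and aesthetics set in ggplot() become the defaults for all layers
inherit_plot = ggplot(data = empirical_sample,
                      aes(x = energy_pc, y = managers_employment_share)) +
  geom_point() +                                      # uses empirical_sample
  geom_point(data = simulation_sample, color = "red") # overrides the default

# layer_data() returns the data that each layer actually plots
layer_data(inherit_plot, 1)$x
layer_data(inherit_plot, 2)$x
```

The first layer reports the empirical energy values and the second reports the simulation values, so both approaches can produce the same chart. Which one to use is a matter of taste.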

Next, we tell ggplot about the 'aesthetics' we want, using the aes() command. We tell ggplot that the x-axis should plot energy_pc and the y-axis should plot managers_employment_share. This gives us the syntax for a basic ggplot:


# basic plot of managers vs. energy use
manager_plot =  ggplot() +
                geom_point(data = managers_energy,
                aes(x = energy_pc, y = managers_employment_share))

The plot looks like this:

fig_01_basic
Basic plot of managers vs. energy use

Refining the chart

The secret to good data visualization, I've found, is the refinements that come after you've created a basic chart. These refinements highlight the aspects of the data that you want to showcase.

First, let's refine the size of our data points. My philosophy is that the point size of scatter plots should vary inversely with the number of points. If you have only a few data observations, you want large points so you can see the data. But if you have many data observations (thousands or millions), you want to shrink the point size so that you can actually see all the data.

In our managers plot, we've got quite a few data observations. So let's shrink the point size from the ggplot default. To do this, we'll put size = 0.8 inside geom_point. For reasons that I'll discuss later, this size command doesn't go inside the aesthetic command aes().


# smaller point size
manager_plot = ggplot() +
                geom_point( data = managers_energy,
                size = 0.8,
                aes(x = energy_pc, y = managers_employment_share))

Reducing the point size in our scatter plot gives us:

fig_02_size
Smaller point size

The next thing I notice about the plot is that the data is crushed against the origin. When you see this happen, it's a good sign that you need to use logarithmic scales. Log scales spread the data out so that we can see variation in all the observations, not just the largest ones.

Let's tell ggplot to use logarithmic scales instead of linear scales:


# add log scales
manager_plot =  manager_plot +
                scale_x_log10() +
                scale_y_log10()

Here I'm using a handy feature of ggplot: it lets you add layers to an existing plot incrementally. Having defined manager_plot, we tell ggplot to change the axes by adding commands to the original plot. To be honest, I don't use this feature very often. But it's useful here because I can highlight the new code that I'm adding with each refinement to the chart. Changing to log scales gives us:

fig_03_log_scale
Add log scales

Now the scatter plot looks much better. We can actually see the trend across countries.
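To see why log scales spread the data out, here is a small base R sketch. The three energy values are picked for illustration (each is ten times the last) and are not taken from the data.

```r
# illustrative energy values, each ten times the previous one
energy = c(5, 50, 500)

# on a linear axis the gaps between them are very unequal
diff(energy)          # 45 450

# on a log axis each tenfold step covers the same distance
diff(log10(energy))   # 1 1
```

On a linear axis, the low-energy countries get squeezed into a sliver of the chart. On a log axis, each factor of ten gets equal room.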

Next, let's tweak the values on the axes. When log scales span only a few orders of magnitude, I like to add numbers in between the factors of ten. To change the axis numbers, we use the breaks argument. To make custom breaks, we use the concatenate command c(). If I wanted axis labels of 1, 5, and 10, I'd write breaks = c(1, 5, 10). Here are the custom breaks that I'll use:


# better axis breaks
manager_plot = manager_plot +
  scale_x_log10(breaks = c(5,10,20,50,100,200,500,1000)) +
  scale_y_log10(breaks = c(0.1,0.2,0.5,1,2,5,10,20))

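Because manager_plot already has log scales, ggplot prints a message saying that the new scales replace the existing ones. That's what we want here. The result is a plot with better axis numbers: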

fig_04_breaks
Better axis breaks
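The breaks above follow a 1-2-5 pattern repeated across each factor of ten. If you don't want to type them by hand, one possible way to generate them in base R is sketched below. The names decades and log_breaks are made up for this example.

```r
# the decades to cover: 0.1, 1, and 10
decades = 10^(-1:1)

# multiply each decade by 1, 2, and 5, then flatten and sort
log_breaks = sort(as.vector(outer(c(1, 2, 5), decades)))
log_breaks
```

This returns 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50. Breaks that fall outside the range of the data, such as the 50 here, don't get drawn.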

Next, let's fix our axis labels. By default, ggplot will use your variable names as the axis labels. This is rarely what you want in your final plot. To change the axis labels we use the command labs(). While we're at it, we'll add a title to the chart using ggtitle():


# descriptive labels and title
manager_plot = manager_plot +
  labs(x = "Energy use per capita (GJ)",
       y = "Managers (% of Total Employment)" ) +
  ggtitle("Managers Employment vs. Energy Use")

Now our plot has better labels:

fig_05_labels
Descriptive labels and title

Adding simulation data

To our empirical data, we'll now add the simulation data. We're going to use one of the nicest features of ggplot: the ability to use color to represent changes in a variable. To do this, we put the color command inside the aesthetics, aes().

In our simulation, we want energy_pc on the x-axis, managers_employment_share on the y-axis, and span_of_control in color. To plot this using points, we write:


# plot simulation data with span of control indicated by color
  geom_point(data = simulation,
             aes(x = energy_pc,
                 y = managers_employment_share,
                 color = span_of_control)
             ) 
 

The logic here is that any aesthetic getting mapped onto variables goes inside the aes() command. If I wanted point size to be a function of the span_of_control, I would write:


# point size as function of span of control
  geom_point(data = simulation,
             aes(x = energy_pc,
                 y = managers_employment_share,
                 size = span_of_control)
             ) 

But if I want to set the size of points to a single value, this goes outside the aes() command.


# point size has a single value
  geom_point(data = simulation,
             size = 0.1,
             aes(x = energy_pc,
                 y = managers_employment_share,
                 color = span_of_control)
             ) 
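To check the difference between mapping and setting an aesthetic, we can ask ggplot what size it assigns to each point. This is a sketch using the four simulation rows shown earlier. The names simulation_sample, mapped_plot, and fixed_plot are made up for this example.

```r
library(ggplot2)

# the four simulation rows from the table earlier in the post
simulation_sample = data.frame(
  managers_employment_share = c(5.24, 3.00, 1.97, 0.23),
  energy_pc = c(446.46, 481.16, 377.63, 12.50),
  span_of_control = c(3.84, 5.02, 5.96, 5.09)
)

# size inside aes(): mapped to a variable, so it differs by point
mapped_plot = ggplot() +
  geom_point(data = simulation_sample,
             aes(x = energy_pc,
                 y = managers_employment_share,
                 size = span_of_control))

# size outside aes(): a single value shared by every point
fixed_plot = ggplot() +
  geom_point(data = simulation_sample,
             size = 0.1,
             aes(x = energy_pc,
                 y = managers_employment_share))

# layer_data() shows the size that ggplot computed for each point
mapped_sizes = layer_data(mapped_plot)$size
fixed_sizes = layer_data(fixed_plot)$size
mapped_sizes
fixed_sizes
```

The mapped version returns a different size for each span of control, while the fixed version returns 0.1 four times.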

Let's add the simulation data to our management plot. We want the simulation data to appear under the empirical data, so we have to add it to the ggplot before adding the empirical data.

Because we don't want the simulation data to overwhelm the empirical data, we're going to make the simulation data partially transparent. This makes it feel like it's in the background.

In ggplot, we set the transparency of our points using the alpha argument. alpha = 0 is completely transparent. alpha = 1 is completely opaque. We'll add alpha = 0.3 inside our geom. Here's the code with the simulation data added to the empirical data, along with all the refinements so far:


# add simulation data
manager_plot = ggplot() +
  geom_point(data = simulation,
             size = 0.1,
             alpha = 0.3,
             aes(x = energy_pc,
                 y = managers_employment_share,
                 color = span_of_control)
             ) +
  geom_point(data = managers_energy,
             size = 0.8,
             aes(x = energy_pc,
                 y = managers_employment_share)
             ) +
  scale_x_log10(breaks = c(5,10,20,50,100,200,500,1000)) +
  scale_y_log10(breaks = c(0.1,0.2,0.5,1,2,5,10,20)) +
  labs(x = "Energy use per capita (GJ)",
       y =  "Managers (% of Total Employment)") +
  ggtitle("Managers Employment vs. Energy Use") 

This code gives us:

fig_06_simulation
Add simulation data
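Layers are drawn in the order they are added, so the first geom ends up at the bottom. The sketch below confirms the order and settings of the two layers, using a few rows from the tables above in place of the full files. The names simulation_sample, empirical_sample, and layered_plot are made up for this example.

```r
library(ggplot2)

# a few rows from the two tables shown earlier
simulation_sample = data.frame(
  managers_employment_share = c(5.24, 3.00, 1.97, 0.23),
  energy_pc = c(446.46, 481.16, 377.63, 12.50),
  span_of_control = c(3.84, 5.02, 5.96, 5.09)
)
empirical_sample = data.frame(
  energy_pc = c(23.15, 451.64, 172.34, 16.87),
  managers_employment_share = c(8.75, 9.01, 8.73, 1.94)
)

layered_plot = ggplot() +
  # added first, so drawn first (background)
  geom_point(data = simulation_sample,
             size = 0.1,
             alpha = 0.3,
             aes(x = energy_pc,
                 y = managers_employment_share,
                 color = span_of_control)) +
  # added second, so drawn on top (foreground)
  geom_point(data = empirical_sample,
             size = 0.8,
             aes(x = energy_pc,
                 y = managers_employment_share))

# layer 1 is the simulation, layer 2 is the empirical data
background = layer_data(layered_plot, 1)
foreground = layer_data(layered_plot, 2)
unique(background$alpha)   # transparency of the simulation points
unique(foreground$size)    # size of the empirical points
```

Swapping the two geom_point() calls would put the simulation points on top of the empirical ones.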

More refinements

After adding the simulation data, we need to do more plot refining. First, the simulation data spans a far greater range than the empirical data. So now our empirical data is compressed into the corner of the chart. We don't want that.

We'll fix this by limiting the x-y range of the chart using the command coord_cartesian(). Inside the command we put the x and y range that we want. I'll restrict x to range from 5 to 1000 and y from 0.1 to 30. We use the concatenate function c() to denote these limits:


# limit plot range
manager_plot = manager_plot +
  coord_cartesian(xlim = c(5,1000), ylim = c(0.1,30)) 

Our plot now looks like this:

fig_07_cartesion
Limit plot range
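It's worth knowing that coord_cartesian() only zooms the view. The points outside the window stay in the plot data. Setting limits on the scales instead would, by default, drop those points before drawing. Here's a small sketch of the zoom behavior using the four simulation rows shown earlier. The window is deliberately narrower than in the real chart so that one point falls outside it, and the names simulation_sample, zoom_plot, and zoom_data are made up for this example.

```r
library(ggplot2)

# the four simulation rows from the table earlier in the post
simulation_sample = data.frame(
  managers_employment_share = c(5.24, 3.00, 1.97, 0.23),
  energy_pc = c(446.46, 481.16, 377.63, 12.50)
)

# zoom to a window that excludes the point at 12.50 GJ
zoom_plot = ggplot() +
  geom_point(data = simulation_sample,
             aes(x = energy_pc, y = managers_employment_share)) +
  scale_x_log10() +
  scale_y_log10() +
  coord_cartesian(xlim = c(100, 1000), ylim = c(1, 30))

# all four rows are still present in the data behind the plot
zoom_data = layer_data(zoom_plot)
nrow(zoom_data)
```

This matters most when a layer computes something from the data, such as a trend line, because a zoomed plot still uses every observation.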

Notice that ggplot has again used variable names to label the plot, this time for the color legend. We fix this using the labs() command. We want to label the color scale "Span of Control", so we write:


# descriptive label for color legend  
manager_plot = manager_plot +
  labs(color = "Span of Control")

We get:

fig_08_span_label
Descriptive label for color legend

Adding the label creates a new problem. The label is too long and compresses the graph. To fix this, we add a line break to the label using \n:


# line break in legend label
manager_plot = manager_plot +
  labs(color = "Span of\nControl")

We now get:

fig_09_span_label_line
Line break in legend label

Now let's refine the colors used by ggplot to represent the span of control. By default, ggplot uses shades of blue. I prefer to use the whole color spectrum. To represent the span of control using a rainbow with 8 colors, we write:


# rainbow colors for span of control
manager_plot = manager_plot +
  scale_color_gradientn( colours = rainbow(8) )

Now the chart is starting to pop!

fig_10_rainbow
Rainbow colors for span of control

But if we're picky (and we should be), we see that the rainbow on the color legend is upside down compared to the rainbow in the chart. Let's fix that by reversing the direction of the legend:


# reverse color legend
manager_plot = manager_plot +
  scale_color_gradientn(colours = rainbow(8),
                        guide = guide_colourbar(reverse = TRUE))

Now the legend and the chart have matching rainbows:

fig_11_rainbow_reverse
Reverse direction of color legend
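The colours argument takes an ordinary vector of color strings, so it's easy to look at what rainbow(8) actually produces. A quick sketch in base R (the name palette_8 is made up for this example):

```r
# rainbow() comes with base R (the grDevices package), not ggplot2
palette_8 = rainbow(8)

# eight color strings in hex notation, starting at red
palette_8
length(palette_8)
```

Any other vector of colors can be passed to scale_color_gradientn() in the same way.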

The plot theme

The default ggplot theme uses a grey background. We can change the theme using the theme command. I prefer the black and white theme, theme_bw():


# black and white theme
manager_plot = manager_plot + theme_bw()

Our plot now looks like this:

fig_12_black_white
Black and white theme

I also prefer serif fonts over sans-serif. Let's change the font to Times:


# change font to Times
manager_plot = manager_plot +
  theme(text=element_text(size = 10, family="Times"))

Our chart is looking close to the final version:

fig_13_times
Change font to Times

The last thing we'll do is add my personal theme that I use for all my plots. This theme removes the grid lines and flips the tick marks to the inside of the plot box. It also centers the plot title and makes it bold. Here's the code:


theme(panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      plot.title = element_text(face="bold", size = rel(1), hjust = 0.5),
      axis.line = element_line(color = "black"),
      axis.title.x = element_text(vjust= 0, size=rel(0.9)),
      axis.title.y = element_text(vjust= 1.1, size=rel(0.9)),
      axis.text.x = element_text(margin=margin(5,5,0,0,"pt")),
      axis.text.y = element_text(margin=margin(3,5,0,3,"pt")),
      axis.ticks.length = unit(-0.7, "mm"),
      text=element_text(size = 10, family="Times"))
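If you use the same theme for every chart, one option is to store it in a variable and add it to each plot. The sketch below bundles theme_bw() with the settings above. The names my_theme, simulation_sample, and sample_plot are made up for this example, and the bundling is just one way to organize things, not something the chart requires.

```r
library(ggplot2)

# hypothetical name: theme_bw() plus the custom settings listed above
my_theme = theme_bw() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        plot.title = element_text(face = "bold", size = rel(1), hjust = 0.5),
        axis.line = element_line(color = "black"),
        axis.title.x = element_text(vjust = 0, size = rel(0.9)),
        axis.title.y = element_text(vjust = 1.1, size = rel(0.9)),
        axis.text.x = element_text(margin = margin(5, 5, 0, 0, "pt")),
        axis.text.y = element_text(margin = margin(3, 5, 0, 3, "pt")),
        axis.ticks.length = unit(-0.7, "mm"),
        text = element_text(size = 10, family = "Times"))

# the four simulation rows from the table earlier in the post
simulation_sample = data.frame(
  managers_employment_share = c(5.24, 3.00, 1.97, 0.23),
  energy_pc = c(446.46, 481.16, 377.63, 12.50)
)

# the stored theme is added like any other plot component
sample_plot = ggplot() +
  geom_point(data = simulation_sample,
             aes(x = energy_pc, y = managers_employment_share)) +
  my_theme
```

With the theme stored this way, a change to my_theme carries over to every plot that uses it.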

Putting all the steps together, here's the finished code for the graphic:


# all code with custom theme
manager_plot = ggplot() +
  geom_point(data = simulation,
             size = 0.1,
             alpha = 0.3,
             aes(x = energy_pc,
                 y = managers_employment_share,
                 color = span_of_control)
  ) +
  geom_point(data = managers_energy,
             size = 0.8,
             aes(x = energy_pc,
                 y = managers_employment_share)
  ) +
  scale_x_log10(breaks = c(5,10,20,50,100,200,500,1000)) +
  scale_y_log10(breaks = c(0.1,0.2,0.5,1,2,5,10,20)) +
  labs(x = "Energy use per capita (GJ)",
       y =  "Managers (% of Total Employment)",
       color = "Span of\nControl") +
  ggtitle("Managers Employment vs. Energy Use") +
  coord_cartesian(xlim = c(5,1000), ylim = c(0.1,30)) +
  scale_color_gradientn(colours = rainbow(8),
                        guide = guide_colourbar(reverse = TRUE)) +
  theme_bw() +
  theme(panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      plot.title = element_text(face="bold", size = rel(1), hjust = 0.5),
      axis.line = element_line(color = "black"),
      axis.title.x = element_text(vjust= 0, size=rel(0.9)),
      axis.title.y = element_text(vjust= 1.1, size=rel(0.9)),
      axis.text.x = element_text(margin=margin(5,5,0,0,"pt")),
      axis.text.y = element_text(margin=margin(3,5,0,3,"pt")),
      axis.ticks.length = unit(-0.7, "mm"),
      text=element_text(size = 10, family="Times"))

The final chart looks like this:

fig_14_final
All code with custom theme

Great visualizations require tinkering

I hope this post has given you a sense for how to use ggplot to make publication-quality graphics. I also hope it has given you an idea of the work it takes to make great-looking charts.

Rarely will any plotting software give you a great chart with its default settings. Making a chart pop requires tinkering. Sometimes I spend days on a single chart, going down a Google rabbit hole trying to figure out the code for the feature I want. The process can be frustratingly slow. But it's also rewarding. A great chart is often the best way to get your research across to your audience.

That's it for this post. If you have questions, leave a comment. I may be able to work answers into future posts.

Good luck with your ggplot adventures!


Support this blog

Economics from the Top Down is where I share my ideas for how to create a better economics. If you liked this post, please consider becoming a patron. You'll help me continue my research, and continue to share it with readers like you.
