In the Beginning … There Were Loops

Today I’m going to take a break from political economy and talk about computer science. The reason? Nowadays a large part of doing science involves writing code. Scientists write code to collect data, to analyze it, to plot it and to run models that explain it. Suffice it to say that modern scientists (myself included) spend a lot of time coding.

Computing is, in many ways, about repetition. It’s about doing a task over and over. It’s about loops.

Because of the ubiquity of repetition, the loop is probably the most important piece of computer syntax. A loop does just what the name implies — it does something over and over. Here’s the classic example of when a loop is useful. Suppose we have two sets of numbers:

A = (1, 2, 3)
B = (4, 5, 6)

Now suppose I want to add each element in A and B (i.e. 1+4, 2+5, 3+6). Enter the loop. For each element i in our sets of numbers, we tell the computer to add A and B:

for i in 1 to 3:
    sum[i] = A[i] + B[i]

This code gives us a new list (sum) that contains the sum of the elements in A and B. Pretty nifty. (FYI, I’m not using any specific coding syntax here. I’m just mimicking the logic of a loop.)
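In a real language the pattern is the same. Here's the loop in Python (any language would do; Python just makes for a compact illustration):

```python
# Two lists of numbers
A = [1, 2, 3]
B = [4, 5, 6]

# Loop over the indices, adding element by element
total = [0, 0, 0]
for i in range(3):
    total[i] = A[i] + B[i]

print(total)  # [5, 7, 9]
```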

It’s fun to write this loop syntax when you’re first learning to code. But after a while it becomes tedious. Adding two lists together is something you’ll do all the time. Do you really want to write a loop every time you add two sets of numbers? Probably not.

Luckily there’s an easier method. Somewhere along the way, computer programmers realized that you could embed loops in a simpler syntax. Suppose that A and B are defined as above. Now imagine that instead of writing a loop, to add each element in A and B we just write:

sum = A + B

This syntax is wonderfully simple and far more human readable than the syntax in our loop. But it’s not like the loop has gone away. It’s just hidden beneath the surface. The loop is implicit in the ‘+’ symbol. When you call this symbol, the computer knows that you actually mean ‘loop over the elements of A and B and add them together’.

Almost all modern programming languages have this type of simplified syntax. It’s called ‘overloading’. The ‘+’ symbol is overloaded to stand for both the addition of single numbers (3 + 4) and the addition of lists of numbers. The computer is smart enough to figure out what you mean based on the context.
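To see how this works mechanically, here's a toy Python sketch (a made-up `Vec` class, not how numpy or R actually implement it) in which '+' is overloaded so that the loop hides beneath the symbol:

```python
class Vec:
    """A toy list of numbers that overloads the '+' symbol."""

    def __init__(self, values):
        self.values = list(values)

    def __add__(self, other):
        # The loop hasn't gone away: it lives here, under '+'
        return Vec([a + b for a, b in zip(self.values, other.values)])

A = Vec([1, 2, 3])
B = Vec([4, 5, 6])

total = A + B
print(total.values)  # [5, 7, 9]
```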

You can get this simplified syntax in many different programming languages. In C++, you can get it using the Armadillo library (which I use all the time). To get it in Python, use numpy. And if you’re really old school, this syntax has even been part of Fortran since Fortran 90.

I spend most of my time programming in the statistical language R. More than any other language, R has taken the embedded-loop syntax to heart. R programmers call this ‘vectorization’. But that’s just a fancy word for a syntax that calls loops below the surface. In R, any syntax that applies to a single number also applies to a list (vector) of numbers.

This feature fantastically simplifies data analysis. Suppose I have data for the energy use (energy) and population (population) of countries of the world. To calculate energy use per capita in each country, I just write:

energy_pc = energy / population

The loop, of course, has not gone away. It’s there under the hood. When you use this syntax in R, you’re actually calling a loop written in a lower-level language. The figure below shows the composition of the core R language. It turns out that R consists mostly of C and Fortran code. So when you call energy / population, you’re calling a C or Fortran loop.
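Spelled out with made-up numbers (and with Python standing in for the compiled code), the hidden loop behind the division looks like this:

```python
# Hypothetical data: energy use and population for three countries
energy = [100.0, 250.0, 30.0]
population = [10.0, 50.0, 3.0]

# What the vectorized 'energy / population' does under the hood:
# a loop that divides element by element
energy_pc = [e / p for e, p in zip(energy, population)]

print(energy_pc)  # [10.0, 5.0, 10.0]
```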

Figure 1: Language composition of base R. Source: Revolution Analytics

A philosophical aside

On a philosophical note, there’s an interesting parallel between how programming languages have evolved, and how life itself has evolved. Organisms grew more complex by building off of simpler sub-components. DNA is built from simpler organic molecules. Cells are built from organelles. Multicellular organisms are built from cells. And so on.

The same appears true of programming languages. Modern languages like R and Python are built on lower-level languages like C and Fortran. These lower-level languages, in turn, are built from still-more-basic assembly languages. And assembly language is ultimately constructed of machine code — aka binary.

Opening up this nested code structure is like peering into history — retracing the evolution of computation. Each step lower in the nested hierarchy is a step back in time. Modern languages like R and Python arose in the 1990s. The lower-level language C appeared in 1972. Fortran appeared in 1957. The first assembly language appeared in 1949. The first computer algorithm dates to the 1840s (created by Ada Lovelace). And binary numbers — the basis for machine code — go back still further to the 17th-century work of Gottfried Leibniz.

So when I code in R, I’m invoking four centuries of human problem solving. On a grander scale still, as my fingers type that code, they’re invoking 3 billion years of evolution.

Making your own lower-level function

OK, enough grand philosophy. Back to more practical matters. When you use higher-level languages like R, you’re building on the problem solving of other people. Most of the time this is a boon. It means you don’t have to reinvent the wheel.

Sometimes, however, it pays to take a step into lower-level languages. Often as scientists, we wade into uncharted territory, doing analysis that’s never been done before. This means we need original code.

Now obviously you can do anything you want in a modern language like R. But what you gain in convenience, you lose in speed. R is what’s called an interpreted language. This means that your human-readable code is converted into machine code on the fly — at the same time as your script is executed. This interpretation means you can run your code one line at a time — which is great for figuring out where the hell you’ve messed up. But it slows down execution speed.

It all comes back to loops. In R, (explicit) loops are notoriously slow. That’s because the interpreter’s overhead is paid on every iteration: each time through the loop, your computer reinterprets your high-level code. That slows things down … a lot.

Newcomers to R are often shocked at the slow speed of loops, thinking it’s a major drawback of the language. But it isn’t (usually). R is designed so that loops are invoked under the hood. They’re embedded in what R programmers call ‘vectorized’ syntax. So in R, you would never write a loop to add two lists of numbers. You’d just write A + B.

Sometimes, though, explicit loops can’t be avoided. When this happens, there are three strategies:

  1. Put the loop at the highest level possible
  2. Make the loop parallel
  3. Write the loop in a lower-level language

The idea behind strategy 1 is to loop as few times as possible. Inside the high-level loop, you do a bunch of smaller tasks — each of which is vectorized (i.e. uses R’s embedded loop syntax). By minimizing the number of times your loop is reinterpreted, you speed up your code.
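As a sketch of strategy 1 (in Python for illustration), the trick is to let the interpreted loop run only a handful of times, with each pass handing the heavy lifting to a compiled built-in:

```python
# Several datasets to process
datasets = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# High-level loop: one iteration per dataset, not one per number.
# The built-in sum() plays the role of vectorized syntax; its
# internal loop runs in compiled code, not in the interpreter.
totals = [sum(d) for d in datasets]

print(totals)  # [6, 9, 30]
```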

Strategy 2 is the brute force approach, taking advantage of modern computer power. All modern computer processors have ‘parallel architecture’, meaning the CPU has multiple ‘cores’ that can each do tasks independently. When you make your loop ‘parallel’, you divide up the work among different CPU cores. The more cores you have, the faster the loop. (Here’s a good tutorial for implementing parallel loops in R.)
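Here's a minimal sketch of a parallel loop, using Python's standard concurrent.futures module (a thread pool for simplicity; for CPU-bound loops you would reach for a process pool so each core works independently):

```python
from concurrent.futures import ThreadPoolExecutor

def task(x):
    # Stand-in for one independent iteration of the loop
    return x * x

inputs = range(8)

# Divide the iterations among 4 workers, each running independently
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(task, inputs))

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```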

Strategy 3 is, in my opinion, the most elegant. When you want a really fast loop (and there’s no base R syntax that implements the loop under the hood) you write your own lower-level function.

This is surprisingly easy to do. R has a library called Rcpp whose express purpose is to integrate C++ code with R. This allows you to write fast code in the lower-level C++ language, and then call the code from R. This feature is so popular (especially among developers) that Rcpp usage has exploded over the last decade.

Figure 2: Use of Rcpp. Source: R-bloggers

When to move to a lower-level language

You should move to a lower-level language when strategy 1 (large-scale looping) and strategy 2 (parallel looping) don’t work. Recursive operations are a good example.

In a recursive operation, the output depends on what has come before. The Fibonacci sequence is a simple example. To construct this sequence, start with the numbers 0 and 1. The next number in the sequence is the sum of the previous two. The sequence looks like this:

0, 1, 1, 2, 3, 5, 8, 13, 21, 34, ...

Building the Fibonacci sequence is easy enough. You write a loop like this:

f[1] = 0
f[2] = 1

for i in 3 to n:
    f[i] = f[i-1] + f[i-2]

Looking at this loop, you’ll see that strategies 1 and 2 don’t apply. The loop is small scale: we want to do a tiny operation many times. So strategy 1 won’t work. And the loop is recursive, so strategy 2 won’t work. Parallel processing only helps when tasks can be divided up independently. But when the operation is recursive, each iteration depends on what came before, so there’s no way to split up the work. Parallel processing is no help.
That leaves strategy 3 — write the loop in a lower-level language.
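In runnable form (Python again, purely for illustration), the Fibonacci loop looks like this:

```python
def fibonacci(n):
    """Return the first n Fibonacci numbers, built with a recursive loop."""
    f = [0, 1]
    for i in range(2, n):
        # Each element depends on the two before it, so the
        # iterations cannot be divided among CPU cores
        f.append(f[i - 1] + f[i - 2])
    return f[:n]

print(fibonacci(10))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```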

An example of recursion

The Fibonacci sequence is a simple example of recursion, but one that’s not particularly relevant to economics. An example that’s more relevant is a random walk in which step size depends on position. Imagine a person who walks forwards or backwards based on the flip of a coin:

heads = step forward 
tails = step backwards

The person’s step size, though, depends on their current position:

step size = f(position)

In code, we can model position like this:

position[1] = initial_position

for i in 2 to n:
    step_size = f(position[i-1])
    position[i] = position[i-1] + coin_flip*step_size

Like the Fibonacci sequence, we have here a recursive loop at the small scale. So strategies 1 and 2 don’t apply. If you want this random-walk code to be fast, you need to move to a lower-level language.

This type of recursive random walk happens all the time in economics. The growth of a firm, for instance, can be modeled as a random walk. At each point in time, the firm will randomly add or lose employees. But the rate of this addition depends on the firm’s size. Think about it this way. It’s a big deal if a mom-and-pop shop hires 5 new employees. But if Walmart adds 5 employees, it’s a rounding error. On any given day, Walmart probably hires/fires thousands of employees. So with firms, the probability of adding/losing employees depends recursively on firm size. It’s a position-dependent random walk.

A position-dependent random walk

Here’s a simple example of a position-dependent random walk. We’ll assume that moving forward/backward depends on a coin toss. The size of the step depends on the absolute value of your position, raised to some power n:

step_size = abs(position)^n

If n is 0, we recover a simple random walk in which step size is independent of position. If n is greater than 0, then the step size grows with position. This dependence on position leads to funny behavior. A position-dependent random walker tends to stay close to home. But occasionally there are large leaps, followed by large crashes.
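To make the logic concrete, here's a slow, pure-Python sketch of such a walk (rand_walk_py is a name I've made up for illustration; a fast version belongs in a lower-level language):

```python
import random

def rand_walk_py(initial_position, base_step, step_power, n_steps, seed=None):
    """A position-dependent random walk, written as a plain (slow) loop."""
    rng = random.Random(seed)
    walk = [initial_position]
    for _ in range(1, n_steps):
        coin_flip = 1 if rng.random() < 0.5 else -1
        # Step size depends recursively on the current position
        step = base_step * abs(walk[-1]) ** step_power
        walk.append(walk[-1] + coin_flip * step)
    return walk

# With step_power = 0, every step has size base_step: a simple random walk
walk = rand_walk_py(initial_position=1, base_step=1, step_power=0, n_steps=1000, seed=42)
print(len(walk))  # 1000
```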

Random walks are everywhere in the natural and social world, so it’s important to be able to model them efficiently. In R, that means writing your random-walk code in C++ using Rcpp. I’ve attached Rcpp code that creates the function rand_walk. It takes the following inputs:

  • initial_position = where your random walker starts
  • base_step = the step size in a basic random walk
  • step_power = the exponent in the step-size function [step_size = base_step*abs(position)^step_power]
  • n_steps = the number of steps in your random walk

Call the rand_walk function like this:

position = rand_walk(initial_position,
                     base_step,
                     step_power,
                     n_steps)

Here’s a simple random walk in which step size is independent of position:

position = rand_walk(initial_position = 1,
                     base_step = 1,
                     step_power = 0,
                     n_steps = 10000)

It looks like this:

Figure 3: A random walk with a constant step size.

Here’s a random walk where the step size depends on the square root of position:

position = rand_walk(initial_position = 1,
                     base_step = 1,
                     step_power = 0.5,
                     n_steps = 10000)

The resulting random walk looks quite different (Figure 4). Most of the time, our walker stays close to their initial position. But every so often they leap far away and then leap back. This behavior happens because of position dependence. When the walker is close to home, their step sizes are small. But as they get farther away from their initial position, their step sizes grow and their movement becomes more volatile.

Figure 4: A random walk where step-size depends on position.

Rcpp code

I’ve attached below the code for rand_walk. To use it, paste the code into a file called rand_walk.cpp. Install the Rcpp package and then run sourceCpp("rand_walk.cpp"). rand_walk will then be available to call like a normal R function. This gives you the speed of C++ and the convenience of R — a powerful combination.

#include <Rcpp.h>
#include <cmath>

using namespace Rcpp;

// [[Rcpp::export]]
NumericVector rand_walk( double initial_position,
                         double base_step,
                         double step_power,
                         int n_steps
                       )
{

    // vector to hold the random walk
    NumericVector walk(n_steps);

    // initial position
    walk[0] = initial_position;

    for(int i = 1; i < n_steps; i++){

        // coin toss: -1 or +1 with equal probability
        double coin_toss = (R::runif(0, 1) < 0.5) ? -1.0 : 1.0;

        // current position
        double position = walk[i - 1];

        // current step size
        double step = base_step*std::pow( std::abs( position ), step_power );

        // new position
        walk[i] = position + coin_toss*step;

    }

    return walk;
}

[Cover image: tinyography]

Further reading

Eddelbuettel, D., François, R., Allaire, J., Ushey, K., Kou, Q., Russel, N., … Bates, D. (2011). Rcpp: Seamless R and C++ integration. Journal of Statistical Software, 40(8), 1–18.

Oliphant, T. E. (2007). Python for scientific computing. Computing in Science & Engineering, 9(3), 10–20.

R Core Team. (2018). R: A language and environment for statistical computing. https://www.R-project.org/

Stephenson, N. (1999). In the beginning… was the command line. New York: Avon Books.


Support this blog

Economics from the Top Down is where I share my ideas for how to create a better economics. If you liked this post, consider becoming a patron. You’ll help me continue my research, and continue to share it with readers like you.





2 comments

  1. Blair, thanks for the occasional article like this. As one who has lived with computer technology since punch-card days, and who finds it depressing how poorly those outside the field make use of it, it’s refreshing to see someone not only put it to good use, but find excitement in doing so. As you’ve learned, it’s not that difficult. The capability of the technology has certainly improved over the past 50 years, but putting it to use hasn’t changed that much since the early bash days. Articles like these may inspire others to look at expanding their own personal toolkit.
