Chapter 7 Defining your own functions

In this chapter we are going to learn some advanced concepts that will turn you into a full-fledged R programmer. Up to now, you have only used the functions that come with R, as well as the functions contained in packages. We already defined some functions ourselves in Chapter 6, but without going into much detail. In this chapter, we will learn how to build our own functions, and do so in greater depth than before.

7.1 Control flow

Knowing about control flow is essential to build your own functions. Without control flow statements, such as if-else statements or loops (or, in the case of pure functional programming languages, recursion), programming languages would be very limited.

7.1.1 If-else

Imagine you want a variable to be equal to a certain value if a condition is met. This is a typical problem that requires the if ... else ... construct. For instance:

a <- 4
b <- 5

Suppose that if a > b then f should be equal to 20, else f should be equal to 10. Using if ... else ... you can achieve this like so:

if (a > b) {
  f <- 20
} else {
  f <- 10
}

Obviously, here f = 10. Another way to achieve this is by using the ifelse() function:

f <- ifelse(a > b, 20, 10)

if...else... and ifelse() might seem interchangeable, but they're not: ifelse() is vectorized, while if...else... is not. Let's try the following:

ifelse(c(1,2,4) > c(3, 1, 0), "yes", "no")
## [1] "no"  "yes" "yes"

The result is a vector. Now, let’s see what happens if we use if...else... instead of ifelse():

if (c(1, 2, 4) > c(3, 1, 0)) print("yes") else print("no")
## Warning in if (c(1, 2, 4) > c(3, 1, 0)) print("yes") else print("no"): the
## condition has length > 1 and only the first element will be used
## [1] "no"

Only the first element of the atomic vector is used for the comparison. This is very important to keep in mind. Suppose that you want an expression to be evaluated only if every element is TRUE. In this case, you should use the all() function, as seen previously in Chapter 2:

ifelse(all(c(1,2,4) > c(3, 1, 0)), "all elements are greater", "not all elements are greater")
## [1] "not all elements are greater"

You also remember the any() function:

ifelse(any(c(1,2,4) > c(3, 1, 0)), "at least one element is greater", "no element greater")
## [1] "at least one element is greater"

These are the basics. But sometimes, you might need to test for more complex conditions, which can lead to using nested if...else... constructs. These, however, can get messy:

if (10 %% 3 == 0) {
  print("10 is divisible by 3")
} else if (10 %% 2 == 0) {
  print("10 is divisible by 2")
}
## [1] "10 is divisible by 2"

Since 10 is divisible by 2 but not by 3 (the remainders above make this explicit), it is the second sentence that gets printed. Nested constructs like this quickly become hard to read; it is easier to use dplyr::case_when():

case_when(10 %% 3 == 0 ~ "10 is divisible by 3",
          10 %% 2 == 0 ~ "10 is divisible by 2")
## [1] "10 is divisible by 2"

We have already encountered this function in Chapter 4, inside a dplyr::mutate() call to create a new column.

Let’s now discuss loops.

7.1.2 For loops

For loops make it possible to repeat a set of instructions a given number of times. For example, try the following:

for (i in 1:10){
  print("hello")
}
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"

It is also possible to do computations using for loops. Let’s compute the sum of the first 100 integers:

result <- 0
for (i in 1:100){
  result <- result + i
}

print(result)
## [1] 5050

result is equal to 5050, the expected result. What happened in that loop? First, we defined a variable called result and set it to 0. Then, when the loop starts, i equals 1, so we add i to result, which gives 1. Then, i equals 2, and again, we add i to result. But this time, result equals 1 and i equals 2, so now result equals 3, and we repeat this until i equals 100. As explained in the introduction, this is not my preferred way of doing something like that. I actually use loops very rarely. I would solve the above problem using purrr::reduce():

reduce(seq(1,100), `+`)
## [1] 5050

This might look more complex, but once you're used to it, you will not use many loops again. You might wonder then, why use loops at all? I refer you to Hadley Wickham's Advanced R for an in-depth discussion of situations where loops make more sense than higher-order functions such as purrr::map() or purrr::reduce().

7.1.3 While loops

While loops are very similar to for loops. The instructions inside a while loop are repeated while a certain condition holds true. Let’s consider the sum of the first 100 integers again:

result <- 0
i <- 1
while (i<=100){
  result = result + i
  i = i + 1
}

print(result)
## [1] 5050

Here, we first set result to 0 and i to 1. Then, while i is less than or equal to 100, we add i to result. Notice that there is one more line than in the for loop: we need to increment the value of i ourselves; otherwise i would stay equal to 1, the condition would always be fulfilled, and the loop would run forever (well, not really forever: only until your computer runs out of memory, or until the heat death of the universe, whichever comes first).

Now that we know how to write loops, and know about if...else... constructs, we have (almost) all the ingredients to write our own functions.

7.2 Writing your own functions

As you have seen by now, R includes a very large amount of preprogrammed functions, but also many more functions are available in packages. However, there will be a lot of situations where you will need to write your own. In this section we are going to learn how to write our own functions.

7.2.1 Declaring functions in R

Suppose you want to create the following function: \(f(x) = \dfrac{1}{\sqrt{x}}\). Writing this in R is quite simple:

my_function <- function(x){
  1/sqrt(x)
}

The argument of the function, x, gets passed to function() and the body of the function (more on that in the next chapter) contains the function's definition. Of course, you could define functions that use more than one input:

my_function <- function(x, y){
  1/sqrt(x + y)
}

or inputs with names longer than one character:

my_function <- function(argument1, argument2){
  1/sqrt(argument1 + argument2)
}

Functions written by the user get called just the same way as functions included in R:

my_function(1, 10)
## [1] 0.3015113

It is also possible to provide default values to the function’s arguments, which are values that are used if the user omits them:

my_function <- function(argument1, argument2 = 10){
  1/sqrt(argument1 + argument2)
}
my_function(1)
## [1] 0.3015113

This is especially useful for functions with many arguments. Consider also the following example, where the function has a default method:

my_function <- function(argument1, argument2, method = "foo"){
  
  x <- argument1 + argument2
  
  if(method == "foo"){
    1/sqrt(x)
  } else if (method == "bar"){
    "this is a string"
    }
}

my_function(10, 11)
## [1] 0.2182179
my_function(10, 11, "bar")
## [1] "this is a string"

As you see, depending on the “method” chosen, the returned result is either a numeric, or a string. What happens if the user provides a “method” that is neither “foo” nor “bar”?

my_function(10, 11, "spam")

As you can see, nothing happens. It is possible to add safeguards to your function to avoid such situations:

my_function <- function(argument1, argument2, method = "foo"){
  
  if(!(method %in% c("foo", "bar"))){
    return("Method must be either 'foo' or 'bar'")
  }
  
  x <- argument1 + argument2
  
  if(method == "foo"){
    1/sqrt(x)
  } else if (method == "bar"){
    "this is a string"
  }
}

my_function(10, 11)
## [1] 0.2182179
my_function(10, 11, "bar")
## [1] "this is a string"
my_function(10, 11, "foobar")
## [1] "Method must be either 'foo' or 'bar'"

Notice that I have used return() inside my first if statement. This immediately stops evaluation of the function and returns a value. If I had omitted it, evaluation would have continued, since it is the value of the last evaluated expression that gets returned. Remove return(), run the function again, and see what happens. Later, we are going to learn how to add better safeguards to your functions and to avoid runtime errors.
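
Returning a string to signal a problem works, but it mixes error messages with regular results. Below is a minimal sketch of a stricter safeguard, assuming you would rather raise a proper error with base R's stop():

my_function <- function(argument1, argument2, method = "foo"){
  
  # Fail loudly instead of returning an error message as a value
  if(!(method %in% c("foo", "bar"))){
    stop("method must be either 'foo' or 'bar'")
  }
  
  x <- argument1 + argument2
  
  if(method == "foo"){
    1/sqrt(x)
  } else if (method == "bar"){
    "this is a string"
  }
}

With this version, my_function(10, 11, "spam") halts with an error instead of silently returning a string, which plays better with tools like tryCatch().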

While it is generally a good idea to add comments to your functions to explain what they do, I would avoid commenting functions that do something very obvious, such as this one. Give your functions explicit names, of the form function_name(). In mathematics it is standard to give functions single-letter names, but I would advise against doing that in your code. Functions that you write are not special in any way: R treats them like any other function, and they work in conjunction with built-in functions just as if they were part of R.

They have one limitation though (which is shared with R's native functions): just like in mathematics, they can only return one value. However, sometimes you may need to return more than one value. To do this, you must bundle your values into a single object, such as a vector or a list, and return that object. For example:

average_and_sd <- function(x){
  c(mean(x), sd(x))
}

average_and_sd(c(1, 3, 8, 9, 10, 12))
## [1] 7.166667 4.262237

You’re still returning a single object, but it’s a vector. You can also return a named list:

average_and_sd <- function(x){
  list("mean_x" = mean(x), "sd_x" = sd(x))
}

average_and_sd(c(1, 3, 8, 9, 10, 12))
## $mean_x
## [1] 7.166667
## 
## $sd_x
## [1] 4.262237
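
One advantage of returning a named list is that the caller can then extract each element by name:

results <- average_and_sd(c(1, 3, 8, 9, 10, 12))

results$mean_x
## [1] 7.166667
results$sd_x
## [1] 4.262237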

As described before, you can use return() at the end of your functions:

average_and_sd <- function(x){
  result <- c(mean(x), sd(x))
  return(result)
}

average_and_sd(c(1, 3, 8, 9, 10, 12))
## [1] 7.166667 4.262237

But this is only needed if you need to return a value early:

average_and_sd <- function(x){
  if(any(is.na(x))){
    return(NA)
  } else {
    c(mean(x), sd(x))
  }
}

average_and_sd(c(1, 3, 8, 9, 10, 12))
## [1] 7.166667 4.262237
average_and_sd(c(1, 3, NA, 9, 10, 12))
## [1] NA

If you need to use a function from a package inside your function, use the :: syntax:

my_sum <- function(a_vector){
  purrr::reduce(a_vector, `+`)
}

However, if you need to use more than one function, this can become tedious. A quick and dirty alternative is to call library(package_name) inside the function:

my_sum <- function(a_vector){
  library(purrr)
  reduce(a_vector, `+`)
}

Loading the library inside the function has the advantage that you can be sure that the package upon which your function depends will be loaded. If the package is already loaded, it will not be loaded again, and thus will not impact performance; and if you forgot to load it at the beginning of your script, then no worries: your function will load it the first time you use it! However, the very best way is to write your own package and declare the packages upon which your functions depend as dependencies. This is something we are going to explore in Chapter 11.

You can put a lot of instructions inside a function, such as loops. Let's create a function that returns Fibonacci numbers.

7.2.2 Fibonacci numbers

The Fibonacci sequence is the following:

\[1, 1, 2, 3, 5, 8, 13, 21, 34, 55, ...\]

Each subsequent number is the sum of the two preceding ones. In R, it is possible to define a function that returns the \(n^{th}\) Fibonacci number:

my_fibo <- function(n){
 a <- 0
 b <- 1
 for (i in 1:n){
  temp <- b
  b <- a
  a <- a + temp
 }
 a
}

Inside the loop, we defined a variable called temp. Defining temporary variables is usually very useful. Let’s try to understand what happens inside this loop:

  • First, we assign the value 0 to variable a and value 1 to variable b.
  • We start a loop, that goes from 1 to n.
  • We assign the value inside of b to a temporary variable, called temp.
  • b becomes a.
  • We assign the sum of a and temp to a.
  • When the loop is finished, we return a.

What happens if we want the 3rd Fibonacci number? At i = 1 we have, at first, a = 0 and b = 1, then temp = 1, b = 0 and a = 0 + 1. Then i = 2. Now b = 0 and temp = 0. The previous result, a = 0 + 1, is now assigned to b, so b = 1. Then, a = 1 + 0. Finally, i = 3. temp = 1 (because b = 1), the previous result a = 1 is assigned to b and finally, a = 1 + 1. So the third Fibonacci number equals 2. Reading this might be a bit confusing; I strongly advise you to run the algorithm on a sheet of paper, step by step.
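
To double-check the walkthrough, you can compute the first ten Fibonacci numbers in one go, here with base R's sapply() (we will meet {purrr}'s equivalent, map_dbl(), later):

sapply(1:10, my_fibo)
##  [1]  1  1  2  3  5  8 13 21 34 55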

The above algorithm is called an iterative algorithm, because it uses a loop to compute the result. Let’s look at another way to think about the problem, with a so-called recursive function:

fibo_recur <- function(n){
  if (n == 0 || n == 1){
    return(n)
  } else {
    fibo_recur(n-1) + fibo_recur(n-2)
  }
}

This algorithm should be easier to understand: if n = 0 or n = 1 the function should return n (0 or 1). If n is strictly greater than 1, fibo_recur() should return the sum of fibo_recur(n-1) and fibo_recur(n-2). This version of the function is very close to the mathematical definition of the Fibonacci sequence. So why not always use recursive algorithms? Try to run the following:

system.time(my_fibo(30))
##    user  system elapsed 
##   0.005   0.000   0.005

The result should be printed very fast (the system.time() function returns the time that it took to execute my_fibo(30)). Let’s try with the recursive version:

system.time(fibo_recur(30))
##    user  system elapsed 
##   1.179   0.000   1.180

It takes much longer to execute! Recursive algorithms are very CPU-demanding in R, so if speed is critical, it's best to avoid them. Also, in fibo_recur(), try to remove the base case, the line if (n == 0 || n == 1) return(n), and then run fibo_recur(5) and see what happens. You should get an error: recursive algorithms need a stopping condition, or else they would run forever. This is not the case for iterative algorithms, because the stopping condition is built into the loop.
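
As an aside, fibo_recur() is slow because it recomputes the same values over and over again. A common remedy is memoisation: caching results that have already been computed. Below is only a sketch of this idea, using a closure and base R's Recall() (the {memoise} package provides a ready-made solution):

fibo_memo <- local({
  cache <- c(0, 1) # cache[i] holds the (i - 1)th Fibonacci number

  function(n){
    if (n + 1 <= length(cache)) return(cache[n + 1])
    result <- Recall(n - 1) + Recall(n - 2)
    cache[n + 1] <<- result # "<<-" updates the cache in the enclosing environment
    result
  }
})

fibo_memo(30)
## [1] 832040

With the cache in place, each Fibonacci number is computed only once, so fibo_memo(30) returns almost instantly.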

So as you can see, for recursive relationships, for or while loops are the way to go in R, whether you’re writing these loops inside functions or not.

7.3 Exercises

Exercise 1

In this exercise, you will write a function to compute the sum of the first n integers. Combine the algorithm we saw in the section about while loops with what you learned about functions in this section.

Exercise 2

Write a function called my_fact() that computes the factorial of a number n. Do it using a loop, using a recursive function, and using a functional.

Exercise 3

Write a function to find the roots of quadratic equations. Your function should take 3 arguments, a, b and c, and return the two roots. Only consider the case where there are two real roots (delta > 0).

7.4 Functions that take functions as arguments: writing your own higher-order functions

Functions that take functions as arguments are very powerful and useful tools. You already know a couple, purrr::map() and purrr::reduce(), discussed briefly in Chapter 4. But you can also write your own! A very simple example would be the following:

my_func <- function(x, func){
  func(x)
}

my_func() is a very simple function that takes x and func() as arguments and simply executes func(x). This might not seem very useful (after all, you could simply call func(x) directly!), but it is just for illustration purposes; in practice, your functions would do more than that. Let's try to use my_func():

my_func(c(1, 8, 1, 0, 8), mean)
## [1] 3.6

As expected, this returns the mean of the given vector. But now suppose the following:

my_func(c(1, 8, 1, NA, 8), mean)
## [1] NA

Because one element of the vector is NA, the whole mean is NA. mean() has a na.rm argument that you can set to TRUE to ignore the NAs in the vector. However, here, there is no way to provide this argument to mean()! Let's see what happens when we try to:

my_func(c(1, 8, 1, NA, 8), mean, na.rm = TRUE)
Error in my_func(c(1, 8, 1, NA, 8), mean, na.rm = TRUE) :
  unused argument (na.rm = TRUE)

So what you could do is pass the value TRUE to the na.rm argument of mean() from your own function:

my_func <- function(x, func, remove_na){
  func(x, na.rm = remove_na)
}

my_func(c(1, 8, 1, NA, 8), mean, remove_na = TRUE)
## [1] 4.5

This is one solution, but mean() also has another argument called trim. What if some other user needs this argument? Should you add it to your function too? Surely there's a way to avoid this problem? Yes, there is, and it is by using the dots. The ... simply means "any other arguments as needed", and it's very easy to use:

my_func <- function(x, func, ...){
  func(x, ...)
}

my_func(c(1, 8, 1, NA, 8), mean, na.rm = TRUE)
## [1] 4.5

or, now, if you need the trim argument:

my_func(c(1, 8, 1, NA, 8), mean, na.rm = TRUE, trim = 0.1)
## [1] 4.5

The ... are very useful when writing wrappers such as my_func().

7.5 Functions that take columns of data as arguments

7.5.1 The enquo() - !!() approach

In many situations, you will want to write functions that look similar to this:

my_function(my_data, one_column_inside_data)

Such a function would be useful in situations where you have to apply the same operations to columns of different data frames. For example, if you need to create tables of descriptive statistics or graphs periodically, it is very convenient to put these operations inside a function and then call the function whenever you need it, on the fresh batch of data.

However, if you try to write something like that, something that might seem unexpected at first will happen:

data(mtcars)

simple_function <- function(dataset, col_name){
  dataset %>%
    group_by(col_name) %>%
    summarise(mean_speed = mean(speed))
}


simple_function(cars, "dist")
Error: unknown variable to group by : col_name

The variable col_name is passed to simple_function() as a string, but group_by() requires a variable name. So why not try to convert col_name to a name?

simple_function <- function(dataset, col_name){
  col_name <- as.name(col_name)
  dataset %>%
    group_by(col_name) %>%
    summarise(mean_speed = mean(speed))
}


simple_function(cars, "dist")
Error: unknown variable to group by : col_name

This is because R is literally looking for a variable called "dist" somewhere in the global environment, and not as a column of the data. R does not understand that you are referring to the column "dist" that is inside the dataset. So how can we make R understand what we mean?

To be able to do that, we need to use a framework introduced by the {tidyverse} called tidy evaluation, or {tidyeval}. This discussion can get very technical, so I will spare you the details (the {rlang} package documentation covers them in depth). Fortunately, using {tidyeval} is actually quite easy, and you can take a cookbook approach to it. Take a look at the code below:

simple_function <- function(dataset, col_name){
  col_name <- enquo(col_name)
  dataset %>%
    group_by(!!col_name) %>%
    summarise(mean_mpg = mean(mpg))
}


simple_function(mtcars, cyl)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
##     cyl mean_mpg
##   <dbl>    <dbl>
## 1     4     26.7
## 2     6     19.7
## 3     8     15.1

As you can see, our previous idea of using as.name() was not very far from the solution. The solution, with tidyeval, consists in using enquo(), which (for our purposes) does something similar to as.name(). Now that col_name is quoted (as R programmers say), we need to tell group_by() to evaluate the input as is. This is done with !!(), another {tidyeval} construct. I'll say it again: don't worry if you don't understand everything. Just remember to use enquo() on your column names and then !!() inside the {dplyr} function you want to use.

Let’s see some other examples:

simple_function <- function(dataset, col_name, value){
  col_name <- enquo(col_name)
  dataset %>%
    filter((!!col_name) == value) %>%
    summarise(mean_cyl = mean(cyl))
}


simple_function(mtcars, am, 1)
##   mean_cyl
## 1 5.076923

Notice that I’ve written:

filter((!!col_name) == value)

and not:

filter(!!col_name == value)

I have enclosed !!col_name inside parentheses. This is because operators such as == have precedence over !!, so you have to be explicit. Also, notice that I didn't have to quote 1. This is because it's a standard variable, not a column inside the dataset. Let's make this function a bit more general. I hard-coded the variable cyl inside the body of the function, but maybe you'd like the mean of another variable?

simple_function <- function(dataset, filter_col, mean_col, value){
  filter_col <- enquo(filter_col)
  mean_col <- enquo(mean_col)
  dataset %>%
    filter((!!filter_col) == value) %>%
    summarise(mean((!!mean_col)))
}


simple_function(mtcars, am, cyl, 1)
##   mean(cyl)
## 1  5.076923

Notice that I had to quote mean_col too.

Using the ... that we discovered in the previous section, we can pass more than one column:

simple_function <- function(dataset, ...){
  col_vars <- quos(...)
  dataset %>%
    summarise_at(vars(!!!col_vars), funs(mean, sd))
}

Because these dots contain more than one variable, you have to use quos() instead of enquo(). This will put the arguments provided via the dots in a list. Then, because we have a list of columns, we have to use summarise_at(), which you should know if you did the exercises of Chapter 4. So if you didn't do them, go back and finish them first. Doing the exercises will also teach you what vars() and funs() are. The last thing you have to pay attention to is to use !!!() if you used quos(); so 3 ! instead of only 2. This allows you to then do things like this:

simple_function(mtcars, am, cyl, mpg)
## Warning: `funs()` is deprecated as of dplyr 0.8.0.
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
##   am_mean cyl_mean mpg_mean     am_sd   cyl_sd   mpg_sd
## 1 0.40625   6.1875 20.09062 0.4989909 1.785922 6.026948

Using ... with !!!() allows you to write very flexible functions.
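
As an aside, the warning above tells us that funs() is deprecated. If you run a recent {dplyr} (version 1.0.0 or later), the same computation can be written with across(); this is only a sketch under that assumption:

simple_function <- function(dataset, ...){
  dataset %>%
    summarise(across(c(...), list(mean = mean, sd = sd)))
}

simple_function(mtcars, am, cyl, mpg)

The values are the same as before; only the column names and their order differ slightly, since across() groups the statistics by variable (am_mean, am_sd, cyl_mean, and so on).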

If you need to be even more general, you can also provide the summary functions as arguments of your function, but you have to rewrite your function a little bit:

simple_function <- function(dataset, cols, funcs){
  dataset %>%
    summarise_at(vars(!!!cols), funs(!!!funcs))
}

You might be wondering where the quos() went? Well because now we are passing two lists, a list of columns that we have to quote, and a list of functions, that we also have to quote, we need to use quos() when calling the function:

simple_function(mtcars, quos(am, cyl, mpg), quos(mean, sd, sum))
##   am_mean cyl_mean mpg_mean     am_sd   cyl_sd   mpg_sd am_sum cyl_sum mpg_sum
## 1 0.40625   6.1875 20.09062 0.4989909 1.785922 6.026948     13     198   642.9

This works, but I don’t think you’ll need to have that much flexibility; either the columns are variables, or the functions, but rarely both at the same time.

To conclude this section, I should also mention as_label(), which converts a quoted column name into a string. This is useful if you want to name the resulting column yourself, for instance mean_mpg when you compute the mean of the mpg column:

simple_function <- function(dataset, filter_col, mean_col, value){

  filter_col <- enquo(filter_col)
  mean_col <- enquo(mean_col)
  mean_name <- paste0("mean_", as_label(mean_col))
  
  dataset %>%
    filter((!!filter_col) == value) %>%
    summarise(!!(mean_name) := mean((!!mean_col)))
}

Pay attention to the := operator in the last line. It must be used instead of = whenever the column name is built programmatically, as it is here with as_label().
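
Calling this version returns a column whose name was built at run time:

simple_function(mtcars, am, mpg, 1)
##   mean_mpg
## 1 24.39231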

7.5.2 Curly Curly, a simplified approach to the enquo() and !!() framework

The previous section might have been a bit difficult to grasp, but there is a simplified way of doing it, which consists in using {{}}, introduced in {rlang} version 0.4.0. The suggested pronunciation of {{}} is curly-curly, but there is no consensus yet.

Let’s suppose that I need to write a function that takes a data frame, as well as a column from this data frame as arguments, just like before:

how_many_na <- function(dataframe, column_name){
  dataframe %>%
    filter(is.na(column_name)) %>%
    count()
}

Let’s try this function out on the starwars data:

data(starwars)

head(starwars)
## # A tibble: 6 x 14
##   name  height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Luke…    172    77 blond      fair       blue            19   male  mascu…
## 2 C-3PO    167    75 <NA>       gold       yellow         112   none  mascu…
## 3 R2-D2     96    32 <NA>       white, bl… red             33   none  mascu…
## 4 Dart…    202   136 none       white      yellow          41.9 male  mascu…
## 5 Leia…    150    49 brown      light      brown           19   fema… femin…
## 6 Owen…    178   120 brown, gr… light      blue            52   male  mascu…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

As you can see, there are missing values in the hair_color column. Let’s try to count how many missing values are in this column:

how_many_na(starwars, hair_color)
Error: object 'hair_color' not found

Just as expected, this does not work. The issue is that the column is inside the dataframe, but when calling the function with hair_color as the second argument, R is looking for a variable called hair_color that does not exist. What about trying with "hair_color"?

how_many_na(starwars, "hair_color")
## # A tibble: 1 x 1
##       n
##   <int>
## 1     0

Now we get a result, but the wrong one! filter() is testing whether the literal string "hair_color" is NA, which it never is, so the count is 0.

One way to solve this issue is to not use the filter() function, and instead rely on base R:

how_many_na_base <- function(dataframe, column_name){
  na_index <- is.na(dataframe[, column_name])
  nrow(dataframe[na_index, column_name])
}

how_many_na_base(starwars, "hair_color")
## Warning: The `i` argument of ``[`()` can't be a matrix as of tibble 3.0.0.
## Convert to a vector.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## [1] 5

This works, but not using the {tidyverse} at all is not always an option. For instance, the next function, which uses a grouping variable, would be difficult to implement without the {tidyverse}:

summarise_groups <- function(dataframe, grouping_var, column_name){
  dataframe %>%
    group_by(grouping_var) %>%  
    summarise(mean(column_name, na.rm = TRUE))
}

Calling this function results in the following error message, as expected:

Error: Column `grouping_var` is unknown

In the previous section, we solved the issue like so:

summarise_groups <- function(dataframe, grouping_var, column_name){

  grouping_var <- enquo(grouping_var)
  column_name <- enquo(column_name)
  mean_name <- paste0("mean_", as_label(column_name))

  dataframe %>%
    group_by(!!grouping_var) %>%  
    summarise(!!(mean_name) := mean(!!column_name, na.rm = TRUE))
}

The core of the function remained very similar to the version from before, but now one has to use the enquo()-!! syntax.

Now this can be simplified using the new {{}} syntax:

summarise_groups <- function(dataframe, grouping_var, column_name){

  dataframe %>%
    group_by({{grouping_var}}) %>%  
    summarise({{column_name}} := mean({{column_name}}, na.rm = TRUE))
}

Much easier and cleaner! You still have to use the := operator instead of = for the column name, however, and if you want to modify the column name, for instance to return "mean_height" instead of height in this case, you have to keep using the enquo()-!! syntax.
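
As an aside, if you run a recent version of {rlang} (0.4.3 or later), the left-hand side of := also understands glue-style strings, which removes even that last bit of enquo()-!!. A sketch under that assumption:

summarise_groups <- function(dataframe, grouping_var, column_name){

  dataframe %>%
    group_by({{grouping_var}}) %>%
    # "mean_{{column_name}}" is expanded to, e.g., "mean_height"
    summarise("mean_{{column_name}}" := mean({{column_name}}, na.rm = TRUE))
}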

7.6 Functions that use loops

It is entirely possible to put a loop inside a function. For example, consider the following function that returns the square root of a number using Newton's algorithm:

sqrt_newton <- function(a, init = 1, eps = 0.01){
    stopifnot(a >= 0)
    while(abs(init**2 - a) > eps){
        init <- 1/2 *(init + a/init)
    }
    init
}

This function contains a while loop inside its body. Let's see if it works:

sqrt_newton(16)
## [1] 4.000001

In the definition of the function, I wrote init = 1 and eps = 0.01, which means that these arguments can be omitted; they then take the provided values (1 and 0.01) as defaults. You can then use this function as any other, for example with map():

map(c(16, 7, 8, 9, 12), sqrt_newton)
## [[1]]
## [1] 4.000001
## 
## [[2]]
## [1] 2.645767
## 
## [[3]]
## [1] 2.828469
## 
## [[4]]
## [1] 3.000092
## 
## [[5]]
## [1] 3.464616

This is what I meant before with “your functions are nothing special”. Once the function is defined, you can use it like any other base R function.

Notice the use of stopifnot() inside the body of the function. This is a way to return an error in case a condition is not fulfilled. We are going to learn more about this type of function in the next chapter.
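
To see stopifnot() in action, call the function with a negative number; the condition a >= 0 fails and evaluation stops with an error (expected output sketched below):

sqrt_newton(-1)
Error in sqrt_newton(-1) : a >= 0 is not TRUE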

7.7 Anonymous functions

As the name implies, anonymous functions are functions that do not have a name. These are useful inside functions that have functions as arguments, such as purrr::map() or purrr::reduce():

map(c(1,2,3,4), function(x){1/sqrt(x)})
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 0.7071068
## 
## [[3]]
## [1] 0.5773503
## 
## [[4]]
## [1] 0.5

These anonymous functions get defined in a very similar way to regular functions; you just skip the name and that's it. {tidyverse} functions also support formulas, which get converted to anonymous functions:

map(c(1,2,3,4), ~{1/sqrt(.)})
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 0.7071068
## 
## [[3]]
## [1] 0.5773503
## 
## [[4]]
## [1] 0.5

Using a formula instead of an anonymous function is less verbose; you use ~ instead of function(x) and a single dot . instead of x. What if you need an anonymous function that requires more than one argument? This is not a problem:

map2(c(1, 2, 3, 4, 5), c(9, 8, 7, 6, 5), function(x, y){(x**2)/y})
## [[1]]
## [1] 0.1111111
## 
## [[2]]
## [1] 0.5
## 
## [[3]]
## [1] 1.285714
## 
## [[4]]
## [1] 2.666667
## 
## [[5]]
## [1] 5

or, using a formula:

map2(c(1, 2, 3, 4, 5), c(9, 8, 7, 6, 5), ~{(.x**2)/.y})
## [[1]]
## [1] 0.1111111
## 
## [[2]]
## [1] 0.5
## 
## [[3]]
## [1] 1.285714
## 
## [[4]]
## [1] 2.666667
## 
## [[5]]
## [1] 5

Because you now have two arguments, a single dot would be ambiguous, so instead you use .x and .y to avoid confusion.
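
As an aside, if you run R 4.1.0 or later, base R also provides a native shorthand for anonymous functions, \(x), which works anywhere function(x) does:

map2(c(1, 2, 3, 4, 5), c(9, 8, 7, 6, 5), \(x, y){(x**2)/y})

The result is the same list as above.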

You now know a lot about writing your own functions. In the next chapter, we are going to learn about functional programming, the programming paradigm I described in the introduction of this book.

7.8 Exercises

Exercise 1

  • Create the following vector:

\[a = (1,6,7,8,8,9,2)\]

Using a for loop and a while loop, compute the sum of its elements. To avoid issues, use i as the counter inside the for loop, and j as the counter for the while loop.

  • How would you achieve that with a functional (a function that takes a function as an argument)?

Exercise 2

  • Let’s use a loop to get the matrix product of a matrix A and B. Follow these steps to create the loop:
  1. Create matrix A:

\[A = \left( \begin{array}{ccc} 9 & 4 & 12 \\ 5 & 0 & 7 \\ 2 & 6 & 8 \\ 9 & 2 & 9 \end{array} \right) \]

  1. Create matrix B:

\[B = \left( \begin{array}{cccc} 5 & 4 & 2 & 5 \\ 2 & 7 & 2 & 1 \\ 8 & 3 & 2 & 6 \\ \end{array} \right) \]

  1. Create a matrix C, with dimension 4x4 that will hold the result. Use this command: `C = matrix(rep(0,16), nrow = 4)}

  2. Using a for loop, loop over the rows of A first: `for(i in 1:nrow(A))}

  3. Inside this loop, loop over the columns of B: `for(j in 1:ncol(B))}

  4. Again, inside this loop, loop over the rows of B: `for(k in 1:nrow(B))}

  5. Inside this last loop, compute the result and save it inside C: `C[i,j] = C[i,j] + A[i,k] * B[k,j]}

  6. Now write a function that takes two matrices as arguments, and returns their product.

  • R has a built-in function to compute the dot product of 2 matrices. Which is it?

Exercise 3

  • Fizz Buzz: Print integers from 1 to 100. If a number is divisible by 3, print the word "Fizz"; if it's divisible by 5, print "Buzz". Use a for loop and if statements.

  • Write a function that takes an integer as argument, and prints "Fizz" or "Buzz" up to that integer.

Exercise 4

  • Fizz Buzz 2: Same as above, but now add this third condition: if a number is divisible by both 3 and 5, print "FizzBuzz".

  • Write a function that takes an integer as argument, and prints Fizz, Buzz or FizzBuzz up to that integer.

Functional programming

Functional programming is a paradigm that I find very suitable for data science. In functional programming, your code is organised into functions that perform the operations you need. Your scripts will only be a sequence of calls to these functions, making them easier to understand. R is not a pure functional programming language, so we need some self-discipline to apply pure functional programming principles. However, these efforts are worth it, because pure functions are easier to debug, extend and document. In this chapter, we are going to learn about functional programming principles that you can adopt and start using to make your code better.

7.9 Function definitions

You should now be familiar with function definitions in R. Let’s suppose you want to write a function to compute the square root of a number and want to do so using Newton’s algorithm:

sqrt_newton <- function(a, init, eps = 0.01){
    while(abs(init**2 - a) > eps){
        init <- 1/2 *(init + a/init)
    }
    init
}

You can then use this function to get the square root of a number:

sqrt_newton(16, 2)
## [1] 4.00122

We are using a while loop inside the body of the function. The body of a function is the set of instructions that define it; you can get the body of a function with body(some_func). In pure functional programming languages, like Haskell, loops do not exist. How can you program without loops, you may ask? In functional programming, loops are replaced by recursion, which we already discussed in the previous chapter. Let's rewrite our little example above with recursion:

sqrt_newton_recur <- function(a, init, eps = 0.01){
    if(abs(init**2 - a) < eps){
        result <- init
    } else {
        init <- 1/2 * (init + a/init)
        result <- sqrt_newton_recur(a, init, eps)
    }
    result
}
sqrt_newton_recur(16, 2)
## [1] 4.00122

R is not a pure functional programming language though, so we can still use loops (be it while or for loops) in the bodies of our functions. As discussed in the previous chapter, it is actually better, performance-wise, to use loops instead of recursion, because R is not tail-call optimized. I won't go into the details of what tail-call optimization is, but just remember that if performance is important, a loop will be faster. However, sometimes it is easier to write a function using recursion. I personally tend to avoid loops if performance is not important, because I find that code that avoids loops is easier to read and debug. However, knowing that you can use loops is reassuring, and encapsulating loops inside functions gives you the benefits of both functions and loops. In the coming sections I will show you some built-in functions that make it possible to avoid writing loops and that don't rely on recursion, so performance won't be penalized.

7.10 Properties of functions

Mathematical functions have a nice property: we always get the same output for a given input. This is called referential transparency and we should aim to write our R functions in such a way. For example, the following function:

increment <- function(x){
    x + 1
}

is a referentially transparent function: we always get the same result for any x that we give to it.

This:

increment(10)
## [1] 11

will always produce 11.

However, this one:

increment_opaque <- function(x){
    x + spam
}

is not a referentially transparent function, because its value depends on the global variable spam.

spam <- 1

increment_opaque(10)
## [1] 11

will produce 11 if spam = 1. But what if spam = 19?

spam <- 19

increment_opaque(10)
## [1] 29

To make increment_opaque() referentially transparent, it is enough to make spam an argument:

increment_not_opaque <- function(x, spam){
    x + spam
}

Now even if there is a global variable called spam, this will not influence our function:

spam <- 19

increment_not_opaque(10, 34)
## [1] 44

This is because the variable spam defined in the body of the function is a local variable. It could have been called anything else, really. Avoiding opaque functions makes our life easier.

Another property that adepts of functional programming value is that functions should have no, or very limited, side-effects. This means that functions should not change the state of your program.

For example, this function (which is not referentially transparent):

count_iter <- 0

sqrt_newton_side_effect <- function(a, init, eps = 0.01){
    while(abs(init**2 - a) > eps){
        init <- 1/2 *(init + a/init)
        # "<<-" assigns the value on the right to a variable
        # living in the global environment
        count_iter <<- count_iter + 1
    }
    init
}

If you look in the environment pane, you will see that count_iter equals 0. Now call this function with the following arguments:

sqrt_newton_side_effect(16000, 2)
## [1] 126.4911
print(count_iter)
## [1] 9

If you check the value of count_iter now, you will see that it increased! This is a side effect, because the function changed something outside of its scope. It changed a value in the global environment. In general, it is good practice to avoid side-effects. For example, we could make the above function not have any side effects like this:

sqrt_newton_count <- function(a, init, count_iter = 0, eps = 0.01){
    while(abs(init**2 - a) > eps){
        init <- 1/2 *(init + a/init)
        count_iter <- count_iter + 1
    }
    c(init, count_iter)
}

Now, this function returns a vector with two elements: the result, and the number of iterations it took to get that result:

sqrt_newton_count(16000, 2)
## [1] 126.4911   9.0000

Writing to disk is also considered a side effect, because the function changes something (a file) outside its scope. But this cannot be avoided if you want to write to disk. Just remember: try to avoid functions that change variables in the global environment, unless you have a very good reason for doing so.

Very long scripts that don’t use functions and use a lot of global variables with loops changing the values of global variables are a nightmare to debug. If something goes wrong, it might be very difficult to pinpoint where the problem is. Is there an error in one of the loops? Is your code running for a particular value of a particular variable in the global environment, but not for other values? Which values? And of which variables? It can be very difficult to know what is wrong with such a script. With functional programming, you can avoid a lot of this pain for free (well not entirely for free, it still requires some effort, since R is not a pure functional language). Writing functions also makes it easier to parallelize your code. We are going to learn about that later in this chapter too.

Finally, another property of mathematical functions, is that they do one single thing. Functional programming purists also program their functions to do one single task. This has benefits, but can complicate things. The function we wrote previously does two things: it computes the square root of a number and also returns the number of iterations it took to compute the result. However, this is not a bad thing; the function is doing two tasks, but these tasks are related to each other and it makes sense to have them together. My piece of advice: avoid having functions that do many unrelated things. This makes debugging harder.

In conclusion: you should strive for referential transparency, try to avoid side effects unless you have a good reason to have them, and try to keep your functions short and focused on as few tasks as possible. This makes testing and debugging easier, as you will see in the next chapter, but also improves the readability and maintainability of your code.

7.11 Functional programming with {purrr}

I mentioned it several times already, but R is not a pure functional programming language. It is possible to write R code using the functional programming paradigm, but some effort is required. The {purrr} package extends R’s base functional programming capabilities with some very interesting functions. We have already seen map() and reduce(), which we are going to see in more detail now. Then, we are going to learn about some other functions included in {purrr} that make functional programming easier in R.

7.11.1 Doing away with loops: the map*() family of functions

Instead of using loops, pure functional programming languages use functions that achieve the same result. These functions are often called Map or Reduce (also called Fold). R comes with the *apply() family of functions (which are implementations of Map), as well as Reduce() for functional programming.

Within this family, you can find lapply(), sapply(), vapply(), tapply(), mapply(), rapply(), eapply() and apply() (I might have forgotten one or the other, but that’s not important). Each version of an *apply() function has a different purpose, but it is not very easy to remember which does what exactly. To add even more confusion, the arguments are sometimes different between each of these.

In the {purrr} package, these functions are replaced by the map*() family of functions. As you will shortly see, they are very consistent, and thus easier to use. The names of these functions all start with map_ and the second part tells you what the function is going to return. For example, if you want doubles out, you would use map_dbl(). If you are working on data frames and want a data frame back, you would use map_df(). Let's start with the basic map() function.

Suppose \(X\) is a vector composed of the following scalars: \((0, 5, 8, 3, 2, 1)\), and that the function we want to map to each element of \(X\) is \(f(x) = x + 1\). \(X'\) is the result of this operation. Using R, we would do the following:

library("purrr")
numbers <- c(0, 5, 8, 3, 2, 1)

plus_one <- function(x) (x + 1)

my_results <- map(numbers, plus_one)

my_results
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 6
## 
## [[3]]
## [1] 9
## 
## [[4]]
## [1] 4
## 
## [[5]]
## [1] 3
## 
## [[6]]
## [1] 2

Using a loop, you would write:

numbers <- c(0, 5, 8, 3, 2, 1)

plus_one <- function(x) (x + 1)

my_results <- vector("list", 6)

for(number in seq_along(numbers)){
  my_results[[number]] <- plus_one(numbers[number])
}

my_results
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 6
## 
## [[3]]
## [1] 9
## 
## [[4]]
## [1] 4
## 
## [[5]]
## [1] 3
## 
## [[6]]
## [1] 2

Now I don’t know about you, but I prefer the first option. Using functional programming, you don’t need to create an empty list to hold your results, and the code is more concise. Plus, it is less error prone. I had to try several times to get the loop right (and I’ve using R for almost 10 years now). Why? Well, first of all I used %in% instead of in. Then, I forgot about seq_along(). After that, I made a typo, plos_one() instead of plus_one() (ok, that one is unrelated to the loop). Let’s also see how this works using base R:

numbers <- c(0, 5, 8, 3, 2, 1)

plus_one <- function(x) (x + 1)

my_results <- lapply(numbers, plus_one)

my_results
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 6
## 
## [[3]]
## [1] 9
## 
## [[4]]
## [1] 4
## 
## [[5]]
## [1] 3
## 
## [[6]]
## [1] 2

So what is the added value of using {purrr}, you might ask? Well, imagine that instead of a list, I need an atomic vector of doubles. This is fairly easy with {purrr}:

library("purrr")
numbers <- c(0, 5, 8, 3, 2, 1)

plus_one <- function(x) (x + 1)

my_results <- map_dbl(numbers, plus_one)

my_results
## [1] 1 6 9 4 3 2

We’re going to discuss these functions below, but know that in base R, outputting something else involves more effort.

Let’s go back to our sqrt_newton() function. This function has more than one parameter. Often, we would like to map functions with more than one parameter to a list, while holding constant some of the functions parameters. This is easily achieved like so:

library("purrr")
numbers <- c(7, 8, 19, 64)

map(numbers, sqrt_newton, init = 1)
## [[1]]
## [1] 2.645767
## 
## [[2]]
## [1] 2.828469
## 
## [[3]]
## [1] 4.358902
## 
## [[4]]
## [1] 8.000002

It is also possible to use a formula:

library("purrr")
numbers <- c(7, 8, 19, 64)

map(numbers, ~sqrt_newton(., init = 1))
## [[1]]
## [1] 2.645767
## 
## [[2]]
## [1] 2.828469
## 
## [[3]]
## [1] 4.358902
## 
## [[4]]
## [1] 8.000002

Another function that is similar to map() is rerun(). You guessed it, this one simply reruns an expression:

rerun(10, "hello")
## [[1]]
## [1] "hello"
## 
## [[2]]
## [1] "hello"
## 
## [[3]]
## [1] "hello"
## 
## [[4]]
## [1] "hello"
## 
## [[5]]
## [1] "hello"
## 
## [[6]]
## [1] "hello"
## 
## [[7]]
## [1] "hello"
## 
## [[8]]
## [1] "hello"
## 
## [[9]]
## [1] "hello"
## 
## [[10]]
## [1] "hello"

rerun() simply runs an expression (which can be arbitrarily complex) n times, whereas map() maps a function to a list of inputs, so to achieve the same with map(), you need to map the print() function to a vector of characters:

map(rep("hello", 10), print)
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [[1]]
## [1] "hello"
## 
## [[2]]
## [1] "hello"
## 
## [[3]]
## [1] "hello"
## 
## [[4]]
## [1] "hello"
## 
## [[5]]
## [1] "hello"
## 
## [[6]]
## [1] "hello"
## 
## [[7]]
## [1] "hello"
## 
## [[8]]
## [1] "hello"
## 
## [[9]]
## [1] "hello"
## 
## [[10]]
## [1] "hello"

rep() is a function that creates a vector by repeating something, in this case the string "hello", as many times as needed, here 10. The output here is a bit different from before, though: first you see "hello" printed 10 times, and then the list where each element is "hello". This is because the print() function has a side effect, which is, well, printing to the console. We see this side effect 10 times, plus the list created with map().

rerun() is useful if you want to run simulations. For instance, let's suppose that I perform a simulation where I throw a die 5 times, and compute the mean of the points obtained, as well as the variance:

mean_var_throws <- function(n){
  throws <- sample(1:6, n, replace = TRUE)

  mean_throws <- mean(throws)
  var_throws <- var(throws)

  tibble::tribble(~mean_throws, ~var_throws,
                   mean_throws, var_throws)
}

mean_var_throws(5)
## # A tibble: 1 x 2
##   mean_throws var_throws
##         <dbl>      <dbl>
## 1         3.2        2.7

mean_var_throws() returns a tibble object containing the mean and the variance of the points. Now suppose I want to compute the expected value of the distribution of die throws. We know from theory that it should be equal to \(3.5 \, (= 1 \cdot 1/6 + 2 \cdot 1/6 + 3 \cdot 1/6 + 4 \cdot 1/6 + 5 \cdot 1/6 + 6 \cdot 1/6)\).

Let’s rerun the simulation 50 times:

simulations <- rerun(50, mean_var_throws(5))

Let’s see what the simulations object is made of:

str(simulations)
## List of 50
##  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  2 variables:
##   ..$ mean_throws: num 2
##   ..$ var_throws : num 3
##  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  2 variables:
##   ..$ mean_throws: num 2.8
##   ..$ var_throws : num 0.2
##  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  2 variables:
##   ..$ mean_throws: num 2.8
##   ..$ var_throws : num 0.7
##  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  2 variables:
##   ..$ mean_throws: num 2.8
##   ..$ var_throws : num 1.7
.....

simulations is a list of 50 data frames. We can easily combine them into a single data frame, and compute the mean of the means, which should return something close to the expected value of 3.5:

bind_rows(simulations) %>%
  summarise(expected_value = mean(mean_throws))
## # A tibble: 1 x 1
##   expected_value
##            <dbl>
## 1           3.33

Pretty close! Now of course, one could have simply done something like this:

mean(sample(1:6, 1000, replace = TRUE))
## [1] 3.578

but the point was to illustrate that rerun() can run any arbitrarily complex expression, and that it is good practice to put the result in a data frame or list, for easier further manipulation.

You now know the standard map() function, and also rerun(), both of which return lists, but there are a number of variants. map_dbl() returns an atomic vector of doubles, as we've seen before. A little reminder below:

map_dbl(numbers, sqrt_newton, init = 1)
## [1] 2.645767 2.828469 4.358902 8.000002

In a similar fashion, map_chr() returns an atomic vector of strings:

map_chr(numbers, sqrt_newton, init = 1)
## [1] "2.645767" "2.828469" "4.358902" "8.000002"

map_lgl() returns an atomic vector of TRUE or FALSE:

divisible <- function(x, y){
  if_else(x %% y == 0, TRUE, FALSE)
}

map_lgl(seq(1:100), divisible, 3)
##   [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
##  [13] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
##  [25] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
##  [37] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
##  [49] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
##  [61] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
##  [73] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
##  [85] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
##  [97] FALSE FALSE  TRUE FALSE

There are also other interesting variants, such as map_if():

a <- seq(1,10)

map_if(a, (function(x) divisible(x, 2)), sqrt)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 1.414214
## 
## [[3]]
## [1] 3
## 
## [[4]]
## [1] 2
## 
## [[5]]
## [1] 5
## 
## [[6]]
## [1] 2.44949
## 
## [[7]]
## [1] 7
## 
## [[8]]
## [1] 2.828427
## 
## [[9]]
## [1] 9
## 
## [[10]]
## [1] 3.162278

I used map_if() to take the square root of only those numbers in vector a that are divisible by 2, by using an anonymous function that checks if a number is divisible by 2 (by wrapping divisible()).

map_at() is similar to map_if() but maps the function at a position specified by the user:

map_at(numbers, c(1, 3), sqrt)
## [[1]]
## [1] 2.645751
## 
## [[2]]
## [1] 8
## 
## [[3]]
## [1] 4.358899
## 
## [[4]]
## [1] 64

or if you have a named list:

recipe <- list("spam" = 1, "eggs" = 3, "bacon" = 10)

map_at(recipe, "bacon", `*`, 2)
## $spam
## [1] 1
## 
## $eggs
## [1] 3
## 
## $bacon
## [1] 20

I used map_at() to double the quantity of bacon in the recipe (by using the `*` function and specifying its second argument, 2; try `*`(3, 4) in the console to see how this works).

map2() is the equivalent of mapply() and pmap() is the generalisation of map2() for more than 2 arguments:

print(a)
##  [1]  1  2  3  4  5  6  7  8  9 10
b <- seq(1, 2, length.out = 10)

print(b)
##  [1] 1.000000 1.111111 1.222222 1.333333 1.444444 1.555556 1.666667 1.777778
##  [9] 1.888889 2.000000
map2(a, b, `*`)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2.222222
## 
## [[3]]
## [1] 3.666667
## 
## [[4]]
## [1] 5.333333
## 
## [[5]]
## [1] 7.222222
## 
## [[6]]
## [1] 9.333333
## 
## [[7]]
## [1] 11.66667
## 
## [[8]]
## [1] 14.22222
## 
## [[9]]
## [1] 17
## 
## [[10]]
## [1] 20

Each element of a gets multiplied by the element of b that is in the same position. Let’s see what pmap() does. Can you guess from the code below what is going on? I will print a and b again for clarity:

a
##  [1]  1  2  3  4  5  6  7  8  9 10
b
##  [1] 1.000000 1.111111 1.222222 1.333333 1.444444 1.555556 1.666667 1.777778
##  [9] 1.888889 2.000000
n <- seq(1:10)

pmap(list(a, b, n), rnorm)
## [[1]]
## [1] 0.5280967
## 
## [[2]]
## [1] -1.270939  2.311809
## 
## [[3]]
## [1]  1.3947087  5.7967488 -0.3431847
## 
## [[4]]
## [1] 4.84305793 0.08361189 5.88116214 3.25397461
## 
## [[5]]
## [1]  3.770117 11.832005 -3.957980 -1.731193  4.098926
## 
## [[6]]
## [1]  2.1400937  0.1393704 12.1352462 10.2759825  4.4688028 -0.2094596
## 
## [[7]]
## [1]  14.057747   7.430422   6.315045  -1.958189  -4.167744 -10.046114 -11.557698
## 
## [[8]]
## [1] 17.7475507  2.2137185  0.5835247 13.1073284 -3.3987124 -0.8162384  3.8022779
## [8]  2.1491597
## 
## [[9]]
## [1]   9.2225601   0.1596010 -14.6581525   8.2725150 -18.5914485  -2.5777416
## [7]   0.1888828  18.6061008  -1.1331377
## 
## [[10]]
##  [1] -7.792564  1.903289 10.412377 11.689196  9.696147 -1.231598  3.400830
##  [8] -7.615827 11.484893 -3.031786

Let’s take a closer look at what a, b and n look like, when they are place next to each other:

cbind(a, b, n)
##        a        b  n
##  [1,]  1 1.000000  1
##  [2,]  2 1.111111  2
##  [3,]  3 1.222222  3
##  [4,]  4 1.333333  4
##  [5,]  5 1.444444  5
##  [6,]  6 1.555556  6
##  [7,]  7 1.666667  7
##  [8,]  8 1.777778  8
##  [9,]  9 1.888889  9
## [10,] 10 2.000000 10

rnorm() first gets called with the parameters from the first line, meaning rnorm(a[1], b[1], n[1]). The second time rnorm() gets called, you guessed it, it is with the parameters on the second line of the array above, rnorm(a[2], b[2], n[2]), and so on.

There are other functions in the map() family of functions, but we will discover them in the exercises!

The map() family of functions does not have any more secrets for you. Let’s now take a look at the reduce() family of functions.

7.11.2 Reducing with purrr

Reducing is another important concept in functional programming. It allows going from a list of elements, to a single element, by somehow combining the elements into one. For instance, using the base R Reduce() function, you can sum the elements of a list like so:

Reduce(`+`, seq(1:100))
## [1] 5050

using purrr::reduce(), this becomes:

reduce(seq(1:100), `+`)
## [1] 5050

If you don’t really get what happening, don’t worry. Things should get clearer once I’ll introduce another version of reduce(), called accumulate(), which we will see below.

Sometimes, the direction from which we start to reduce is quite important. You can “start from the end” of the list by using the .dir argument:

reduce(seq(1:100), `+`, .dir = "backward")
## [1] 5050

Of course, for commutative operations, direction does not matter. But it does matter for non-commutative ones: reducing forward computes \(((1-2)-3)-\dots\), while reducing backward computes \(1-(2-(3-\dots))\):

reduce(seq(1:100), `-`)
## [1] -5048
reduce(seq(1:100), `-`, .dir = "backward")
## [1] -50

Let’s now take a look at accumulate(). accumulate() is very similar to map(), but keeps the intermediary results. Which intermediary results? Let’s try and see what happens:

a <- seq(1, 10)

accumulate(a, `-`)
##  [1]   1  -1  -4  -8 -13 -19 -26 -34 -43 -53

accumulate() illustrates pretty well what is happening; the first element, 1, is simply the first element of seq(1, 10). The second element of the result however, is the difference between 1 and 2, -1. The next element in a is 3. Thus the next result is -1-3, -4, and so on until we run out of elements in a.

The below illustration shows the algorithm step-by-step:

(1-2-3-4-5-6-7-8-9-10)
((1)-2-3-4-5-6-7-8-9-10)
((1-2)-3-4-5-6-7-8-9-10)
((-1-3)-4-5-6-7-8-9-10)
((-4-4)-5-6-7-8-9-10)
((-8-5)-6-7-8-9-10)
((-13-6)-7-8-9-10)
((-19-7)-8-9-10)
((-26-8)-9-10)
((-34-9)-10)
(-43-10)
-53

reduce() only shows the final result of all these operations. accumulate() and reduce() also have an .init argument, that makes it possible to start the reducing procedure from an initial value that is different from the first element of the vector:

reduce(a, `+`, .init = 1000)
## [1] 1055

accumulate(a, `-`, .init = 1000, .dir = "backward")
##  [1]  995 -994  996 -993  997 -992  998 -991  999 -990 1000

map() and reduce() are arguably the most useful higher-order functions, and perhaps also the most famous ones, true ambassadors of functional programming. You might have read about MapReduce, a programming model for processing big data in parallel. The way MapReduce works is inspired by these map() and reduce() functions, which are always included in functional programming languages. This illustrates that the functional programming paradigm is very well suited to parallel computing. In the next chapter, we are going to explore parallel computing with the {furrr} package; as you will see, once you know about functional programming and are familiar with {purrr}, parallelizing your code with {furrr} is trivial.

Something else that is very important to understand at this point: up until now, we only used these functions on lists, or atomic vectors, of numbers. However, map() and reduce(), and other higher-order functions for that matter, do not care about the contents of the list. What these functions do is take another function and apply it to the elements of the list. It does not matter if it’s a list of numbers, of characters, of data frames, even of models. All that matters is that the function that will be applied to these elements can operate on them. So if you have a list of fitted models, you can map summary() over this list to get the results of each model. Or if you have a list of data frames, you can map a function that performs several cleaning steps. This will be explored in a future section, but it is important to keep this in mind.
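
For instance, here is a small sketch (the two models are made up for illustration) of map() applied to a list of fitted models rather than numbers:

# Two illustrative models fitted on mtcars
models <- list(
  lm(mpg ~ wt, data = mtcars),
  lm(mpg ~ hp, data = mtcars)
)

# map() applies summary() to each fitted model in the list
map(models, summary)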

7.11.3 Error handling with safely() and possibly()

safely() and possibly() are very useful functions. Consider the following situation:

a <- list("a", 4, 5)

sqrt(a)
Error in sqrt(a) : non-numeric argument to mathematical function

Using map() or Map() will result in a similar error. safely() is a higher-order function that takes one function as an argument and executes it… safely, meaning the execution of the function will not stop if there is an error. The error message gets captured alongside valid results.

a <- list("a", 4, 5)

safe_sqrt <- safely(sqrt)

map(a, safe_sqrt)
## [[1]]
## [[1]]$result
## NULL
## 
## [[1]]$error
## <simpleError in .Primitive("sqrt")(x): non-numeric argument to mathematical function>
## 
## 
## [[2]]
## [[2]]$result
## [1] 2
## 
## [[2]]$error
## NULL
## 
## 
## [[3]]
## [[3]]$result
## [1] 2.236068
## 
## [[3]]$error
## NULL

possibly() works similarly, but also allows you to specify a return value in case of an error:

possible_sqrt <- possibly(sqrt, otherwise = NA_real_)

map(a, possible_sqrt)
## [[1]]
## [1] NA
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 2.236068

Of course, in this particular example, the same effect could be obtained way more easily:

sqrt(as.numeric(a))
## Warning: NAs introduced by coercion
## [1]       NA 2.000000 2.236068

However, in some situations, this trick does not work as intended (or at all). possibly() and safely() allow the programmer to model errors explicitly, and to then provide a consistent way of dealing with them. For instance, consider the following example:

data(mtcars)

write.csv(mtcars, "my_data/mtcars.csv")
Error in file(file, ifelse(append, "a", "w")) : 
  cannot open the connection
In addition: Warning message:
In file(file, ifelse(append, "a", "w")) :
  cannot open file 'my_data/mtcars.csv': No such file or directory

The folder my_data/ does not exist, and as such this code produces an error. You might want to catch this error, and create the directory for instance:

possibly_write.csv <- possibly(write.csv, otherwise = NULL)

if(is.null(possibly_write.csv(mtcars, "my_data/mtcars.csv"))) {
  print("Creating folder...")
  dir.create("my_data/")
  print("Saving file...")
  write.csv(mtcars, "my_data/mtcars.csv")
}
[1] "Creating folder..."
[1] "Saving file..."
Warning message:
In file(file, ifelse(append, "a", "w")) :
  cannot open file 'my_data/mtcars.csv': No such file or directory

The warning message comes from the first time we try to write the .csv, inside the if statement. In the exercises, you’ll discover quietly(), which also captures warnings and messages.
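
As a small teaser, here is a sketch of what quietly() does (the wrapped function and input are arbitrary):

# quietly() wraps a function so that printed output, warnings and
# messages get captured instead of being emitted
quiet_sqrt <- quietly(sqrt)

quiet_sqrt(-1)
# returns a list with $result (NaN), $output, $warnings ("NaNs produced")
# and $messages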

7.11.4 Partial applications with partial()

Consider the following simple function:

add <- function(a, b) a+b

It is possible to create a new function, where one of the parameters is fixed, for instance, where a = 10:

add_to_10 <- partial(add, a = 10)
add_to_10(12)
## [1] 22

This is equivalent to the following:

add_to_10_2 <- function(b){
  add(a = 10, b)
}

Using partial() is much less verbose, however, and allows you to define new functions very quickly:

head10 <- partial(head, n = 10)

head10(mtcars)
##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

7.11.5 Function composition using compose()

Function composition is another handy tool, which makes chaining functions much more elegant:

compose(sqrt, log10, exp)(10)
## [1] 2.083973

The above expression is equivalent to:

sqrt(log10(exp(10)))
## [1] 2.083973

It is also possible to reverse the order in which the functions get called:

compose(sqrt, log10, exp)(10)
## [1] 2.083973
compose(sqrt, log10, exp, .dir="forward")(10)
## [1] 1.648721

One could also use the %>% operator:

10 %>%
  sqrt %>%
  log10 %>%
  exp
## [1] 1.648721

7.11.6 «Transposing lists»

Another interesting function is transpose(). It is not an alternative to the function t() from base R, but it has a similar effect. transpose() works on lists. Let’s take a look at the example from before:

safe_sqrt <- safely(sqrt, otherwise = NA_real_)

map(a, safe_sqrt)
## [[1]]
## [[1]]$result
## [1] NA
## 
## [[1]]$error
## <simpleError in .Primitive("sqrt")(x): non-numeric argument to mathematical function>
## 
## 
## [[2]]
## [[2]]$result
## [1] 2
## 
## [[2]]$error
## NULL
## 
## 
## [[3]]
## [[3]]$result
## [1] 2.236068
## 
## [[3]]$error
## NULL

The output is a list where each element is itself a list, containing a result and an error. One might want to have all the results in a single list, and all the error messages in another list. This is possible with transpose():

purrr::transpose(map(a, safe_sqrt))
## $result
## $result[[1]]
## [1] NA
## 
## $result[[2]]
## [1] 2
## 
## $result[[3]]
## [1] 2.236068
## 
## 
## $error
## $error[[1]]
## <simpleError in .Primitive("sqrt")(x): non-numeric argument to mathematical function>
## 
## $error[[2]]
## NULL
## 
## $error[[3]]
## NULL

I explicitly call purrr::transpose() because there is also a data.table::transpose(), which is not the same function. You have to be careful about that sort of thing, because it can cause errors in your programs, and debugging this type of error is a nightmare.

Now that we are familiar with functional programming, let’s try to apply some of its principles to data manipulation.

7.12 List-based workflows for efficiency

Once you have written a function, you can easily use it inside a pipe workflow:

double_number <- function(x){
  x+x
}
mtcars %>%
  mutate(double_mpg = double_number(mpg))
##     mpg cyl  disp  hp drat    wt  qsec vs am gear carb double_mpg
## 1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4       42.0
## 2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4       42.0
## 3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1       45.6
## 4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1       42.8
## 5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2       37.4
## 6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1       36.2
## 7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4       28.6
## 8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2       48.8
## 9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2       45.6
## 10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4       38.4
## 11 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4       35.6
## 12 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3       32.8
## 13 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3       34.6
## 14 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3       30.4
## 15 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4       20.8
## 16 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4       20.8
## 17 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4       29.4
## 18 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1       64.8
## 19 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2       60.8
## 20 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1       67.8
## 21 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1       43.0
## 22 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2       31.0
## 23 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2       30.4
## 24 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4       26.6
## 25 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2       38.4
## 26 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1       54.6
## 27 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2       52.0
## 28 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2       60.8
## 29 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4       31.6
## 30 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6       39.4
## 31 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8       30.0
## 32 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2       42.8

Granted, this example is trivial, but it shows you, again, that the functions you define are nothing special. You can use them just like any other function.

Now, let’s step back for a bit and think about what we learned up until now, and especially the map() family of functions.

Let’s read the list of datasets from the previous chapter:

paths <- Sys.glob("datasets/unemployment/*.csv")

all_datasets <- import_list(paths)

str(all_datasets)
## List of 4
##  $ unemp_2013:'data.frame':  118 obs. of  8 variables:
##   ..$ Commune                   : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
##   ..$ Total employed population : int [1:118] 223407 17802 1703 844 1431 4094 2146 971 1218 3002 ...
##   ..$ of which: Wage-earners    : int [1:118] 203535 15993 1535 750 1315 3800 1874 858 1029 2664 ...
##   ..$ of which: Non-wage-earners: int [1:118] 19872 1809 168 94 116 294 272 113 189 338 ...
##   ..$ Unemployed                : int [1:118] 19287 1071 114 25 74 261 98 45 66 207 ...
##   ..$ Active population         : int [1:118] 242694 18873 1817 869 1505 4355 2244 1016 1284 3209 ...
##   ..$ Unemployment rate (in %)  : num [1:118] 7.95 5.67 6.27 2.88 4.92 ...
##   ..$ Year                      : int [1:118] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ unemp_2014:'data.frame':  118 obs. of  8 variables:
##   ..$ Commune                   : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
##   ..$ Total employed population : int [1:118] 228423 18166 1767 845 1505 4129 2172 1007 1268 3124 ...
##   ..$ of which: Wage-earners    : int [1:118] 208238 16366 1606 757 1390 3840 1897 887 1082 2782 ...
##   ..$ of which: Non-wage-earners: int [1:118] 20185 1800 161 88 115 289 275 120 186 342 ...
##   ..$ Unemployed                : int [1:118] 19362 1066 122 19 66 287 91 38 61 202 ...
##   ..$ Active population         : int [1:118] 247785 19232 1889 864 1571 4416 2263 1045 1329 3326 ...
##   ..$ Unemployment rate (in %)  : num [1:118] 7.81 5.54 6.46 2.2 4.2 ...
##   ..$ Year                      : int [1:118] 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
##  $ unemp_2015:'data.frame':  118 obs. of  8 variables:
##   ..$ Commune                   : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
##   ..$ Total employed population : int [1:118] 233130 18310 1780 870 1470 4130 2170 1050 1300 3140 ...
##   ..$ of which: Wage-earners    : int [1:118] 212530 16430 1620 780 1350 3820 1910 920 1100 2770 ...
##   ..$ of which: Non-wage-earners: int [1:118] 20600 1880 160 90 120 310 260 130 200 370 ...
##   ..$ Unemployed                : int [1:118] 18806 988 106 29 73 260 80 41 72 169 ...
##   ..$ Active population         : int [1:118] 251936 19298 1886 899 1543 4390 2250 1091 1372 3309 ...
##   ..$ Unemployment rate (in %)  : num [1:118] 7.46 5.12 5.62 3.23 4.73 ...
##   ..$ Year                      : int [1:118] 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ unemp_2016:'data.frame':  118 obs. of  8 variables:
##   ..$ Commune                   : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
##   ..$ Total employed population : int [1:118] 236100 18380 1790 870 1470 4160 2160 1030 1330 3150 ...
##   ..$ of which: Wage-earners    : int [1:118] 215430 16500 1640 780 1350 3840 1900 900 1130 2780 ...
##   ..$ of which: Non-wage-earners: int [1:118] 20670 1880 150 90 120 320 260 130 200 370 ...
##   ..$ Unemployed                : int [1:118] 18185 975 91 27 66 246 76 35 70 206 ...
##   ..$ Active population         : int [1:118] 254285 19355 1881 897 1536 4406 2236 1065 1400 3356 ...
##   ..$ Unemployment rate (in %)  : num [1:118] 7.15 5.04 4.84 3.01 4.3 ...
##   ..$ Year                      : int [1:118] 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...

The first thing we are going to do is use a function to clean the names of the datasets. These names are not very easy to work with; there are spaces, and it would be better if the names of the columns were all lowercase. For this we are going to use the function clean_names() from the janitor package. For a single dataset, I would write this:

library(janitor)

one_dataset <- one_dataset %>%
  clean_names()

and I would get a dataset with column names in lowercase and spaces replaced by _ (and other corrections). How can I apply, or map, this function to each dataset in the list? To do this I need to use purrr::map():

library(purrr)

all_datasets <- all_datasets %>%
  map(clean_names)

all_datasets %>%
  glimpse()
## List of 4
##  $ unemp_2013:'data.frame':  118 obs. of  8 variables:
##   ..$ commune                     : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
##   ..$ total_employed_population   : int [1:118] 223407 17802 1703 844 1431 4094 2146 971 1218 3002 ...
##   ..$ of_which_wage_earners       : int [1:118] 203535 15993 1535 750 1315 3800 1874 858 1029 2664 ...
##   ..$ of_which_non_wage_earners   : int [1:118] 19872 1809 168 94 116 294 272 113 189 338 ...
##   ..$ unemployed                  : int [1:118] 19287 1071 114 25 74 261 98 45 66 207 ...
##   ..$ active_population           : int [1:118] 242694 18873 1817 869 1505 4355 2244 1016 1284 3209 ...
##   ..$ unemployment_rate_in_percent: num [1:118] 7.95 5.67 6.27 2.88 4.92 ...
##   ..$ year                        : int [1:118] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ unemp_2014:'data.frame':  118 obs. of  8 variables:
##   ..$ commune                     : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
##   ..$ total_employed_population   : int [1:118] 228423 18166 1767 845 1505 4129 2172 1007 1268 3124 ...
##   ..$ of_which_wage_earners       : int [1:118] 208238 16366 1606 757 1390 3840 1897 887 1082 2782 ...
##   ..$ of_which_non_wage_earners   : int [1:118] 20185 1800 161 88 115 289 275 120 186 342 ...
##   ..$ unemployed                  : int [1:118] 19362 1066 122 19 66 287 91 38 61 202 ...
##   ..$ active_population           : int [1:118] 247785 19232 1889 864 1571 4416 2263 1045 1329 3326 ...
##   ..$ unemployment_rate_in_percent: num [1:118] 7.81 5.54 6.46 2.2 4.2 ...
##   ..$ year                        : int [1:118] 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
##  $ unemp_2015:'data.frame':  118 obs. of  8 variables:
##   ..$ commune                     : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
##   ..$ total_employed_population   : int [1:118] 233130 18310 1780 870 1470 4130 2170 1050 1300 3140 ...
##   ..$ of_which_wage_earners       : int [1:118] 212530 16430 1620 780 1350 3820 1910 920 1100 2770 ...
##   ..$ of_which_non_wage_earners   : int [1:118] 20600 1880 160 90 120 310 260 130 200 370 ...
##   ..$ unemployed                  : int [1:118] 18806 988 106 29 73 260 80 41 72 169 ...
##   ..$ active_population           : int [1:118] 251936 19298 1886 899 1543 4390 2250 1091 1372 3309 ...
##   ..$ unemployment_rate_in_percent: num [1:118] 7.46 5.12 5.62 3.23 4.73 ...
##   ..$ year                        : int [1:118] 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ unemp_2016:'data.frame':  118 obs. of  8 variables:
##   ..$ commune                     : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
##   ..$ total_employed_population   : int [1:118] 236100 18380 1790 870 1470 4160 2160 1030 1330 3150 ...
##   ..$ of_which_wage_earners       : int [1:118] 215430 16500 1640 780 1350 3840 1900 900 1130 2780 ...
##   ..$ of_which_non_wage_earners   : int [1:118] 20670 1880 150 90 120 320 260 130 200 370 ...
##   ..$ unemployed                  : int [1:118] 18185 975 91 27 66 246 76 35 70 206 ...
##   ..$ active_population           : int [1:118] 254285 19355 1881 897 1536 4406 2236 1065 1400 3356 ...
##   ..$ unemployment_rate_in_percent: num [1:118] 7.15 5.04 4.84 3.01 4.3 ...
##   ..$ year                        : int [1:118] 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...

So now, what if I want to know, for each dataset, which communes have an unemployment rate that is less than, say, 3%? For a single dataset I would do something like this:

one_dataset %>%
  filter(unemployment_rate_in_percent < 3)

But for a list of datasets, map() is needed (and as you will see, that is not all that is needed):

all_datasets %>%
  map(~filter(., unemployment_rate_in_percent < 3))
## $unemp_2013
##      commune total_employed_population of_which_wage_earners
## 1    Garnich                       844                   750
## 2 Leudelange                      1064                   937
## 3       Bech                       526                   463
##   of_which_non_wage_earners unemployed active_population
## 1                        94         25               869
## 2                       127         32              1096
## 3                        63         16               542
##   unemployment_rate_in_percent year
## 1                     2.876870 2013
## 2                     2.919708 2013
## 3                     2.952030 2013
## 
## $unemp_2014
##      commune total_employed_population of_which_wage_earners
## 1    Garnich                       845                   757
## 2 Leudelange                      1102                   965
## 3       Bech                       543                   476
## 4 Flaxweiler                       879                   789
##   of_which_non_wage_earners unemployed active_population
## 1                        88         19               864
## 2                       137         34              1136
## 3                        67         15               558
## 4                        90         27               906
##   unemployment_rate_in_percent year
## 1                     2.199074 2014
## 2                     2.992958 2014
## 3                     2.688172 2014
## 4                     2.980132 2014
## 
## $unemp_2015
##   commune total_employed_population of_which_wage_earners
## 1    Bech                       520                   450
## 2    Bous                       750                   680
##   of_which_non_wage_earners unemployed active_population
## 1                        70         14               534
## 2                        70         22               772
##   unemployment_rate_in_percent year
## 1                     2.621723 2015
## 2                     2.849741 2015
## 
## $unemp_2016
##             commune total_employed_population of_which_wage_earners
## 1 Reckange-sur-Mess                       980                   850
## 2              Bech                       520                   450
## 3          Betzdorf                      1500                  1350
## 4        Flaxweiler                       910                   820
##   of_which_non_wage_earners unemployed active_population
## 1                       130         30              1010
## 2                        70         11               531
## 3                       150         45              1545
## 4                        90         24               934
##   unemployment_rate_in_percent year
## 1                     2.970297 2016
## 2                     2.071563 2016
## 3                     2.912621 2016
## 4                     2.569593 2016

I know what you’re thinking… what the hell? Let me explain: map() needs a function to map to each element of the list. all_datasets is the list to which I want to map the function. But what function? filter() is the function I need, so why doesn’t:

all_datasets %>%
  map(filter(unemployment_rate_in_percent < 3))

work? This is a bit complicated, and has to do with a concept called environments. If you try to run the code above, you will get this error message:

Error in filter(unemployment_rate_in_percent < 3) :
  object 'unemployment_rate_in_percent' not found

I won’t go into details, but by writing ~filter(., unemployment_rate_in_percent < 3), which is a formula (~ is the symbol to define formulas, more on this in later chapters), map() converts it to a function that it can use. If you want to know more about this, you can read about it in Advanced R by Hadley Wickham, but it is an advanced topic.
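
If the formula notation feels too magical, note that you can achieve the same result by passing an anonymous function yourself:

all_datasets %>%
  map(function(x) filter(x, unemployment_rate_in_percent < 3))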

Before merging these datasets together, we would need them to have a year column indicating the year (which, as we saw above, they already have). It would also be helpful if we gave names to these datasets. For this task, we can use purrr::set_names():

all_datasets <- set_names(all_datasets, as.character(seq(2013, 2016)))

Let’s take a look at the list now:

str(all_datasets)

As you can see, each data.frame object contained in the list has been renamed. You can thus access them with the $ operator.
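
For instance, a quick check (note the backticks, which are needed because the names start with a digit):

all_datasets$`2013` %>%
  glimpse()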

Using map() we now know how to apply a function to each dataset of a list. But maybe it would be easier to merge all the datasets first, and then manipulate them? Before that though, I am going to teach you how to use purrr::reduce(), another very powerful function that works on lists (we already met it in section 7.11.2). You can find this function in other programming languages too, where it is sometimes called fold.

I think that the following example illustrates the power of reduce() well:

numbers <- seq(1, 5) # Create a vector with the numbers 1 to 5

reduce(numbers, `+`, .init = 0)
## [1] 15

reduce() takes a function as an argument, here the function `+` (see footnote 1), and then does the following computation:

0 + numbers[1] + numbers[2] + numbers[3]...

It applies the user-supplied function successively, but it has to start with something, so we also give it the .init argument. This argument is actually optional, but I show it here because in some cases it might be useful to start the computations at a value other than 0.

reduce() generalizes functions that only take two arguments. If you were to write a function that returns the minimum between two numbers:

my_min <- function(a, b){
    if(a < b){
        return(a)
    } else {
        return(b)
    }
}

You could use reduce() to get the minimum of a list of numbers:

numbers2 <- c(3, 1, -8, 9)

reduce(numbers2, my_min)
## [1] -8

As long as you provide a function and a list of elements to reduce(), you will get a single output. So how could reduce() help us with merging all the datasets that are in the list? dplyr comes with a lot of functions to merge two datasets. Remember that I said before that reduce() allows you to generalize a function of two arguments? Let’s try it with our list of datasets:

unemp_lux <- reduce(all_datasets, full_join)
## Joining, by = c("commune", "total_employed_population", "of_which_wage_earners", "of_which_non_wage_earners", "unemployed", "active_population", "unemployment_rate_in_percent", "year")
## Joining, by = c("commune", "total_employed_population", "of_which_wage_earners", "of_which_non_wage_earners", "unemployed", "active_population", "unemployment_rate_in_percent", "year")
## Joining, by = c("commune", "total_employed_population", "of_which_wage_earners", "of_which_non_wage_earners", "unemployed", "active_population", "unemployment_rate_in_percent", "year")
glimpse(unemp_lux)
## Rows: 472
## Columns: 8
## $ commune                      <chr> "Grand-Duche de Luxembourg", "Canton Cap…
## $ total_employed_population    <int> 223407, 17802, 1703, 844, 1431, 4094, 21…
## $ of_which_wage_earners        <int> 203535, 15993, 1535, 750, 1315, 3800, 18…
## $ of_which_non_wage_earners    <int> 19872, 1809, 168, 94, 116, 294, 272, 113…
## $ unemployed                   <int> 19287, 1071, 114, 25, 74, 261, 98, 45, 6…
## $ active_population            <int> 242694, 18873, 1817, 869, 1505, 4355, 22…
## $ unemployment_rate_in_percent <dbl> 7.947044, 5.674773, 6.274078, 2.876870, …
## $ year                         <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013…

full_join() is one of the dplyr functions that merge data. There are others that might be useful depending on the kind of join operation you need. Let’s write this data to disk as we’re going to keep using it for the next chapters:

export(unemp_lux, "datasets/unemp_lux.csv")

7.12.1 Functional programming and plotting

In this section, we are going to learn how to use the possibilities offered by the purrr package and how it can work together with ggplot2 to generate many plots. This is a more advanced topic, but what comes next is also what makes R, and the functional programming paradigm, so powerful.

For example, suppose that instead of wanting a single plot with the unemployment rate of each commune, you need one unemployment plot per commune:

unemp_lux_data %>%
  filter(division == "Luxembourg") %>%
  ggplot(aes(year, unemployment_rate_in_percent, group = division)) +
  theme_minimal() +
  labs(title = "Unemployment in Luxembourg", x = "Year", y = "Rate") +
  geom_line()

and then you would write the same for “Esch-sur-Alzette” and also for “Wiltz”. If you only have to make these 3 plots, copying and pasting the above lines is no big deal:

unemp_lux_data %>%
  filter(division == "Esch-sur-Alzette") %>%
  ggplot(aes(year, unemployment_rate_in_percent, group = division)) +
  theme_minimal() +
  labs(title = "Unemployment in Esch-sur-Alzette", x = "Year", y = "Rate") +
  geom_line()

unemp_lux_data %>%
  filter(division == "Wiltz") %>%
  ggplot(aes(year, unemployment_rate_in_percent, group = division)) +
  theme_minimal() +
  labs(title = "Unemployment in Esch-sur-Alzette", x = "Year", y = "Rate") +
  geom_line()

But copy and pasting is error-prone. Can you spot the copy-paste mistake I made? And what if you have to create the above plots for all 108 Luxembourgish communes? That’s a lot of copy-pasting. What if, once you are done copy-pasting, you have to change something, for example, the theme? You could use the search and replace function of RStudio, true, but sometimes search and replace can also introduce bugs and typos. You can avoid all these issues by using purrr::map(). What do you need to map over? The commune names. So let’s create a list of commune names:

communes <- list("Luxembourg", "Esch-sur-Alzette", "Wiltz")

Now we can create the graphs using map(), or map2() to be exact:

plots_tibble <- unemp_lux_data %>%
  filter(division %in% communes) %>%
  group_by(division) %>%
  nest() %>%
  mutate(plot = map2(.x = data, .y = division, ~ggplot(data = .x) +
       theme_minimal() +
       geom_line(aes(year, unemployment_rate_in_percent, group = 1)) +
       labs(title = paste("Unemployment in", .y))))

Let’s study this line by line: the first line is easy, we simply use filter() to keep only the communes we are interested in. Then we group by division and use tidyr::nest(). As a refresher, let’s take a look at what this does:

unemp_lux_data %>%
  filter(division %in% communes) %>%
  group_by(division) %>%
  nest()
## # A tibble: 3 x 2
## # Groups:   division [3]
##   division         data             
##   <chr>            <list>           
## 1 Esch-sur-Alzette <tibble [15 × 7]>
## 2 Luxembourg       <tibble [15 × 7]>
## 3 Wiltz            <tibble [15 × 7]>

This creates a tibble with two columns, division and data, where, for each individual (or commune in this case), the data column contains another tibble with all the original variables. This is very useful, because now we can pass these tibbles to map2(), to generate the plots. But why map2() and what’s the difference with map()? map2() works the same way as map(), but maps over two inputs:

numbers1 <- list(1, 2, 3, 4, 5)

numbers2 <- list(9, 8, 7, 6, 5)

map2(numbers1, numbers2, `*`)
## [[1]]
## [1] 9
## 
## [[2]]
## [1] 16
## 
## [[3]]
## [1] 21
## 
## [[4]]
## [1] 24
## 
## [[5]]
## [1] 25

In our example with the graphs, the two inputs are the data, and the names of the communes. This is useful to create the title with labs(title = paste("Unemployment in", .y)), where .y is the second input of map2(): the commune names contained in the variable division.

So what happened? We now have a tibble called plots_tibble that looks like this:

print(plots_tibble)
## # A tibble: 3 x 3
## # Groups:   division [3]
##   division         data              plot  
##   <chr>            <list>            <list>
## 1 Esch-sur-Alzette <tibble [15 × 7]> <gg>  
## 2 Luxembourg       <tibble [15 × 7]> <gg>  
## 3 Wiltz            <tibble [15 × 7]> <gg>

This tibble contains three columns, division, data and now a new one called plot, which we created using the last line, mutate(plot = ...) (remember that mutate() adds columns to tibbles). plot is a list-column, with elements… being plots! Yes you read that right, the elements of the column plot are literally plots. This is what I meant by list-columns. Let’s see what is inside the data and the plot columns exactly:

plots_tibble %>%
  pull(data)
## [[1]]
## # A tibble: 15 x 7
##     year active_populati… of_which_non_wa… of_which_wage_e… total_employed_…
##    <int>            <dbl>            <dbl>            <dbl>            <dbl>
##  1  2001             11.3              665             10.1             10.8
##  2  2002             11.7              677             10.3             11.0
##  3  2003             11.7              674             10.2             10.9
##  4  2004             12.2              659             10.6             11.3
##  5  2005             11.9              654             10.3             11.0
##  6  2006             12.2              657             10.5             11.2
##  7  2007             12.6              634             10.9             11.5
##  8  2008             12.9              638             11.0             11.6
##  9  2009             13.2              652             11.0             11.7
## 10  2010             13.6              638             11.2             11.8
## 11  2011             13.9              630             11.5             12.1
## 12  2012             14.3              684             11.8             12.5
## 13  2013             14.8              694             12.0             12.7
## 14  2014             15.2              703             12.5             13.2
## 15  2015             15.3              710             12.6             13.3
## # … with 2 more variables: unemployed <dbl>, unemployment_rate_in_percent <dbl>
## 
## [[2]]
## # A tibble: 15 x 7
##     year active_populati… of_which_non_wa… of_which_wage_e… total_employed_…
##    <int>            <dbl>            <dbl>            <dbl>            <dbl>
##  1  2001             34.4             2.89             30.4             33.2
##  2  2002             34.8             2.94             30.3             33.2
##  3  2003             35.2             3.03             30.1             33.2
##  4  2004             35.6             3.06             30.1             33.2
##  5  2005             35.6             3.13             29.8             33.0
##  6  2006             35.5             3.12             30.3             33.4
##  7  2007             36.1             3.25             31.1             34.4
##  8  2008             37.5             3.39             31.9             35.3
##  9  2009             37.9             3.49             31.6             35.1
## 10  2010             38.6             3.54             32.1             35.7
## 11  2011             40.3             3.66             33.6             37.2
## 12  2012             41.8             3.81             34.6             38.4
## 13  2013             43.4             3.98             35.5             39.5
## 14  2014             44.6             4.11             36.7             40.8
## 15  2015             45.2             4.14             37.5             41.6
## # … with 2 more variables: unemployed <dbl>, unemployment_rate_in_percent <dbl>
## 
## [[3]]
## # A tibble: 15 x 7
##     year active_populati… of_which_non_wa… of_which_wage_e… total_employed_…
##    <int>            <dbl>            <dbl>            <dbl>            <dbl>
##  1  2001             2.13              223             1.79             2.01
##  2  2002             2.14              220             1.78             2.00
##  3  2003             2.18              223             1.79             2.02
##  4  2004             2.24              227             1.85             2.08
##  5  2005             2.26              229             1.85             2.08
##  6  2006             2.20              206             1.82             2.02
##  7  2007             2.27              198             1.88             2.08
##  8  2008             2.30              200             1.90             2.10
##  9  2009             2.36              201             1.94             2.15
## 10  2010             2.42              195             1.97             2.17
## 11  2011             2.48              190             2.02             2.21
## 12  2012             2.59              188             2.10             2.29
## 13  2013             2.66              195             2.15             2.34
## 14  2014             2.69              185             2.19             2.38
## 15  2015             2.77              180             2.27             2.45
## # … with 2 more variables: unemployed <dbl>, unemployment_rate_in_percent <dbl>

Each element of data is a tibble for a specific commune with columns year, active_population, etc., the original columns. But obviously, there is no division column. So to plot the data, and join all the dots together, we need to add group = 1 in the call to aes() (whereas if you plot multiple lines in the same graph, you need to write group = division).

But more interestingly, how can you actually see the plots? If you want to simply look at them, it is enough to use pull():

plots_tibble %>%
  pull(plot)
## [[1]]

## 
## [[2]]

## 
## [[3]]

And if we want to save these plots, we can do so using map2():

map2(paste0(plots_tibble$division, ".pdf"), plots_tibble$plot, ggsave)
Saving 7 x 5 in image
Saving 6.01 x 3.94 in image
Saving 6.01 x 3.94 in image
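
As an aside, since we call ggsave() purely for its side effect of writing files to disk, purrr::walk2(), which works like map2() but is meant for side effects and returns its input invisibly, is arguably a cleaner choice here:

walk2(paste0(plots_tibble$division, ".pdf"), plots_tibble$plot, ggsave)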

This was probably the most advanced topic we have studied yet, but you probably agree with me that it is among the most useful ones. This section is a perfect illustration of the power of functional programming; you can mix and match functions as long as you give them the correct arguments. You can pass data to functions that use data and then pass these functions to other functions that use functions as arguments, such as map() (see footnote 2). map() does not care if the function you pass to it produces tables, graphs or even another function. map() will simply map this function to a list of inputs, and as long as these inputs are correct arguments to the function, map() will do its magic. If you combine this with list-columns, you can even use map() alongside dplyr functions and map your function by first grouping, filtering, etc…

7.12.2 Modeling with functional programming

As written just above, map() simply applies a function to a list of inputs, and in the previous section we mapped ggplot() to generate many plots at once. This approach can also be used to map any modeling function, for instance lm(), to a list of datasets.
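
For instance, a minimal sketch (the two subsets of mtcars are arbitrary stand-ins for a list of datasets):

datasets <- list(mtcars[1:16, ], mtcars[17:32, ])

# Fit the same linear model to each dataset in the list
map(datasets, ~lm(mpg ~ wt, data = .x))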

For instance, suppose that you wish to perform a Monte Carlo simulation, and that you are dealing with a binary choice problem; usually, you would use a logistic regression for this.

However, in certain disciplines, especially in the social sciences, the so-called Linear Probability Model is often used as well. The LPM is a simple linear regression, but unlike the standard setting of a linear regression, the dependent variable, or target, is a binary variable, and not a continuous variable. Before you yell “Wait, that’s illegal”, you should know that in practice LPMs do a good job of estimating marginal effects, which is what social scientists and econometricians are often interested in. Marginal effects are another way of interpreting models, showing how the outcome (or target) changes given a change in an independent variable (or feature). For instance, a marginal effect of 0.10 for age would mean that the probability of success increases by 10 percentage points for each added year of age. We already discussed marginal effects in Chapter 6.
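
To make this concrete, for a logit model the marginal effect of a feature at a given point is the logistic density evaluated at the linear index, multiplied by the feature’s coefficient. A small sketch, with made-up coefficients:

# Hypothetical coefficients: intercept 0.5, slope 2 (made-up values)
b0 <- 0.5
b1 <- 2

# Marginal effect of the feature at x = 0: logistic density at the
# linear index, times the coefficient (about 0.47 here)
dlogis(b0 + b1 * 0) * b1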

There has been a lot of discussion on logistic regression vs LPMs, and there are pros and cons of using LPMs. Micro-econometricians are still fond of LPMs, even though the pros of LPMs are not really convincing. However, quoting Angrist and Pischke:

“While a nonlinear model may fit the CEF (population conditional expectation function) for LDVs (limited dependent variables) more closely than a linear model, when it comes to marginal effects, this probably matters little” (source: Mostly Harmless Econometrics)

so LPMs are still used for estimating marginal effects.

Let us check this assessment with one example. First, we simulate some data, then run a logistic regression and compute the marginal effects, and then compare with an LPM:

set.seed(1234)
x1 <- rnorm(100)
x2 <- rnorm(100)
  
z <- .5 + 2*x1 + 4*x2

p <- 1/(1 + exp(-z))

y <- rbinom(100, 1, p)

df <- tibble(y = y, x1 = x1, x2 = x2)

This data generating process generates data from a binary choice model. Fitting the model using a logistic regression allows us to recover the structural parameters:

logistic_regression <- glm(y ~ ., data = df, family = binomial(link = "logit"))

Let’s see a summary of the model fit:

summary(logistic_regression)
## 
## Call:
## glm(formula = y ~ ., family = binomial(link = "logit"), data = df)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.91941  -0.44872   0.00038   0.42843   2.55426  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.0960     0.3293   0.292 0.770630    
## x1            1.6625     0.4628   3.592 0.000328 ***
## x2            3.6582     0.8059   4.539 5.64e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 138.629  on 99  degrees of freedom
## Residual deviance:  60.576  on 97  degrees of freedom
## AIC: 66.576
## 
## Number of Fisher Scoring iterations: 7

We roughly recover the parameters that generated the data, but what about the marginal effects? We can get the marginal effects easily using the {margins} package:

library(margins)

margins(logistic_regression)
## Average marginal effects
## glm(formula = y ~ ., family = binomial(link = "logit"), data = df)
##      x1     x2
##  0.1598 0.3516

Or, even better, we can compute the true marginal effects, since we know the data generating process:

meffects <- function(dataset, coefs){
  # Keep only the features, as a matrix
  X <- dataset %>%
    select(-y) %>%
    as.matrix()

  # Average marginal effect of each feature: mean of the logistic density
  # evaluated at the linear index, times the feature's coefficient
  dydx_x1 <- mean(dlogis(X %*% c(coefs[2], coefs[3])) * coefs[2])
  dydx_x2 <- mean(dlogis(X %*% c(coefs[2], coefs[3])) * coefs[3])

  tribble(~term, ~true_effect,
          "x1", dydx_x1,
          "x2", dydx_x2)
}

(true_meffects <- meffects(df, c(0.5, 2, 4)))
## # A tibble: 2 x 2
##   term  true_effect
##   <chr>       <dbl>
## 1 x1          0.175
## 2 x2          0.350

Ok, so now what about using this infamous Linear Probability Model to estimate the marginal effects?

lpm <- lm(y ~ ., data = df)

summary(lpm)
## 
## Call:
## lm(formula = y ~ ., data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.83953 -0.31588 -0.02885  0.28774  0.77407 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.51340    0.03587  14.314  < 2e-16 ***
## x1           0.16771    0.03545   4.732 7.58e-06 ***
## x2           0.31250    0.03449   9.060 1.43e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3541 on 97 degrees of freedom
## Multiple R-squared:  0.5135, Adjusted R-squared:  0.5034 
## F-statistic: 51.18 on 2 and 97 DF,  p-value: 6.693e-16

It’s not too bad, but maybe it could have been better in other circumstances. Perhaps if we had more observations, or for a different set of structural parameters, the results of the LPM would have been closer. The LPM estimates the marginal effect of x1 to be 0.1677134, versus 0.1597956 for the logistic regression; for x2, the LPM estimate is 0.3124966 versus 0.351607. The true marginal effects are 0.1750963 and 0.3501926 for x1 and x2 respectively.

Just as data scientists perform cross-validation to assess the accuracy of a model, we can perform a Monte Carlo study to assess how close the estimation of the marginal effects using an LPM is to the marginal effects derived from a logistic regression. This will allow us to test with datasets of different sizes, generated using different structural parameters.

First, let’s write a function that generates data. The function below generates 10 datasets of size 100 (the code is inspired by this StackExchange answer):

generate_datasets <- function(coefs = c(.5, 2, 4), sample_size = 100, repeats = 10){

  generate_one_dataset <- function(coefs, sample_size){
    x1 <- rnorm(sample_size)
    x2 <- rnorm(sample_size)

    z <- coefs[1] + coefs[2]*x1 + coefs[3]*x2

    p <- 1/(1 + exp(-z))

    y <- rbinom(sample_size, 1, p)

    tibble(y = y, x1 = x1, x2 = x2)
  }

  # rerun() evaluates the expression `repeats` times and returns a list
  simulations <- rerun(.n = repeats, generate_one_dataset(coefs, sample_size))

  tibble("coefs" = list(coefs), "sample_size" = sample_size, "repeats" = repeats, "simulations" = list(simulations))
}

Let’s first generate one dataset:

one_dataset <- generate_datasets(repeats = 1)

Let’s take a look at one_dataset:

one_dataset
## # A tibble: 1 x 4
##   coefs     sample_size repeats simulations
##   <list>          <dbl>   <dbl> <list>     
## 1 <dbl [3]>         100       1 <list [1]>

As you can see, the tibble with the simulated data is inside a list-column called simulations. Let’s take a closer look:

str(one_dataset$simulations)
## List of 1
##  $ :List of 1
##   ..$ : tibble [100 × 3] (S3: tbl_df/tbl/data.frame)
##   .. ..$ y : int [1:100] 0 1 1 1 0 1 1 0 0 1 ...
##   .. ..$ x1: num [1:100] 0.437 1.06 0.452 0.663 -1.136 ...
##   .. ..$ x2: num [1:100] -2.316 0.562 -0.784 -0.226 -1.587 ...

The structure is quite complex, and it’s important to understand this, because it will have an impact on the next lines of code; it is a list, containing a list, containing a dataset! No worries though, we can still map over the datasets directly, by using modify_depth() instead of map().
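
A minimal illustration of modify_depth() on a toy nested list:

# A list containing a list containing two vectors
nested <- list(list(1:3, 4:6))

# Apply sum() to the elements sitting at depth 2
modify_depth(nested, 2, sum)
## [[1]]
## [[1]][[1]]
## [1] 6
## 
## [[1]][[2]]
## [1] 15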

Now, let’s fit an LPM and compare the estimation of the marginal effects with the true marginal effects. In order to have some confidence in our results, we will not simply run a linear regression on that single dataset, but will instead simulate hundreds, then thousands, then tens of thousands of datasets, get the marginal effects and compare them to the true ones (although here I won’t simulate more than 500 datasets).

Let’s first generate 10 datasets:

many_datasets <- generate_datasets()

Now comes the tricky part. I have this object, many_datasets, which looks like this:

many_datasets
## # A tibble: 1 x 4
##   coefs     sample_size repeats simulations
##   <list>          <dbl>   <dbl> <list>     
## 1 <dbl [3]>         100      10 <list [10]>

I would like to fit LPMs to the 10 datasets. For this, I will need to use all the power of functional programming and the {tidyverse}. I will be adding columns to this data frame using mutate() and mapping over the simulations list-column using modify_depth(). The list of data frames is at the second level (remember, it’s a list containing a list containing data frames).

I’ll start by fitting the LPMs, then using broom::tidy() I will get a nice data frame of the estimated parameters. I will then only select what I need, and then bind the rows of all the data frames. I will do the same for the true marginal effects.
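
If you have never used broom::tidy(), this sketch shows what it returns for a single fit (the model is arbitrary):

broom::tidy(lm(mpg ~ wt, data = mtcars))
# a tibble with the columns term, estimate, std.error, statistic and
# p.value, one row per coefficient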

I highly suggest that you run the following lines, one after another. It is complicated to understand what’s going on if you are not used to such workflows. However, I hope to convince you that once it clicks, it’ll feel much more intuitive than doing all this inside a loop. Here’s the code:

results <- many_datasets %>% 
  mutate(lpm = modify_depth(simulations, 2, ~lm(y ~ ., data = .x))) %>% 
  mutate(lpm = modify_depth(lpm, 2, broom::tidy)) %>% 
  mutate(lpm = modify_depth(lpm, 2, ~select(., term, estimate))) %>% 
  mutate(lpm = modify_depth(lpm, 2, ~filter(., term != "(Intercept)"))) %>% 
  mutate(lpm = map(lpm, bind_rows)) %>% 
  mutate(true_effect = modify_depth(simulations, 2, ~meffects(., coefs = coefs[[1]]))) %>% 
  mutate(true_effect = map(true_effect, bind_rows))

This is what results looks like:

results
## # A tibble: 1 x 6
##   coefs     sample_size repeats simulations lpm               true_effect      
##   <list>          <dbl>   <dbl> <list>      <list>            <list>           
## 1 <dbl [3]>         100      10 <list [10]> <tibble [20 × 2]> <tibble [20 × 2]>

Let’s take a closer look at the lpm and true_effect columns:

results$lpm
## [[1]]
## # A tibble: 20 x 2
##    term  estimate
##    <chr>    <dbl>
##  1 x1       0.228
##  2 x2       0.353
##  3 x1       0.180
##  4 x2       0.361
##  5 x1       0.165
##  6 x2       0.374
##  7 x1       0.182
##  8 x2       0.358
##  9 x1       0.125
## 10 x2       0.345
## 11 x1       0.171
## 12 x2       0.331
## 13 x1       0.122
## 14 x2       0.309
## 15 x1       0.129
## 16 x2       0.332
## 17 x1       0.102
## 18 x2       0.374
## 19 x1       0.176
## 20 x2       0.410
results$true_effect
## [[1]]
## # A tibble: 20 x 2
##    term  true_effect
##    <chr>       <dbl>
##  1 x1          0.183
##  2 x2          0.366
##  3 x1          0.166
##  4 x2          0.331
##  5 x1          0.174
##  6 x2          0.348
##  7 x1          0.169
##  8 x2          0.339
##  9 x1          0.167
## 10 x2          0.335
## 11 x1          0.173
## 12 x2          0.345
## 13 x1          0.157
## 14 x2          0.314
## 15 x1          0.170
## 16 x2          0.340
## 17 x1          0.182
## 18 x2          0.365
## 19 x1          0.161
## 20 x2          0.321

Let’s bind the columns, and compute the difference between the true and estimated marginal effects:

simulation_results <- results %>% 
  mutate(difference = map2(.x = lpm, .y = true_effect, full_join)) %>%  
  mutate(difference = map(difference, ~mutate(., difference = true_effect - estimate))) %>% 
  mutate(difference = map(difference, ~select(., term, difference))) %>% 
  pull(difference) %>% 
  .[[1]]
## Joining, by = "term"

Let’s take a look at the simulation results:

simulation_results %>% 
  group_by(term) %>% 
  summarise(mean = mean(difference), 
            sd = sd(difference))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 3
##   term     mean     sd
##   <chr>   <dbl>  <dbl>
## 1 x1     0.0122 0.0368
## 2 x2    -0.0141 0.0311

Already with only 10 simulated datasets, the mean differences between the true and estimated effects are close to zero. Let’s rerun the analysis, but with more repetitions and a different set of structural parameters. In order to make things easier, we can put all the code into a nifty function:

monte_carlo <- function(coefs, sample_size, repeats){
  many_datasets <- generate_datasets(coefs, sample_size, repeats)
  
  results <- many_datasets %>% 
    mutate(lpm = modify_depth(simulations, 2, ~lm(y ~ ., data = .x))) %>% 
    mutate(lpm = modify_depth(lpm, 2, broom::tidy)) %>% 
    mutate(lpm = modify_depth(lpm, 2, ~select(., term, estimate))) %>% 
    mutate(lpm = modify_depth(lpm, 2, ~filter(., term != "(Intercept)"))) %>% 
    mutate(lpm = map(lpm, bind_rows)) %>% 
    mutate(true_effect = modify_depth(simulations, 2, ~meffects(., coefs = coefs[[1]]))) %>% 
    mutate(true_effect = map(true_effect, bind_rows))

  simulation_results <- results %>% 
    mutate(difference = map2(.x = lpm, .y = true_effect, full_join)) %>% 
    mutate(difference = map(difference, ~mutate(., difference = true_effect - estimate))) %>% 
    mutate(difference = map(difference, ~select(., term, difference))) %>% 
    pull(difference) %>% 
    .[[1]]

  simulation_results %>% 
    group_by(term) %>% 
    summarise(mean = mean(difference), 
              sd = sd(difference))
}

And now, let’s run the simulation for different parameters and sizes:

monte_carlo(c(.5, 2, 4), 100, 10)
## Joining, by = "term"
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 3
##   term      mean     sd
##   <chr>    <dbl>  <dbl>
## 1 x1    -0.00826 0.0318
## 2 x2    -0.00732 0.0421
monte_carlo(c(.5, 2, 4), 100, 100)
## Joining, by = "term"
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 3
##   term     mean     sd
##   <chr>   <dbl>  <dbl>
## 1 x1    0.00360 0.0408
## 2 x2    0.00517 0.0459
monte_carlo(c(.5, 2, 4), 100, 500)
## Joining, by = "term"
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 3
##   term       mean     sd
##   <chr>     <dbl>  <dbl>
## 1 x1    -0.00152  0.0388
## 2 x2    -0.000701 0.0462
monte_carlo(c(pi, 6, 9), 100, 10)
## Joining, by = "term"
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 3
##   term      mean     sd
##   <chr>    <dbl>  <dbl>
## 1 x1    -0.00829 0.0421
## 2 x2     0.00178 0.0397
monte_carlo(c(pi, 6, 9), 100, 100)
## Joining, by = "term"
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 3
##   term     mean     sd
##   <chr>   <dbl>  <dbl>
## 1 x1    0.0107  0.0576
## 2 x2    0.00831 0.0772
monte_carlo(c(pi, 6, 9), 100, 500)
## Joining, by = "term"
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 3
##   term     mean     sd
##   <chr>   <dbl>  <dbl>
## 1 x1    0.00879 0.0518
## 2 x2    0.0113  0.0687

We see that, at least for these sets of parameters, the LPM does a good job of estimating marginal effects.

Now, this study might in itself not be very interesting to you, but I believe the general approach is quite useful and flexible enough to be adapted to all kinds of use-cases.

7.13 Exercises

Exercise 1

Suppose you have an Excel workbook that contains data on three sheets. Create a function that reads entire workbooks, and that returns a list of tibbles, where each tibble is the data of one sheet (download the example Excel workbook, example_workbook.xlsx, from the assets folder on the book’s GitHub).

Exercise 2

Use one of the map() functions to combine two lists into one. Consider the following two lists:

mediterranean <- list("starters" = list("humous", "lasagna"), "dishes" = list("sardines", "olives"))

continental <- list("starters" = list("pea soup", "terrine"), "dishes" = list("frikadelle", "sauerkraut"))

The result we’d like to have would look like this:

$starters
$starters[[1]]
[1] "humous"

$starters[[2]]
[1] "olives"

$starters[[3]]
[1] "pea soup"

$starters[[4]]
[1] "terrine"


$dishes
$dishes[[1]]
[1] "sardines"

$dishes[[2]]
[1] "lasagna"

$dishes[[3]]
[1] "frikadelle"

$dishes[[4]]
[1] "sauerkraut"

  1. This is simply the + operator you’re used to. Try this out: `+`(1, 5) and you’ll see + is a function like any other. You just have to write backticks around the plus symbol to make it work.

  2. Functions that take other functions as input are called higher-order functions.