Chapter 5 Programming with the tidyverse
Functions are very powerful because they let us avoid repetition. To get the most out of them, we must be able to write functions that let the user abstract over certain things, such as the column names of a dataset. For example, one would like to write a function that looks like this:
my_function(my_data, column)
and in this chapter we will learn how to do that using dplyr (version 0.7.0 or above).
I advise you to also read the “Programming with dplyr” vignette here, which explains in great detail the concepts I will only skim in this chapter!
Consider the following code:
library(dplyr)
data(mtcars)
simple_function = function(dataset, col_name){
  dataset %>%
    group_by(col_name) %>%
    summarise(mean_mpg = mean(mpg)) -> dataset
  return(dataset)
}
When you try to run this:
simple_function(mtcars, "cyl")
This is the error you get:
Error in grouped_df_impl(data, unname(vars), drop) :
Column `col_name` is unknown
R is literally looking for a column called col_name in the mtcars dataset. How can we make R look up the column the user supplied (here cyl) instead of taking the argument name col_name literally?
One way is to use the enquo() function (or quo() if working interactively) in conjunction with the !! operator introduced with dplyr 0.7 (it is actually part of the rlang package, which dplyr uses seamlessly). First let’s look at the solution, and then I’ll explain how it works:
library(dplyr)
simple_function = function(dataset, col_name){
  col_name = enquo(col_name)
  dataset = dataset %>%
    group_by(!!col_name) %>%
    summarise(mean_mpg = mean(mpg))
  return(dataset)
}
simple_function(mtcars, cyl)
## # A tibble: 3 x 2
## cyl mean_mpg
## <int> <dbl>
## 1 4 26.66364
## 2 6 19.74286
## 3 8 15.10000
In the above example, I wanted the mean of mpg, computed after grouping by cyl. enquo() quotes the input: it captures the expression the user typed for col_name (here cyl) instead of evaluating it. However, group_by() (and the other dplyr functions) then needs to see the column name itself rather than the quoted object. This is what !! does: it unquotes its argument, so that group_by() receives cyl.
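To see the two steps separately, here is a small interactive sketch. It uses quo() rather than enquo() because we are outside a function; the variable names grouping_var and result are only illustrative.

```r
library(dplyr)

# quo() captures the expression `cyl` without evaluating it
grouping_var <- quo(cyl)

# !! unquotes it, so group_by() sees `cyl` rather than `grouping_var`
result <- mtcars %>%
  group_by(!!grouping_var) %>%
  summarise(mean_mpg = mean(mpg))
```

The result should have one row per distinct value of cyl, just like the output of simple_function(mtcars, cyl).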
If you want to use filter() with a value that the user has to provide, you can also do that:
simpleFunction = function(dataset, col_name, value){
  col_name = enquo(col_name)
  dataset = dataset %>%
    filter((!!col_name) == value) %>%
    summarise(mean_cyl = mean(cyl))
  return(dataset)
}
simpleFunction(mtcars, am, 1)
## mean_cyl
## 1 5.076923
There is something that you must pay attention to in the above example. Notice that I’ve written:
filter((!!col_name) == value)
and not:
filter(!!col_name == value)
I have enclosed !!col_name inside parentheses, because == has precedence over !!. Without them, R would parse the expression as !!(col_name == value).
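You can check the precedence claim with base R alone. The sketch below only parses the two expressions with quote() and inspects their structure; nothing is evaluated, so col_name and value do not need to exist. The variable names are illustrative.

```r
# quote() returns the parsed expression without evaluating it
without_parens <- quote(!!col_name == value)
with_parens <- quote((!!col_name) == value)

# Outermost call of each expression
outer_without <- as.character(without_parens[[1]])  # "!"
outer_with <- as.character(with_parens[[1]])        # "=="

# What the innermost ! is applied to when the parentheses are missing
negated <- deparse(without_parens[[2]][[2]])         # "col_name == value"
```

Without the parentheses the comparison sits inside the two ! calls, which is why the unquoting would apply to the whole comparison instead of to col_name alone.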
Let’s make this function a bit more general. I hard-coded the variable cyl inside the body of the function, but what if you need more flexibility and want to let the user provide the variable to average?
simpleFunction = function(dataset, group_col, mean_col, value){
  group_col = enquo(group_col)
  mean_col = enquo(mean_col)
  dataset = dataset %>%
    filter((!!group_col) == value) %>%
    summarise(mean((!!mean_col)))
  return(dataset)
}
simpleFunction(mtcars, am, cyl, 1)
## mean((cyl))
## 1 5.076923
It is possible to set the name of the column in the output by using := instead of =:
simpleFunction = function(dataset, group_col, mean_col, value){
  group_col = enquo(group_col)
  mean_col = enquo(mean_col)
  mean_name = paste0("mean_", mean_col)[2]
  dataset %>%
    filter((!!group_col) == value) %>%
    summarise(!!mean_name := mean((!!mean_col))) -> dataset
  return(dataset)
}
simpleFunction(mtcars, am, cyl, 1)
## mean_cyl
## 1 5.076923
To get the name of the column I added this line:
mean_name = paste0("mean_", mean_col)[2]
To see what it does, try the following inside an R interpreter (remember to use quo() instead of enquo() outside functions!):
paste0("mean_", quo(cyl))
## [1] "mean_~" "mean_cyl"
enquo() quotes the input, and paste0() converts it to a string that can be used as a column name. However, the ~ is in the way, and the output of paste0() is a vector of two strings: the correct name is contained in the second element, hence the [2]. There might be a more elegant way of doing that, but for now this has been working well for me.
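One possibly more elegant route, sketched here under the assumption that your installed rlang is recent enough to provide as_name(), is to convert the quoted column to a string directly instead of relying on how paste0() happens to treat a quosure. The function name mean_by_value and the argument name filter_col are hypothetical, chosen only for this illustration.

```r
library(dplyr)

# Hypothetical variant of simpleFunction(); assumes rlang::as_name() is available
mean_by_value <- function(dataset, filter_col, mean_col, value) {
  filter_col <- enquo(filter_col)
  mean_col <- enquo(mean_col)
  # as_name() turns the quoted symbol into a plain string, e.g. "cyl"
  mean_name <- paste0("mean_", rlang::as_name(mean_col))
  dataset %>%
    filter((!!filter_col) == value) %>%
    summarise(!!mean_name := mean(!!mean_col))
}

result <- mean_by_value(mtcars, am, cyl, 1)
```

This should give the same single column, mean_cyl, as the version above, without indexing into the output of paste0().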
It is also possible to write functions that take any number of variables the user wants:
nice_function = function(dataset, ...){
  variables = quos(...)
  dataset %>%
    summarise_at(vars(!!!variables), mean)
}
nice_function(mtcars, mpg, cyl)
## mpg cyl
## 1 20.09062 6.1875
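If you work with dplyr 1.0.0 or later, the same idea can be sketched with across(), which takes over the role of summarise_at() in those versions. This is only an illustration under that version assumption; mean_of is a hypothetical name.

```r
library(dplyr)

# Hypothetical helper; assumes dplyr >= 1.0.0 so that across() exists
mean_of <- function(dataset, ...) {
  dataset %>%
    # c(...) forwards the user's column selection to across()
    summarise(across(c(...), mean))
}

result <- mean_of(mtcars, mpg, cyl)
```

Here the dots are passed along as they are, so no explicit call to quos() or !!! is needed.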
You can be even more flexible and let the user define the functions that will be used by summarise_at():
nice_function = function(dataset, cols, funcs){
  dataset %>%
    summarise_at(vars(!!!cols), funs(!!!funcs))
}
list_cols = quos(mpg, cyl)
list_funs = quos(mean, sd, sum)
nice_function(mtcars, list_cols, list_funs)
## mpg_mean cyl_mean mpg_sd cyl_sd mpg_sum cyl_sum
## 1 20.09062 6.1875 6.026948 1.785922 642.9 198
In this last example however, the user has to provide a list of quoted variables and functions.
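In dplyr 0.8.0 and later funs() is deprecated, so on a recent installation you may prefer something along the following lines. It is a sketch that assumes dplyr 1.0.0 or later (for across()) and an rlang version that supports the {{ }} operator; summarise_with is a hypothetical name.

```r
library(dplyr)

# Hypothetical helper; assumes dplyr >= 1.0.0 and support for {{ }}
summarise_with <- function(dataset, cols, funcs) {
  dataset %>%
    # {{ cols }} quotes and unquotes the column selection in one step;
    # funcs is an ordinary named list of functions
    summarise(across({{ cols }}, funcs))
}

result <- summarise_with(mtcars, c(mpg, cyl), list(mean = mean, sd = sd, sum = sum))
```

With this variant the user passes the columns unquoted and the functions as a plain named list. The output columns are named in the same way (mpg_mean, cyl_sd and so on), although they come out ordered by column rather than by function.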