Chapter 8 Data products
What you’ll have learned by the end of the chapter: you’ll know how to build data products using Quarto and Shiny.
8.1 Introduction
We are going to start by building data products using Quarto.
Quarto is a tool created by Posit, the company behind RStudio and Shiny. Quarto leverages pandoc to convert between many document formats (for example, from `.md` to `.docx`) and makes it possible to embed R, Python, Julia and Observable JS code into documents. It is not an R-specific tool, so it is a program that you must install on your computer. So go to this link and download and install Quarto.
We are going to start simple, with “static” data products. By static I mean products without any sort of interactivity, so the user can look at them, read them, but not change them in any way. These products are essentially going to be documents in the `.docx`, `.pptx` and `.pdf` formats, but also `.html`. Thanks to Quarto, it is thus possible to programmatically create documents.
8.2 A first taste of Quarto
A Quarto file looks very close to a standard Markdown file. So if you know Markdown, you will not have much trouble switching to Quarto. If you don’t know Markdown, no worries: its syntax is quite simple and can be picked up very quickly.
Let’s start with a basic Quarto source. Open your favorite text editor (it doesn’t have to be RStudio), create a file called `example.qmd` and copy the following lines into it:
---
title: "My Title"
author: "My name"
date: today
---
## This is a simple quarto document
```{r}
n <- 10
rnorm(n)
```
The first few lines of the document are where you can define the title of the document, the name of the author and the date. For the date, I’ve used the `today` keyword to get today’s date, but you could use a string to set the date to a specific day (for example, “2022-10-28”).
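For instance, the header of a document pinned to a specific date would look like this:

```
---
title: "My Title"
author: "My name"
date: "2022-10-28"
---
```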
The content in the document consists of a level 2 title (`## This is a simple quarto document`) and of an R code chunk. Code chunks are where you write the code that then gets evaluated at render (compile) time.
To compile this file, run the following inside a terminal:
```
quarto render example.qmd
```
If you’re inside RStudio, you can also render the document by pressing `CTRL-SHIFT-K` or run the command:

```r
quarto::quarto_render("example.qmd")
```
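If your document declares several formats, you can also render just one of them by passing the `--to` flag to the Quarto command line tool:

```
quarto render example.qmd --to html
```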
There are also various extensions and plugins that integrate Quarto with different editors; take a look at the Quarto documentation for your editor of choice.
Once the file is done rendering, you should find an `html` file in the same folder. Open this html file inside a web browser and see the output. It is possible to run arbitrary R code inside the code chunks:
---
title: "My Title"
author: "My name"
date: today
---
## This is a simple quarto document
```{r}
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
library(myPackage)
data("unemp")
unemp %>%
  janitor::clean_names() %>%
  filter(level == "Commune",
         place_name %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>%
  group_by(place_name) %>%
  nest() %>%
  mutate(plots = map2(.x = data, .y = place_name, ~ggplot(data = .x) +
                        theme_minimal() +
                        geom_line(aes(year, unemployment_rate_in_percent, group = 1)) +
                        labs(title = paste("Unemployment in", .y)))) %>%
  pull(plots)
```
This is what the output looks like.
As you can see, it is quite easy to create a document with potentially hundreds of plots using what we’ve learned until now. However, our document does not look great; for starters, we see the source code, which we would like to hide. People who read this document might not be interested in the source code, but only in the plots. The other issue is that when loading the `{dplyr}` package, users get a message informing them about some functions that get masked. We would like to hide all of this. It turns out that code chunks have options, and we can use them to hide source code and warning messages:
---
title: "My Title"
author: "My name"
date: today
---
## This is a simple quarto document
```{r}
#| echo: false
#| warning: false
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
library(myPackage)
data("unemp")
unemp %>%
  janitor::clean_names() %>%
  filter(level == "Commune",
         place_name %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>%
  group_by(place_name) %>%
  nest() %>%
  mutate(plots = map2(.x = data, .y = place_name, ~ggplot(data = .x) +
                        theme_minimal() +
                        geom_line(aes(year, unemployment_rate_in_percent, group = 1)) +
                        labs(title = paste("Unemployment in", .y)))) %>%
  pull(plots)
```
This is what the output looks like.
Rendering this document will result in something nicer. We could also fold the code instead of removing it completely. This is useful if we need to send the document to collaborators who might be interested in the source code as well. However, code folding only works in `html` outputs, so we need to specify the output format in the header of the document (look at the three new lines after we define the date), and also remove the `echo: false` option from the R chunk:
---
title: "My Title"
author: "My name"
date: today
format:
  html:
    code-fold: true
---
## This is a simple quarto document
```{r}
#| warning: false
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
library(myPackage)
data("unemp")
unemp %>%
  janitor::clean_names() %>%
  filter(level == "Commune",
         place_name %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>%
  group_by(place_name) %>%
  nest() %>%
  mutate(plots = map2(.x = data, .y = place_name, ~ggplot(data = .x) +
                        theme_minimal() +
                        geom_line(aes(year, unemployment_rate_in_percent, group = 1)) +
                        labs(title = paste("Unemployment in", .y)))) %>%
  pull(plots)
```
This is what the output looks like.
It is of course possible to write several R chunks:
---
title: "My Title"
author: "My name"
date: today
format:
  html:
    code-fold: true
---
## This is a simple quarto document
```{r}
#| warning: false
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
library(myPackage)
data("unemp")
unemp <- unemp %>%
  janitor::clean_names() %>%
  filter(level == "Commune")
```
There are `r length(unique(unemp$place_name))` communes in the dataset.
Below we plot the unemployment rate for 3 communes:
```{r}
unemp %>%
  filter(place_name %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>%
  group_by(place_name) %>%
  nest() %>%
  mutate(plots = map2(.x = data, .y = place_name, ~ggplot(data = .x) +
                        theme_minimal() +
                        geom_line(aes(year, unemployment_rate_in_percent, group = 1)) +
                        labs(title = paste("Unemployment in", .y)))) %>%
  pull(plots)
```
This is what the output looks like.
8.2.1 Python and Julia code chunks
It is possible, inside the same Quarto document, to define code chunks that run Python (or even Julia) code. Put the following lines inside a file called `example2.qmd` (to run the example below, you will need to have Python installed):
---
title: "R and Python"
author: "Bruno Rodrigues"
date: today
---
## This is a simple quarto document
```{r}
#| warning: false
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
library(myPackage)
data("unemp")
unemp <- unemp %>%
  janitor::clean_names() %>%
  filter(level == "Commune")
```
```{python}
print("hello from Python")
import sys
print(sys.version)
```
This is what the output looks like.
If you have trouble rendering this document, make sure that you have the `jupyter` and `jupyterlab` modules installed.
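If they are missing, installing them with pip from a terminal is usually enough (assuming a standard Python setup):

```
python -m pip install jupyter jupyterlab
```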
It is also possible to pass objects from R to Python (and vice-versa):
---
title: "R and Python"
author: "Bruno Rodrigues"
date: today
---
## This is a simple quarto document
```{r}
#| warning: false
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
library(myPackage)
data("unemp")
unemp <- unemp %>%
  janitor::clean_names() %>%
  filter(level == "Commune")
```
The data that was loaded and cleaned from R can be accessed from Python using `r.unemp`:
```{python}
import pandas as pd
unemp_pd = pd.DataFrame(r.unemp)
unemp_pd.head()
```
This is what the output looks like.
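Going the other way also works: objects defined in a Python chunk can be accessed from R through the `py` object provided by `{reticulate}` (the package that powers Python chunks in these documents). A minimal sketch:

```{python}
x = 3.14
```

```{r}
# Access the Python variable from R
reticulate::py$x * 2
```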
The HTML output is quite flexible, as it is possible to also integrate JS libraries. The following example uses the `{g2r}` library (an R wrapper around the g2 javascript library) for creating visualisations. To run the following code, make sure that you have the `{g2r}` package installed (it can only be installed from GitHub):
```r
devtools::install_github("devOpifex/g2r")
```
The source file looks like this:
---
title: "Quarto and JS libraries"
author: "My name"
date: today
format:
  html:
    code-fold: true
---
## This is a simple quarto document showing basic plot interactivity using {g2r}
```{r}
#| warning: false
library(dplyr)
library(tidyr)
library(purrr)
library(g2r)
library(myPackage)
data("unemp")
unemp <- unemp %>%
  janitor::clean_names() %>%
  filter(level == "Commune")
```
There are `r length(unique(unemp$place_name))` communes in the dataset. Below we plot the unemployment rate for 3 communes:
```{r}
unemp %>%
  filter(place_name %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>%
  g2(data = .) %>%
  fig_line(asp(year, unemployment_rate_in_percent, color = place_name))
```
This is what the output looks like.
It is possible to use other JS libraries, like here DataTables, wrapped inside the `{DT}` package:
---
title: "Quarto and JS libraries"
author: "My name"
date: today
format:
  html:
    toc: true
    code-fold: true
---
## Basic plot interactivity using {g2r}
```{r}
#| warning: false
library(dplyr)
library(tidyr)
library(purrr)
library(g2r)
library(DT)
library(myPackage)
data("unemp")
unemp <- unemp %>%
  mutate(year = as.character(year)) %>%
  janitor::clean_names() %>%
  filter(level == "Commune")
```
There are `r length(unique(unemp$place_name))` communes in the dataset. Below we plot the unemployment rate for 3 communes:
```{r}
unemp %>%
  filter(place_name %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>%
  g2(data = .) %>%
  fig_line(asp(year, unemployment_rate_in_percent, color = place_name))
```
## Interactive tables with {DT}
```{r}
unemp %>%
  DT::datatable(filter = "top")
```
## Others
You can find more widgets over [here](http://gallery.htmlwidgets.org/).
This is what the output looks like.
The final example illustrates templating. It is possible to write code that generates qmd code:
---
title: "Templating with Quarto"
author: "Bruno Rodrigues"
date: today
format:
  html:
    toc: true
---
# Set up
The goal is to have a frequency table for each question in a survey. But we
do not want to have to do it by hand, so we define a function to create a
table, and then, using the templating capabilities of Quarto, write some
code to generate valid markdown code. In the example below our survey only
has 4 questions, but the solution described trivially scales to an infinity
of questions. This is not the case if you’re solving this problem by hand.
Start by loading the data and defining some needed variables:
```{r}
#| warning: false
library(lubridate)
library(dplyr)
library(purrr)
library(rlang)
library(DT)
survey_data <- read.csv(
"https://gist.githubusercontent.com/b-rodrigues/0c2249dec5a9c9477e0d1ad9964a1340/raw/873bcc7532b8bad613235f029884df1d0b947c90/survey_example.csv"
)
```
Let’s take a look at the data:
```{r}
datatable(survey_data)
```
The column names are actually questions, so we save those in a variable:
```{r}
questions <- colnames(survey_data)
questions
```
Now we define question codes:
```{r}
codes <- paste0("var_", seq(1, 4))
codes
```
We create a lookup table that links questions to their codes:
```{r}
lookup <- bind_cols("codes" = codes, "questions" = questions)
datatable(lookup)
```
Finally, we replace the question names in the dataset by the code:
```{r}
colnames(survey_data) <- codes
datatable(survey_data)
```
Now, we define a function that creates a frequency table. This function has
two arguments: `dataset` and `var`. It uses the `dplyr::count()` function to
count each instance of the levels of `var` in `dataset`. Then it uses the
`knitr::kable()` function. This function takes a data frame as an argument
and returns a table formatted in markdown code:
```{r}
create_table <- function(dataset, var){
  dataset %>%
    count(!!var) %>%
    knitr::kable()
}
```
The next function is the one that does the magic: it takes only one argument
as an input, and generates valid markdown code using the `knitr::knit_expand()`
function. Any variable between `{{}}` gets replaced by its value (so
`{{question}}` gets replaced by the question that gets fetched from the
lookup table defined above). Using this function, we can now loop over
question codes, and what we get in return is valid markdown code that defines
a section with the question as the title, and our table.
```{r}
return_section <- function(var){
  a <- knitr::knit_expand(text = c("## {{question}}", create_table(survey_data, var)),
                          question = lookup$questions[grepl(quo_name(var), lookup$codes)])
  cat(a, sep = "\n")
}
```
Our codes are strings, so to be able to use them inside of `dplyr::count()`
we need to turn them into bare names, or symbols. This can be done using the
`rlang::sym()` function. If this is confusing, try running `count(mtcars, "am")`
and you will see that it will not return what you want (compare to `count(mtcars, am)`).
This is also why we needed `rlang::quo_name()` in the function above, to convert
the symbol back to a string, which is what `grepl()` requires:
```{r}
sym_codes <- map(codes, sym)
```
Finally, we can create the sections. The line below uses `purrr::walk()`, which
is equivalent to `purrr::map()`, the difference being that we use `purrr::walk()`
when we are interested in the side effects of a function:
```{r, results="asis"}
walk(sym_codes, return_section)
```
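To see what `knitr::knit_expand()` does in isolation, here is a tiny, standalone example (the `question` placeholder name is just for illustration):

```{r}
knitr::knit_expand(text = "## {{question}}",
                   question = "What is your favorite colour?")
# Returns the string "## What is your favorite colour?"
```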
8.3 Other output formats
8.3.1 Word
Let’s now generate a Word document using Quarto. As you will see, this will be quite easy; but keep in mind that the basic interactivity that we have seen with HTML outputs won’t be possible here (but templating will work). Render the following source file to get back a `.docx` document (you don’t even need to have MS Word installed for it to work), and take note of what we changed from the previous file:

- Output changed from `html` to `docx`;
- No more `{DT}`, but `{pander}` instead to generate `.docx` tables.

Here is the file:
---
title: "Templating with Quarto"
author: "Bruno Rodrigues"
date: today
format: docx
---
# Set up
The goal is to have a frequency table for each question in a survey. But we
do not want to have to do it by hand, so we define a function to create a
table, and then, using the templating capabilities of Quarto, write some
code to generate valid markdown code. In the example below our survey only
has 4 questions, but the solution described trivially scales to an infinity
of questions. This is not the case if you’re solving this problem by hand.
Start by loading the data and defining some needed variables:
```{r}
#| warning: false
library(lubridate)
library(dplyr)
library(purrr)
library(pander)
library(rlang)
survey_data <- read.csv(
"https://gist.githubusercontent.com/b-rodrigues/0c2249dec5a9c9477e0d1ad9964a1340/raw/873bcc7532b8bad613235f029884df1d0b947c90/survey_example.csv"
)
```
Let’s take a look at the data:
```{r}
pander(head(survey_data))
```
The column names are actually questions, so we save those in a variable:
```{r}
questions <- colnames(survey_data)
questions
```
Now we define question codes:
```{r}
codes <- paste0("var_", seq(1, 4))
codes
```
We create a lookup table that links questions to their codes:
```{r}
lookup <- bind_cols("codes" = codes, "questions" = questions)
pander(lookup)
```
Finally, we replace the question names in the dataset by the code:
```{r}
colnames(survey_data) <- codes
pander(survey_data)
```
Now, we define a function that creates a frequency table. This function has
two arguments: `dataset` and `var`. It uses the `dplyr::count()` function to
count each instance of the levels of `var` in `dataset`. Then it uses the
`knitr::kable()` function. This function takes a data frame as an argument
and returns a table formatted in markdown code:
```{r}
create_table <- function(dataset, var){
  dataset %>%
    count(!!var) %>%
    knitr::kable()
}
```
The next function is the one that does the magic: it takes only one argument
as an input, and generates valid markdown code using the `knitr::knit_expand()`
function. Any variable between `{{}}` gets replaced by its value (so
`{{question}}` gets replaced by the question that gets fetched from the
lookup table defined above). Using this function, we can now loop over
question codes, and what we get in return is valid markdown code that defines
a section with the question as the title, and our table.
```{r}
return_section <- function(var){
  a <- knitr::knit_expand(text = c("## {{question}}", create_table(survey_data, var)),
                          question = lookup$questions[grepl(quo_name(var), lookup$codes)])
  cat(a, sep = "\n")
}
```
Our codes are strings, so to be able to use them inside of `dplyr::count()`
we need to turn them into bare names, or symbols. This can be done using the
`rlang::sym()` function. If this is confusing, try running `count(mtcars, "am")`
and you will see that it will not return what you want (compare to `count(mtcars, am)`).
This is also why we needed `rlang::quo_name()` in the function above, to convert
the symbol back to a string, which is what `grepl()` requires:
```{r}
sym_codes <- map(codes, sym)
```
Finally, we can create the sections. The line below uses `purrr::walk()`, which
is equivalent to `purrr::map()`, the difference being that we use `purrr::walk()`
when we are interested in the side effects of a function:
```{r, results="asis"}
walk(sym_codes, return_section)
```
You can download the output here.
Unlike with HTML outputs, it is not possible to enable code folding, but you could hide the code completely using the `#| echo: false` chunk option. If you want to hide all the code without having to specify `#| echo: false` on each chunk, you can also add the `execute` option to the document header:
---
title: "Templating with Quarto"
author: "Bruno Rodrigues"
date: today
format: docx
execute:
  echo: false
---
You can use a document as a template for Word documents generated with Quarto. For this, you must create a new Word file, and update the styles. This document, with the updated styles, can then be referenced in the header to act as a template:
---
title: "Using a custom Word style"
author: "Bruno Rodrigues"
date: today
format:
  docx:
    reference-doc: fancy_template.docx
execute:
  echo: false
---
# Introduction
## MS Word is great (lol)
This is normal text that is unreadable.
Just put `fancy_template.docx` in the same folder as your source qmd file. You can download the template I’ve used from here to test things out. For more details, visit this page.
8.3.2 Presentations
It is also possible to create presentations using Quarto. There are several output formats: HTML, PDF and PowerPoint. I will not discuss this further here, because it is quite easy to get started: simply follow the documentation.
8.3.3 PDF
I do need to discuss the PDF output a little bit. In order to generate PDF files, Quarto uses the `pdflatex` compiler (or rather pandoc, called by Quarto, uses `pdflatex`). `pdflatex` compiles `.tex` source files to PDF, so what Quarto does (by leveraging pandoc) is first convert a `.qmd` file to a `.tex` file, and then call `pdflatex` to compile it. `.tex` is the file extension of the LaTeX typesetting language, extensively used in science. It makes it easy to write complex mathematical formulas, like this one:
\begin{align*}
S(\omega) &= \frac{\alpha g^2}{\omega^5} e^{[ -0.74\bigl\{\frac{\omega U_\omega 19.5}{g}\bigr\}^{\!-4}\,]} \\
&= \frac{\alpha g^2}{\omega^5} \exp\Bigl[ -0.74\Bigl\{\frac{\omega U_\omega 19.5}{g}\Bigr\}^{\!-4}\,\Bigr]
\end{align*}
LaTeX is a bit unwieldy, so using Markdown to write scientific documents is becoming more and more popular. However, LaTeX still has an edge when it comes to tables. But thankfully, it is possible to simply embed the LaTeX code that produces these tables in Markdown, and there are packages that export regression tables directly to PDF.

In any case, in order to compile to PDF, you need to install TeX Live. Installing TeX Live is frankly a mess, but thankfully there is a very simple alternative called TinyTeX. TinyTeX is available both as an R package and as a standalone installation, and was put together by the author of RMarkdown (in a sense, Quarto is a spiritual successor to RMarkdown). This package installs a self-contained TeX Live installation locally, which can then be used to compile PDF documents (from, or outside of, R/RStudio). I highly recommend you use TinyTeX. Instructions can be found here.
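If you want to install TinyTeX from R, it boils down to two calls (these are the documented `{tinytex}` functions):

```r
install.packages("tinytex")
tinytex::install_tinytex()
```

Once you’ve installed TinyTeX, you can try to compile the following example document (the first time you run this, it might take some time, as the required packages get installed):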
---
title: "PDF example with table"
format: pdf
---
## A PDF document using Quarto
In the code below, we fit several models and then use the `{modelsummary}`
package to print a nicely formatted table with minimal effort:
```{r}
library(modelsummary)
url <- 'https://vincentarelbundock.github.io/Rdatasets/csv/HistData/Guerry.csv'
dat <- read.csv(url)
models <- list(
  "OLS 1" = lm(Donations ~ Literacy + Clergy, data = dat),
  "Poisson 1" = glm(Donations ~ Literacy + Commerce, family = poisson, data = dat),
  "OLS 2" = lm(Crime_pers ~ Literacy + Clergy, data = dat),
  "Poisson 2" = glm(Crime_pers ~ Literacy + Commerce, family = poisson, data = dat),
  "OLS 3" = lm(Crime_prop ~ Literacy + Clergy, data = dat)
)
modelsummary(models)
```
And an equation, for good measure:
\begin{align*}
S(\omega)
&= \frac{\alpha g^2}{\omega^5} e^{[ -0.74\bigl\{\frac{\omega U_\omega 19.5}{g}\bigr\}^{\!-4}\,]} \\
&= \frac{\alpha g^2}{\omega^5} \exp\Bigl[ -0.74\Bigl\{\frac{\omega U_\omega 19.5}{g}\Bigr\}^{\!-4}\,\Bigr]
\end{align*}
This is what the output looks like (scroll down to page 2).
It is possible to author many, many different types of documents using Quarto. For more formats, consult this page. Quarto is still very new – it was officially announced in July of 2022 by Posit – so much more content will arrive. There are still many features of Quarto that we have not explored, so take your time to read its documentation in detail.
8.4 Interactive web applications with {shiny}
`{shiny}` is a package developed by Posit to build interactive web applications. These apps can be quite “simple” (for example, an app that shows a graph but in which the user can choose the variable to plot), but can be arbitrarily complex. Some people even go as far as making games with `{shiny}`. A version for Python is also in alpha, and you can already experiment with it.

In this section, I will give a very, very short introduction to `{shiny}`. This is because `{shiny}` is so feature-rich that I could spend 20 hours teaching you and even then we would not have seen everything. That being said, with only some cursory knowledge we can build some useful apps. These apps can run locally on your machine, but they’re really only useful if you deploy them on a server, so that users can then use these web apps in their browsers.
8.4.1 The basic structure of a Shiny app
Shiny apps are always made of at least 2 parts: a server and a ui. In general, each of these parts lives in a separate script, called `server.R` and `ui.R` respectively. It is possible to have another script, called `global.R`, where you can define variables that you want to be available for both the server and the ui, and to every user of your app.
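For very small apps, everything can also live in a single `app.R` script; here is a minimal sketch of that structure (not our app, just the skeleton):

```r
library(shiny)

ui <- fluidPage(
  "Hello, Shiny!"
)

server <- function(input, output, session) {
  # nothing to compute yet
}

shinyApp(ui = ui, server = server)
```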
Let’s start by building a very basic app. This app will allow users to visualize unemployment data for Luxembourg. For now, let’s say that we want users to only be able to select communes, but not variables. The example code below is based on this official example (this is how I recommend you learn, by the way: take a look at the different examples there are and adapt them to suit your needs! You can find the examples here).
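The examples bundled with `{shiny}` can also be run locally:

```r
# Runs one of the built-in example apps in your browser
shiny::runExample("01_hello")
```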
Create a folder called something like `my_app` and then create three scripts in it:

- global.R
- server.R
- ui.R

Let’s start with `global.R`:
```r
library(myPackage)
library(dplyr)
library(ggplot2)

data("unemp")
```
In the `global.R` file, we load the required packages and data. This is now available everywhere. Let’s continue with the `server.R` script:
```r
server <- function(session, input, output) {

  filtered_data <- reactive(
    unemp %>%
      filter(place_name %in% input$place_name_selected)
  )

  output$unemp_plot <- renderPlot({
    ggplot(data = filtered_data()) +
      theme_minimal() +
      geom_line(aes(year, unemployment_rate_in_percent, color = place_name)) +
      labs(title = paste("Unemployment in", paste(input$place_name_selected, collapse = ", ")))
  })
}
```
Several things need to be commented on here: first, the script contains a single function, called `server()`. This function takes three arguments: `session`, `input` and `output`. I won’t go into details here, but you should know that you will never call the `server()` function yourself, and that these arguments are required so the function can… function. I will leave a reference at the end of this section with more details. The next important thing is that we defined an object called `filtered_data`. This is a reactive object. What this means is that this object should get recomputed every time the user interacts with it. But how does the user interact with it? By choosing the `place_name` he or she wants to see! The predicate inside `filter()` is `place_name %in% input$place_name_selected`. Where does that `input$place_name_selected` come from? This comes from the `ui` (which we have not written yet). But the idea is that the user will be able to choose place names from a list, and this list will be called `place_name_selected` and will contain the place names that the user wants to see.
Finally, we define a new object called `output$unemp_plot`. The goal of the `server()` function is to compute things that will be part of the `output` list. This list, and the objects it contains, then get rendered in the `ui`. `unemp_plot` is a ggplot graph that uses the reactive data set we defined first. Notice the `()` after `filtered_data` inside the ggplot call. These are required; this is how we say that the reactive object must be recomputed. If the plot does not get rendered, the reactive data set does not get computed, since it never gets called.
OK, so now on to the ui. Let’s take inspiration from the same example again:
```r
ui <- function(request){
  fluidPage(
    titlePanel("Unemployment in Luxembourg"),
    sidebarLayout(
      sidebarPanel(
        selectizeInput("place_name_selected", "Select place:",
                       choices = unique(unemp$place_name),
                       multiple = TRUE,
                       selected = c("Rumelange", "Dudelange"),
                       options = list(
                         plugins = list("remove_button"),
                         create = TRUE,
                         persist = FALSE # keep created choices in dropdown
                       )
        ),
        hr(),
        helpText("Original data from STATEC")
      ),
      mainPanel(
        plotOutput("unemp_plot")
      )
    )
  )
}
```
I’ve added some useful things to the ui. First of all, I made it a function of an argument, `request`. This is useful for bookmarking the state of the app; we’ll add a bookmark button later. The ui is divided into two parts: a sidebar panel, and a main panel. The sidebar panel is where you will typically add dropdown menus, checkboxes, radio buttons, etc., for the users to make various selections. In the main panel, you will show the result of their selections. In the sidebar panel I add a `selectizeInput()` to create a dynamic dropdown list using the selectize JS library, included with `{shiny}`. The available choices are all the unique place names contained in our data; I allow users to select multiple place names; by default two communes are selected; and using the `options` argument I add little “remove” buttons next to the selected commune names. Finally, in the main panel I use the `plotOutput()` function to render the plot. Notice that I use the name of the plot defined in the server, “unemp_plot”. Finally, to run this, add a new script called `app.R` and add the following line in it:
```r
shiny::runApp(".")
```
You can now run this script in RStudio, or from any R console, and this should open a web browser with your app. Believe it or not, this app contains almost every ingredient you need to know to build shiny apps. But of course, there are many, many other widgets that you can use to give your users even more ways to interact with applications.
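To give you an idea, here are a few other input widgets that ship with `{shiny}` (the IDs and labels below are made up for illustration; each would be read in the server through `input$`):

```r
# A slider to select a range of years
sliderInput("year_range", "Years:", min = 2001, max = 2019, value = c(2010, 2019))

# A single on/off checkbox
checkboxInput("show_points", "Show points", value = FALSE)

# Free text input
textInput("plot_title", "Plot title:", value = "Unemployment")
```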
8.4.2 Slightly more advanced shiny
Let’s take a look at another, more complex example. Because this second example is much more involved, let’s first watch a video of the app in action:
The global file will be almost the same as before:
```r
library(myPackage)
library(dplyr)
library(ggplot2)
library(g2r)

data("unemp")

enableBookmarking(store = "url")
```
The only difference is that I load the `{g2r}` package to create a nice interactive plot, and enable bookmarking of the state of the app using `enableBookmarking(store = "url")`. Let’s move on to the ui:
```r
ui <- function(request){
  fluidPage(
    titlePanel("Unemployment in Luxembourg"),
    sidebarLayout(
      sidebarPanel(
        selectizeInput("place_name_selected", "Select place:",
                       choices = unique(unemp$place_name),
                       multiple = TRUE,
                       selected = c("Rumelange", "Dudelange"),
                       options = list(
                         plugins = list("remove_button"),
                         create = TRUE,
                         persist = FALSE # keep created choices in dropdown
                       )
        ),
        hr(),
        # To allow users to select the variable, we add a selectInput
        # (not selectizeInput, like above)
        # don’t forget to use input$variable_selected in the
        # server function later!
        selectInput("variable_selected", "Select variable to plot:",
                    choices = setdiff(unique(colnames(unemp)), c("year", "place_name", "level")),
                    multiple = FALSE,
                    selected = "unemployment_rate_in_percent"
        ),
        hr(),
        # Just for illustration purposes, these radioButtons will not be bound
        # to the actionButton.
        radioButtons(inputId = "legend_position",
                     label = "Choose legend position",
                     choices = c("top", "bottom", "left", "right"),
                     selected = "right",
                     inline = TRUE),
        hr(),
        actionButton(inputId = "render_plot",
                     label = "Click here to generate plot"),
        hr(),
        helpText("Original data from STATEC"),
        hr(),
        bookmarkButton()
      ),
      mainPanel(
        # We add a tabsetPanel with two tabs. The first tab shows
        # the plot made using ggplot, the second tab
        # shows the plot made using g2r
        tabsetPanel(
          tabPanel("ggplot version", plotOutput("unemp_plot")),
          tabPanel("g2r version", g2Output("unemp_plot_g2r"))
        )
      )
    )
  )
}
```
There are many new things; everything is explained in the comments within the script itself, so take a look at them. What’s important to notice is that I now added two buttons: an action button, and a bookmark button. The action button will be used to draw the plots. This means that the user will choose the options for the plot, and the plot will only appear once the user clicks on the button. This is quite useful in cases where computations take time to run, and you don’t want every reactive object to get recomputed as soon as the user interacts with the app. This way, only once every selection has been made does the user give the green light to the app to compute everything.
At the bottom of the ui you’ll see that I’ve added a `tabsetPanel()` with some `tabPanel()`s. This is where the graphs “live”. Let’s move on to the server script:
```r
server <- function(session, input, output) {

  # Because I want the plots to only render once the user clicks the
  # actionButton, I need to move every interactive, or reactive, element into
  # an eventReactive() function. eventReactive() waits for something to "happen"
  # in order to let the reactive variables run. If you don’t do that, then
  # when the user interacts with the app, these reactive variables will run,
  # which we do not want.

  # Data only gets filtered once the user clicks on the actionButton
  filtered_data <- eventReactive(input$render_plot, {
    unemp %>%
      filter(place_name %in% input$place_name_selected)
  })

  # The variable the user selects gets passed down to the plot only once the
  # user clicks on the actionButton.
  # If you don’t do this, what will happen is that the variable will update
  # the plot even when the user does not click on the actionButton
  variable_selected <- eventReactive(input$render_plot, {
    input$variable_selected
  })

  # The plot title only gets generated once the user clicks on the actionButton
  # If you don’t do this, what will happen is that the title of the plot will
  # get updated even when the user does not click on the actionButton
  plot_title <- eventReactive(input$render_plot, {
    paste(variable_selected(), "for", paste(input$place_name_selected, collapse = ", "))
  })

  output$unemp_plot <- renderPlot({
    ggplot(data = filtered_data()) +
      theme_minimal() +
      # Because the selected variable is a string, we need to convert it to a
      # symbol using rlang::sym() and evaluate it using !!. This is because the
      # aes() function expects bare variable names, and not strings.
      # Because this is something that developers have to do often in shiny
      # apps, there is a version of aes() that works with strings, called
      # aes_string(). You can use both approaches interchangeably.
      #geom_line(aes(year, !!rlang::sym(variable_selected()), color = place_name)) +
      geom_line(aes_string("year", variable_selected(), color = "place_name")) +
      labs(title = plot_title()) +
      theme(legend.position = input$legend_position)
  })

  output$unemp_plot_g2r <- renderG2({
    g2(data = filtered_data()) %>%
      # g2r’s asp() requires bare variable names
      fig_line(asp(year, !!rlang::sym(variable_selected()), color = place_name)) %>%
      # For some reason, the title does not show...
      subject(plot_title()) %>%
      legend_color(position = input$legend_position)
  })
}
```
What’s new here is that I now must redefine the reactive objects in such a way that they only get run once the user clicks the button. This is why every reactive object (but one, the position of the legend) is now wrapped by `eventReactive()`. `eventReactive()` waits for a trigger, in this case the clicking of the action button, to run the reactive object. `eventReactive()` takes the action button ID as an input. I’ve also defined the plot title as a reactive value, not only the dataset as before, because if I didn’t, the title of the plot would get updated as the user chose other communes, but the contents of the plot, which depend on the data, would not. To avoid the title and the plot getting desynced, I need to also wrap the title in `eventReactive()`. You can see this behaviour by changing the legend position: the legend position gets updated without the user needing to click the button. This is because I have not wrapped the legend position inside `eventReactive()`.
Finally, I keep the `{ggplot2}` graph, but also remake it using `{g2r}`, to illustrate how it works inside a Shiny app.
To conclude this section, we will take a look at one last app. This app will allow users to do data aggregation on a relatively large dataset, so computations will take some time to run. The app will illustrate how best to deal with this.
8.4.3 Basic optimization of Shiny apps
The app we will build now requires what is sometimes referred to as medium-sized data: data that is far from being big data, but already big enough that handling it requires some thought. What we want to do is build an app that will allow users to do some aggregations on this data. Because the size of the data is not trivial, these computations will take some time to run, so we need to think about certain strategies to avoid frustrating our users.

The file we will be using can be downloaded from here. We’re not going to use the exact same data set though; I have prepared a smaller version that will be more than enough for our purposes. But the strategies that we are going to implement here will also work for the original, much larger, dataset. You can get the smaller version here. Uncompressed it’ll be a 2.4GB file. Not big data in any sense, but big enough to be annoying to handle without the use of some optimization strategies.
One such strategy is only letting the computations run once the user gives the green light by clicking on an action button. This is what we have seen in the previous example. The next obvious strategy is to use packages that are optimized for speed. It turns out that the functions we have seen until now, from packages like `{dplyr}` and the like, are not the fastest. Their ease of use and expressiveness come at a speed cost. So we will need to switch to something faster. We will do the same to read in the data.
This faster solution is the `{arrow}` package, which is an interface to the Arrow software developed by Apache. The final strategy is to enable caching in the app.
So first, install the `{arrow}` package by running `install.packages("arrow")`. This will compile `libarrow` from source on Linux and might take some time, so perhaps go grab a coffee.
Before building the app, let me perform a very simple benchmark. The script below reads in the data, then performs some aggregations. This is done using standard `{tidyverse}` functions, but also using `{arrow}`:
```r
start_tidy <- Sys.time()
# {vroom} is able to read in larger files than {readr}
# I could not get this file into R using readr::read_csv:
# my RAM would get maxed out
air <- vroom::vroom("data/combined")

mean_dep_delay <- air |>
  dplyr::group_by(Year, Month, DayofMonth) |>
  dplyr::summarise(mean_delay = mean(DepDelay, na.rm = TRUE))
end_tidy <- Sys.time()

time_tidy <- end_tidy - start_tidy

start_arrow <- Sys.time()
air <- arrow::open_dataset("data/combined", format = "csv")

mean_dep_delay <- air |>
  dplyr::group_by(Year, Month, DayofMonth) |>
  dplyr::summarise(mean_delay = mean(DepDelay, na.rm = TRUE))
end_arrow <- Sys.time()

end_tidy - start_tidy
end_arrow - start_arrow
```
The “tidy” approach took 17 seconds, while the arrow approach took 6 seconds. This is an impressive improvement, but put yourself in the shoes of a user who has to wait 6 seconds for each query. That would get very annoying, very quickly. So the other strategy that we will use is to provide some visual cue that computations are running, and then we will go one step further and use caching of results in the Shiny app.
But before we continue, you may be confused by the code above. After all, I told you before that functions from `{dplyr}` and the like were not the fastest, and yet I am using them in the arrow approach as well, and they now run almost 3 times as fast. What’s going on? What’s happening here is that the `air` object that we read using `arrow::open_dataset()` is not a dataframe, but an arrow dataset. These are special, and work in a different way. But that’s not what’s important: what’s important is that the `{dplyr}` API can be used to work with these arrow datasets. This means that functions from `{dplyr}` change the way they work depending on the type of the object they’re dealing with. If it’s a good old regular data frame, some C++ code gets called to perform the computations. If it’s an arrow dataset, `libarrow` and its black magic get called instead to perform the computations. If you’re familiar with the concept of polymorphism, this is it (think of `+` in Python: `1+1` returns `2`, `"a"+"b"` returns `"ab"`. A different computation gets performed depending on the type of the function’s inputs).
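R relies on the same idea: generic functions dispatch on the class of their input. A quick base-R illustration:

```r
# summary() calls a different method depending on the class of its argument
summary(c(1, 2, 3))          # summary.default(): min, quartiles, max...
summary(factor(c("a", "b"))) # summary.factor(): counts per level
```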
Let’s now build a basic version of the app, only using `{arrow}` functions for speed. This is the global file:
```r
library(arrow)
library(dplyr)
library(rlang)
library(DT)

air <- arrow::open_dataset("data/combined", format = "csv")
```
The ui will be quite simple:
```r
ui <- function(request){
  fluidPage(
    titlePanel("Air On Time data"),
    sidebarLayout(
      sidebarPanel(
        selectizeInput("group_by_selected", "Variables to group by:",
                       choices = c("Year", "Month", "DayofMonth", "Origin", "Dest"),
                       multiple = TRUE,
                       selected = c("Year", "Month"),
                       options = list(
                         plugins = list("remove_button"),
                         create = TRUE,
                         persist = FALSE # keep created choices in dropdown
                       )
        ),
        hr(),
        selectizeInput("var_to_average", "Select variable to average by groups:",
                       choices = c("ArrDelay", "DepDelay", "Distance"),
                       multiple = FALSE,
                       selected = "DepDelay"
        ),
        hr(),
        actionButton(inputId = "run_aggregation",
                     label = "Click here to run aggregation"),
        hr(),
        bookmarkButton()
      ),
      mainPanel(
        DTOutput("result")
      )
    )
  )
}
```
And finally the server:
```r
server <- function(session, input, output) {

  # Numbers get crunched only when the user clicks on the action button
  grouped_data <- eventReactive(input$run_aggregation, {
    air %>%
      group_by(!!!syms(input$group_by_selected)) %>%
      summarise(result = mean(!!sym(input$var_to_average),
                              na.rm = TRUE)) %>%
      as.data.frame()
  })

  output$result <- renderDT({
    grouped_data()
  })
}
```
Because `group_by()` and `mean()` expect bare variable names, I convert them from strings to symbols using `rlang::syms()` and `rlang::sym()`. The difference between the two is that `rlang::syms()` is required when a list of strings gets passed down to the function (remember that the user may select several variables to group by), and this is also why `!!!` is needed (to unquote the list of symbols). Finally, the computed data must be converted back to a data frame using `as.data.frame()`. This is actually when the computations happen: `{arrow}` collects all the aggregations but does not perform anything until absolutely required.
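If the tidy evaluation part is hard to picture, here is the same pattern outside of Shiny, with plain character vectors standing in for the inputs (using `mtcars`, so this snippet is self-contained):

```r
library(dplyr)
library(rlang)

# Stand-ins for input$group_by_selected and input$var_to_average
selected_groups <- c("cyl", "am")
selected_var <- "mpg"

mtcars %>%
  group_by(!!!syms(selected_groups)) %>%   # several strings -> symbols
  summarise(result = mean(!!sym(selected_var), na.rm = TRUE))
```

Let’s see the app in action: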
As you can see, in terms of User Experience (UX) this is quite poor. When the user clicks on the button nothing seems to be going on for several seconds, until the table appears. Then, when the user changes some options and clicks again on the action button, it looks like the app is crashing.
Let’s add some visual cues to indicate to the user that something is happening when the button gets clicked. For this, we are going to use the `{shinycssloaders}` package:
```r
install.packages("shinycssloaders")
```
and simply change the ui to this (and don’t forget to load `{shinycssloaders}` in the global script!):
```r
ui <- function(request){
  fluidPage(
    titlePanel("Air On Time data"),
    sidebarLayout(
      sidebarPanel(
        selectizeInput("group_by_selected", "Variables to group by:",
                       choices = c("Year", "Month", "DayofMonth", "Origin", "Dest"),
                       multiple = TRUE,
                       selected = c("Year", "Month"),
                       options = list(
                         plugins = list("remove_button"),
                         create = TRUE,
                         persist = FALSE # keep created choices in dropdown
                       )
        ),
        hr(),
        selectizeInput("var_to_average", "Select variable to average by groups:",
                       choices = c("ArrDelay", "DepDelay", "Distance"),
                       multiple = FALSE,
                       selected = "DepDelay"
        ),
        hr(),
        actionButton(inputId = "run_aggregation",
                     label = "Click here to run aggregation"),
        hr(),
        bookmarkButton()
      ),
      mainPanel(
        # The table gets wrapped inside withSpinner(), so that a loading
        # animation is shown while the result is being computed
        DTOutput("result") |>
          withSpinner()
      )
    )
  )
}
```
The only difference with before is that now the `DTOutput()` right at the end gets passed down to `withSpinner()`. There are several spinners that you can choose from, but let’s simply use the default one. This is how the app looks now:
Now the user gets a visual cue that something is happening. This makes waiting more bearable, but even better than waiting with a spinner is not waiting at all. For this, we are going to enable caching of results. There are several ways that you can cache results inside your app. You can enable the cache on a per-user and per-session basis, or only on a per-user basis. But I think that in our case, the ideal caching strategy is to keep the cache persistent, and available across sessions. This means that each computation done by any user will get cached and be available to any other user. In order to achieve this, you simply have to install the `{cachem}` package and add the following lines to the global script:
```r
shinyOptions(cache = cachem::cache_disk("./app-cache",
                                        max_age = Inf))
```
By setting the `max_age` argument to `Inf`, the cache will never get pruned. The maximum size of the cache, by default, is 1GB. You can of course increase it.
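For example, to allow a bigger cache, you can set `max_size` (in bytes; this is a documented argument of `cachem::cache_disk()`):

```r
shinyOptions(cache = cachem::cache_disk("./app-cache",
                                        max_size = 5 * 1024^3, # 5GB
                                        max_age = Inf))
```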
Now, you must also edit the server file like so:
```r
server <- function(session, input, output) {

  # Numbers get crunched only when the user clicks on the action button
  grouped_data <- reactive({
    air %>%
      group_by(!!!syms(input$group_by_selected)) %>%
      summarise(result = mean(!!sym(input$var_to_average),
                              na.rm = TRUE)) %>%
      as.data.frame()
  }) %>%
    bindCache(input$group_by_selected,
              input$var_to_average) %>%
    bindEvent(input$run_aggregation)

  output$result <- renderDT({
    grouped_data()
  })
}
```
We’ve had to change `eventReactive()` to `reactive()`, just like in the app where we don’t use an action button to run computations. Then, we pass the reactive object to `bindCache()`. `bindCache()` also takes the inputs as arguments. These are used to generate cache keys to retrieve the correct objects from the cache. Finally, we pass all this to `bindEvent()`. This function takes the input referencing the action button; this is how we can bind the computations to the button once again. Let’s test our app now. You will notice that the first time we choose certain options, the computations will take time, as before. But if we perform the same computations again, the results will be shown instantly:
As you can see, once I go back to a computation that was done in the past, the table appears instantly. At the end of the video I open a terminal, navigate to the directory of the app, and show you the cache. There are several `.Rds` objects; these are the final data frames that get computed by the app. If the user wants to rerun a previous computation, the correct data frame gets retrieved, making it look like the computation happened instantly, and with another added benefit: as discussed above, the cache is persistent between sessions, so even if the user closes the browser and comes back later, the cache is still there, and other users will also benefit from it.
8.4.4 Deploying your shiny app
The easiest way is certainly to use shinyapps.io. I won’t go into details, but you can read more about it here. You could also get a Virtual Private Server on a website like Vultr or DigitalOcean. When signing up with these services you get some free credit to test things out. If you use my Vultr referral link you get 100USD to test the platform. This is more than enough to get a basic VPS with Ubuntu on it. You can then try to install everything needed to deploy Shiny apps from your VPS. You could follow this guide to deploy from DigitalOcean, which should generalize well to other services like Vultr. Doing this will teach you a lot, and I would highly recommend you do it.
8.5 How to build data products using {targets}
We will now put everything together and create a `{targets}` pipeline to build a data product from start to finish. Let’s go back to one of the pipelines we wrote in Chapter 7. If you’re using RStudio, start a new project and make it `renv`-enabled by checking the required checkbox. If you’re using another editor, start with an empty folder and run `renv::init()`. Now create a new script with the following code:
```r
library(targets)
library(myPackage)
library(dplyr)
library(ggplot2)

source("functions.R")

list(
  tar_target(
    unemp_data,
    get_data()
  ),
  tar_target(
    lux_data,
    clean_unemp(unemp_data,
                place_name_of_interest = "Luxembourg",
                level_of_interest = "Country",
                col_of_interest = active_population)
  ),
  tar_target(
    canton_data,
    clean_unemp(unemp_data,
                level_of_interest = "Canton",
                col_of_interest = active_population)
  ),
  tar_target(
    commune_data,
    clean_unemp(unemp_data,
                place_name_of_interest = c("Luxembourg", "Dippach", "Wiltz", "Esch/Alzette", "Mersch"),
                col_of_interest = active_population)
  ),
  tar_target(
    lux_plot,
    make_plot(lux_data)
  ),
  tar_target(
    canton_plot,
    make_plot(canton_data)
  ),
  tar_target(
    commune_plot,
    make_plot(commune_data)
  )
)
```
This pipeline reads in data, then filters it and produces some plots. In another version of this pipeline we wrote the plots to disk. Now we will add them to a Quarto document, using the `tar_quarto()` function that can be found in the `{tarchetypes}` package (so install it if that is not the case yet). `{tarchetypes}` provides functions to define further types of targets, such as `tar_quarto()`, which makes it possible to render Quarto documents from a `{targets}` pipeline. But before rendering a document, we need to write it. This is what the document could look like:
---
title: "Reading objects from a targets pipeline"
author: "Bruno Rodrigues"
date: today
---
This document loads three plots that were made using a `{targets}` pipeline.
```{r}
targets::tar_read(lux_plot)
```
```{r}
targets::tar_read(canton_plot)
```
```{r}
targets::tar_read(commune_plot)
```
Here is what the final pipeline would look like (notice that I’ve added `library(quarto)` to the list of packages getting called):
```r
library(targets)
library(tarchetypes)
library(myPackage)
library(dplyr)
library(ggplot2)
library(quarto)

source("functions.R")

list(
  tar_target(
    unemp_data,
    get_data()
  ),
  tar_target(
    lux_data,
    clean_unemp(unemp_data,
                place_name_of_interest = "Luxembourg",
                level_of_interest = "Country",
                col_of_interest = active_population)
  ),
  tar_target(
    canton_data,
    clean_unemp(unemp_data,
                level_of_interest = "Canton",
                col_of_interest = active_population)
  ),
  tar_target(
    commune_data,
    clean_unemp(unemp_data,
                place_name_of_interest = c("Luxembourg", "Dippach", "Wiltz", "Esch/Alzette", "Mersch"),
                col_of_interest = active_population)
  ),
  tar_target(
    lux_plot,
    make_plot(lux_data)
  ),
  tar_target(
    canton_plot,
    make_plot(canton_data)
  ),
  tar_target(
    commune_plot,
    make_plot(commune_data)
  ),
  tar_quarto(
    my_doc,
    path = "my_doc.qmd"
  )
)
```
Make sure that this pipeline runs using `tar_make()`. If it does, and you’re done with it, don’t forget to run `renv::snapshot()` to save the project’s dependencies in a lock file. Again, take a look at the lock file to make extra sure that your package is correctly being versioned. As a reminder, you should see something like this:
"myPackage": {
"Package": "myPackage",
"Version": "0.1.0",
"Source": "GitHub",
"RemoteType": "github",
"RemoteHost": "api.github.com",
"RemoteRepo": "myPackage",
"RemoteUsername": "b-rodrigues",
"RemoteRef": "e9d9129de3047c1ecce26d09dff429ec078d4dae",
"RemoteSha": "e9d9129de3047c1ecce26d09dff429ec078d4dae",
"Hash": "4740b43847e10e012bad2b8a1a533433",
"Requirements": [
"dplyr",
"janitor",
"rlang"
]
},
What’s really important is that you find the “RemoteXXXX” fields. We are now ready to push this project to github.com. Don’t forget to first edit the `.gitignore` file and add the `renv` folder to it. This is the folder that contains the downloaded packages, and it can get quite big, so it is better not to push it. We are now done with building an almost 100% reproducible pipeline!
If your product is a Shiny app, you may want to put as many computations as possible in the `{targets}` pipeline. You can then use `tar_load()` or `tar_read()` inside the `global.R` file.
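A sketch of what such a `global.R` could look like, assuming the app lives next to the `_targets/` store created by the pipeline:

```r
library(targets)

# Read pre-computed objects from the {targets} store instead of
# recomputing them every time the app starts
commune_data <- tar_read(commune_data)
commune_plot <- tar_read(commune_plot)
```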