A real world example • chronicler

library(chronicler)

This is practically the same code you can find on this blog post of mine: https://www.brodrigues.co/blog/2018-11-14-luxairport/ but with some minor updates to reflect the current state of the tidyverse packages as well as logging using {chronicler}.

Let’s first load the required packages, and the avia dataset included in the {chronicler} package:

library(chronicler)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:chronicler':
#> 
#>     pick
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(stringr)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

# Ensure chronicler version of `pick()` is being used
pick <- chronicler::pick

data("avia")

Now I need to define the needed functions for the analysis. To improve logging, I add the dim() function as the .g argument of each function below. This will make it possible to see how the dimensions of the data change inside the pipeline:

# Define required functions
# You can use `record_many()` to avoid having to write everything

r_select <- record(select, .g = dim)
r_pivot_longer <- record(pivot_longer, .g = dim)
r_filter <- record(filter, .g = dim)
r_separate <- record(separate, .g = dim)
r_group_by <- record(group_by, .g = dim)
r_summarise <- record(summarise, .g = dim)

We can now start by preparing the data:

avia_clean <- avia %>%
  r_select(1, contains("20")) %>% # select the first column and every column starting with 20
  bind_record(r_pivot_longer,
              -starts_with("freq"),
              names_to = "date",
              values_to = "passengers") %>%
  bind_record(r_separate,
              col = 1,
              into = c("freq", "unit", "tra_meas", "air_pr\\time"),
              sep = ",")

Let’s take a look at the data:

avia_clean
#> OK! Value computed successfully:
#> ---------------
#> Just
#> # A tibble: 464,616 × 6
#>    freq  unit   tra_meas `air_pr\\time`  date    passengers
#>    <chr> <chr>  <chr>    <chr>           <chr>   <chr>     
#>  1 M     FLIGHT CAF_PAS  LU_ELLX_AT_LOWW 2000    :         
#>  2 M     FLIGHT CAF_PAS  LU_ELLX_AT_LOWW 2001    :         
#>  3 M     FLIGHT CAF_PAS  LU_ELLX_AT_LOWW 2002    :         
#>  4 M     FLIGHT CAF_PAS  LU_ELLX_AT_LOWW 2003    :         
#>  5 M     FLIGHT CAF_PAS  LU_ELLX_AT_LOWW 2004    :         
#>  6 M     FLIGHT CAF_PAS  LU_ELLX_AT_LOWW 2005    :         
#>  7 M     FLIGHT CAF_PAS  LU_ELLX_AT_LOWW 2005-01 143       
#>  8 M     FLIGHT CAF_PAS  LU_ELLX_AT_LOWW 2005-02 134       
#>  9 M     FLIGHT CAF_PAS  LU_ELLX_AT_LOWW 2005-03 154       
#> 10 M     FLIGHT CAF_PAS  LU_ELLX_AT_LOWW 2005-04 142       
#> # ℹ 464,606 more rows
#> 
#> ---------------
#> This is an object of type `chronicle`.
#> Retrieve the value of this object with pick(.c, "value").
#> To read the log of this object, call read_log(.c).

The passengers column contains ":" characters instead of NAs, and it’s a character column. Let’s convert this column to numbers:

r_mutate <- record(mutate, .g = dim)

avia_clean2 <- avia_clean %>%
  bind_record(r_mutate,
              passengers = as.numeric(passengers))

Let’s look at the data:

avia_clean2
#> NOK! Value computed unsuccessfully:
#> ---------------
#> Nothing
#> 
#> ---------------
#> This is an object of type `chronicle`.
#> Retrieve the value of this object with pick(.c, "value").
#> To read the log of this object, call read_log(.c).

What happened? Let’s read the log to find out!

read_log(avia_clean2)
#> [1] "Complete log:"                                                                                                                                                                                                                                                                                                       
#> [2] "OK! select(1,contains(\"20\")) ran successfully at 2024-02-12 21:06:59.346958"                                                                                                                                                                                                                                       
#> [3] "OK! pivot_longer(-starts_with(\"freq\"),date,passengers) ran successfully at 2024-02-12 21:06:59.346865"                                                                                                                                                                                                             
#> [4] "OK! separate(1,c(\"freq\", \"unit\", \"tra_meas\", \"air_pr\\\\time\"),,) ran successfully at 2024-02-12 21:06:59.346518"                                                                                                                                                                                            
#> [5] "NOK! mutate(as.numeric(passengers)) ran unsuccessfully with following exception: \033[1m\033[22mThere was 1 warning in `.f()`.\n\033[1m\033[22m\033[36mℹ\033[39m In argument: `passengers = as.numeric(passengers)`.\nCaused by warning:\n\033[33m!\033[39m NAs introduced by coercion at 2024-02-12 21:07:06.081974"
#> [6] "Total running time: 7.11056327819824 secs"

So what happened is that as.numeric() introduced NAs by coercion. This is what happens when trying to convert a character to a number, for example as.numeric(":") will result in an NA. Because mutate() was recorded with the default value for its strict argument (which is 2), warnings get promoted to errors. This can be quite useful to avoid problems with silent conversions. But in this case, we want to ignore the warning: let’s record mutate() with strict = 1, so that only errors can stop the pipeline:

r_mutate_lenient <- record(mutate, .g = dim, strict = 1)

avia_clean2 <- avia_clean %>%
  bind_record(r_mutate_lenient,
              passengers = as.numeric(passengers)
              )
#> Warning: There was 1 warning in `.f()`.
#> ℹ In argument: `passengers = as.numeric(passengers)`.
#> Caused by warning:
#> ! NAs introduced by coercion

As you can see, the warnings get printed, they’re not captured. We can now take a look at the data and see that ":" characters where successfully replaced by NAs:

avia_clean2
#> OK! Value computed successfully:
#> ---------------
#> Just
#> # A tibble: 464,616 × 6
#>    freq  unit   tra_meas `air_pr\\time`  date    passengers
#>    <chr> <chr>  <chr>    <chr>           <chr>        <dbl>
#>  1 M     FLIGHT CAF_PAS  LU_ELLX_AT_LOWW 2000            NA
#>  2 M     FLIGHT CAF_PAS  LU_ELLX_AT_LOWW 2001            NA
#>  3 M     FLIGHT CAF_PAS  LU_ELLX_AT_LOWW 2002            NA
#>  4 M     FLIGHT CAF_PAS  LU_ELLX_AT_LOWW 2003            NA
#>  5 M     FLIGHT CAF_PAS  LU_ELLX_AT_LOWW 2004            NA
#>  6 M     FLIGHT CAF_PAS  LU_ELLX_AT_LOWW 2005            NA
#>  7 M     FLIGHT CAF_PAS  LU_ELLX_AT_LOWW 2005-01        143
#>  8 M     FLIGHT CAF_PAS  LU_ELLX_AT_LOWW 2005-02        134
#>  9 M     FLIGHT CAF_PAS  LU_ELLX_AT_LOWW 2005-03        154
#> 10 M     FLIGHT CAF_PAS  LU_ELLX_AT_LOWW 2005-04        142
#> # ℹ 464,606 more rows
#> 
#> ---------------
#> This is an object of type `chronicle`.
#> Retrieve the value of this object with pick(.c, "value").
#> To read the log of this object, call read_log(.c).

Let’s continue and focus on monthly data:

avia_monthly <- avia_clean2 %>%
  bind_record(r_filter,
              freq == "M",
              tra_meas == "PAS_BRD_ARR",
              !is.na(passengers)) %>%
  bind_record(r_mutate,
              date = paste0(date, "01"),
              date = ymd(date)) %>%
  bind_record(r_select,
              destination = "air_pr\\time", date, passengers)

To make sure I only have monthly data, I can count the values of the date column using dplyr::count(). But because avia_monthly is not a data frame, but a chronicle I need to record() the dplyr::count() function. But because I only need it this once, I could instead use fmap_record(), which makes it possible to apply an undecorated function to a chronicle object:

fmap_record(avia_monthly, count, date)
#> OK! Value computed successfully:
#> ---------------
#> Just
#> # A tibble: 226 × 2
#>    date           n
#>    <date>     <int>
#>  1 2005-01-01    23
#>  2 2005-02-01    23
#>  3 2005-03-01    23
#>  4 2005-04-01    24
#>  5 2005-05-01    24
#>  6 2005-06-01    24
#>  7 2005-07-01    24
#>  8 2005-08-01    24
#>  9 2005-09-01    24
#> 10 2005-10-01    24
#> # ℹ 216 more rows
#> 
#> ---------------
#> This is an object of type `chronicle`.
#> Retrieve the value of this object with pick(.c, "value").
#> To read the log of this object, call read_log(.c).

avia_monthly is an object of class chronicle, but in essence, it is just a list, with its own print method:

avia_monthly
#> OK! Value computed successfully:
#> ---------------
#> Just
#> # A tibble: 6,643 × 3
#>    destination     date       passengers
#>    <chr>           <date>          <dbl>
#>  1 LU_ELLX_AT_LOWW 2005-01-01       1758
#>  2 LU_ELLX_AT_LOWW 2005-02-01       1843
#>  3 LU_ELLX_AT_LOWW 2005-03-01       2129
#>  4 LU_ELLX_AT_LOWW 2005-04-01       2332
#>  5 LU_ELLX_AT_LOWW 2005-05-01       2402
#>  6 LU_ELLX_AT_LOWW 2005-06-01       2475
#>  7 LU_ELLX_AT_LOWW 2005-07-01       2082
#>  8 LU_ELLX_AT_LOWW 2005-08-01       2175
#>  9 LU_ELLX_AT_LOWW 2005-09-01       2288
#> 10 LU_ELLX_AT_LOWW 2005-10-01       2296
#> # ℹ 6,633 more rows
#> 
#> ---------------
#> This is an object of type `chronicle`.
#> Retrieve the value of this object with pick(.c, "value").
#> To read the log of this object, call read_log(.c).

Now that the data is clean, we can read the log:

read_log(avia_monthly)
#> [1] "Complete log:"                                                                                                           
#> [2] "OK! select(1,contains(\"20\")) ran successfully at 2024-02-12 21:06:59.346958"                                           
#> [3] "OK! pivot_longer(-starts_with(\"freq\"),date,passengers) ran successfully at 2024-02-12 21:06:59.346865"                 
#> [4] "OK! separate(1,c(\"freq\", \"unit\", \"tra_meas\", \"air_pr\\\\time\"),,) ran successfully at 2024-02-12 21:06:59.346518"
#> [5] "OK! mutate(as.numeric(passengers)) ran successfully at 2024-02-12 21:07:06.676081"                                       
#> [6] "OK! filter(freq == \"M\",tra_meas == \"PAS_BRD_ARR\",!is.na(passengers)) ran successfully at 2024-02-12 21:07:06.940474" 
#> [7] "OK! mutate(paste0(date, \"01\"),ymd(date)) ran successfully at 2024-02-12 21:07:06.940378"                               
#> [8] "OK! select(air_pr\\time,date,passengers) ran successfully at 2024-02-12 21:07:06.940263"                                 
#> [9] "Total running time: 6.89280724525452 secs"

This is especially useful if the object avia_monthly gets saved using saveRDS(). People can then read this object, can read the log to know what happened and reproduce the steps if necessary.

Let’s take a look at the final data set:

avia_monthly %>%
  pick("value")
#> # A tibble: 6,643 × 3
#>    destination     date       passengers
#>    <chr>           <date>          <dbl>
#>  1 LU_ELLX_AT_LOWW 2005-01-01       1758
#>  2 LU_ELLX_AT_LOWW 2005-02-01       1843
#>  3 LU_ELLX_AT_LOWW 2005-03-01       2129
#>  4 LU_ELLX_AT_LOWW 2005-04-01       2332
#>  5 LU_ELLX_AT_LOWW 2005-05-01       2402
#>  6 LU_ELLX_AT_LOWW 2005-06-01       2475
#>  7 LU_ELLX_AT_LOWW 2005-07-01       2082
#>  8 LU_ELLX_AT_LOWW 2005-08-01       2175
#>  9 LU_ELLX_AT_LOWW 2005-09-01       2288
#> 10 LU_ELLX_AT_LOWW 2005-10-01       2296
#> # ℹ 6,633 more rows

It is also possible to take a look at the underlying .log_df object that contains more details, and see the output of the .g argument (which was defined in the beginning as the dim() function):

check_g(avia_monthly)
#>   ops_number     function         g
#> 1          1       select 1434, 325
#> 2          2 pivot_longer 464616, 3
#> 3          3     separate 464616, 6
#> 4          4       mutate 464616, 6
#> 5          5       filter   6643, 6
#> 6          6       mutate   6643, 6
#> 7          7       select   6643, 3

After select() the data has hu[[1]][1] rows and hu[[1]][2] columns, after the call to pivot_longer(), hu[[2]][1] rows and hu[[2]][2] columns, separate() adds three columns, after filter() only hu[[5]][1] rows remain (mutate() does not change the dimensions) and then select() is used to remove three columns.