Chapter 1 Introduction

Cover image created with Dall-e. The prompt was 'Roman engineer building a pipeline in the style of ancient roman art'. The result doesn't really show that, but I thought it looked nice.

This is a course that I taught during the winter semester at the University of Luxembourg in 2022. It was aimed at second-year Master’s students.

1.1 Schedule

2022/10/25, morning: Introduction and data exploration with R
2022/10/25, afternoon: Functional programming
2022/10/31, afternoon: Git (optional?)
2022/11/07, afternoon: Package development and unit testing
2022/11/08, morning: Build automation
2022/11/08, afternoon: Building Data products
2022/11/21, afternoon: Self-contained pipelines with Docker
2022/11/22, morning: CI/CD with GitHub Actions

1.2 Reproducible analytical pipelines?

This course is my take on setting up code that results in some data product. This code has to be reproducible, documented and production ready. This is not my original idea: the concept was introduced by the UK’s Analysis Function.

The basic idea of a reproducible analytical pipeline (RAP) is to have code that always produces the same result when run, whatever this result might be. This is obviously crucial in research and science, but it matters just as much in businesses that rely on data science or data-driven decision making.

A well-documented RAP avoids a lot of headaches and is usually reusable for other projects as well.
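As a minimal sketch of what "the same result when run" means in practice, the base R snippet below fixes the random seed inside a hypothetical function, `simulate_mean()`, so that repeated calls return identical output. The function name, the default seed and the sample size are illustrative assumptions, not part of the course material.

```r
# Hypothetical helper: draws n random normal values and returns their mean.
# Setting the seed inside the function makes the result repeatable.
simulate_mean <- function(n, seed = 1234) {
  set.seed(seed)
  mean(rnorm(n))
}

first_run  <- simulate_mean(100)
second_run <- simulate_mean(100)

# TRUE: both runs give exactly the same number
identical(first_run, second_run)
```

A fixed seed is only one ingredient. Getting the same result on another machine also requires the same versions of R and of the packages used, which is where tools such as Docker come in.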

1.3 Data products?

In this course each of you will develop a data product. A data product is anything that requires data as an input. This can be a very simple report in PDF or Word format or a complex web app. This website is actually also a data product, which I made using the R programming language. We will not spend too much time on how to create automated reports or web apps (but I’ll give an introduction to these, don’t worry). Instead, our focus will be on how to set up a pipeline that produces these data products in a reproducible way.

1.4 Machine learning?

No, being a master in machine learning is not enough to become a data scientist. Actually, the older I get, the more I think that machine learning is almost optional. What is not optional is knowing how:

to write, test, and properly document code;
to acquire (reading in data can be tricky!) and clean data;
to work inside the Linux terminal/command line interface;
to use Git and Docker for Dev(Git)Ops;
the Internet works (what’s a firewall? what’s a reverse proxy? what’s a domain name? etc, etc…).

But what about machine learning? Well, depending on what you’ll end up doing, you might indeed focus a lot on machine learning and/or statistical modeling. That being said, in practice, it is very often much more efficient to let some automl algorithm figure out the best hyperparameters of an XGBoost model and simply use that, at least as a starting point (but good luck improving upon automl…). What matters is that the data you’re feeding to your model is clean, that your analysis is sensible, and most importantly, that it could be understood by someone taking over (imagine you get sick) and rerun with minimal effort in the future. The model here should simply be a piece that could be replaced by another model without much impact. The model is rarely central. Of course there are exceptions to this, especially in research, but every other point I’ve made still stands: not only do you have to care a lot about your model, you also have to care about everything else.
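To illustrate the idea of the model as a replaceable piece, here is a small sketch in base R. The function `run_analysis()` and the two model-fitting helpers are hypothetical names chosen for this example, and the "cleaning" step is a stand-in for whatever a real project would need. The model is passed in as an argument, so swapping it does not touch the rest of the code.

```r
# Hypothetical pipeline step: the model is passed in as a function,
# so it can be replaced without changing anything else.
run_analysis <- function(data, fit_model) {
  clean_data <- data[complete.cases(data), ]  # stand-in for real cleaning
  model <- fit_model(clean_data)
  predict(model, newdata = clean_data)
}

# Two interchangeable models, fitted on the built-in mtcars dataset
fit_linear    <- function(d) lm(mpg ~ wt, data = d)
fit_quadratic <- function(d) lm(mpg ~ wt + I(wt^2), data = d)

predictions_linear    <- run_analysis(mtcars, fit_linear)
predictions_quadratic <- run_analysis(mtcars, fit_quadratic)
```

Both calls go through exactly the same cleaning and prediction code; only the `fit_model` argument changes.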

So in this course we’re going to learn a bit of all of this. We’re going to learn how to write reusable code, learn some basics of the Linux command line, Git and Docker.

1.5 Why R? Why not [insert your favourite programming language]?

In my absolutely objective opinion R is currently the most interesting and simple language you can use to create such data products. If you learn R you have access to almost 19’000 packages (as of October 2022) to:

clean data (see: {dplyr}, {tidyr}, {data.table}…);
work with medium and big data (see: {arrow}, {sparklyr}…);
visualize data (see: {ggplot2}, {plotly}, {echarts4r}…);
do literate programming (using Rmarkdown or Quarto, you can write books and documents, and even create a website);
do functional programming (see: {purrr}…);
call other languages from R (see: {reticulate} to call Python from R);
do machine learning and AI (see: {tidymodels}, {tensorflow}, {keras}…);
create web apps (see: {shiny}…);
do domain-specific statistics/machine learning (see CRAN Task Views for an exhaustive list);
and more.

It’s not just about what the packages provide: installing R and its packages and dependencies is rarely frustrating, which is not the case with Python (Python 2 vs Python 3, pip vs conda, pyenv vs venv…, dependency hell is a real place full of snakes).

That doesn’t mean that R does not have any issues. Quite the contrary, R sometimes behaves in seemingly truly bizarre ways (as an example, try running nchar("1000000000") and then nchar(1000000000) and try to make sense of it). To know more about such bizarre behaviour, I recommend you read The R Inferno (linked at the end of this chapter). So, yes, R is far from perfect, but it sucks less than the alternatives (again, in my absolutely objective opinion).
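If you want to check your reasoning on the `nchar()` puzzle, the snippet below walks through it in base R. The variable names are only there for illustration. The key step is that `nchar()` converts a number to a character string before counting, and that string uses scientific notation.

```r
# A string made of ten characters
n_string <- nchar("1000000000")
n_string  # 10

# A number is first converted to a string, here in scientific notation
as.character(1000000000)  # "1e+09"
n_number <- nchar(1000000000)
n_number  # 5

# Formatting without scientific notation keeps all ten digits
n_formatted <- nchar(format(1000000000, scientific = FALSE))
n_formatted  # 10
```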

1.6 Pre-requisites

I will assume basic programming knowledge, and not much more. If you need to set up R on your computer you can read the intro to my other book Modern R with the tidyverse. Follow the pre-requisites there: install R, RStudio and these packages:

install.packages(c(
  "Ecdat",
  "devtools",
  "janitor",
  "plm",
  "pwt9",
  "quarto",
  "renv",
  "rio",
  "shiny",
  "targets",
  "tarchetypes",
  "testthat",
  "tidyverse",
  "tibble",
  "usethis"
))

The course will be very, very hands-on. I’ll give general hints and steps, and ask you to do stuff. It will not always be 100% simple and obvious, and you will also need to think a bit by yourself. I’ll help of course, so don’t worry. The idea is to put you in the shoes of a real data scientist who gets asked at 9 in the morning to come up with a solution to a problem by close of business (COB). In 99% of cases, you will never have encountered that problem before, as it will be very specific to the company you’re working at. Google and Stack Overflow will be your only friends in these moments.

The beginning of this course will likely be the toughest part, especially if you’re not familiar with R. I will need to bring you up to speed in 6 hours. Only after that can we actually start talking about RAPs. What’s important is to never give up and to work together with me.

1.7 Grading

The way grading works in this course is as follows: during lecture hours you will follow along. At home, you’ll be working on setting up your own pipeline. For this, choose a dataset that ideally would need some cleaning and/or tweaking to be usable. We are first going to learn how to package this dataset alongside some functions to clean it. If time allows, I’ll leave some time during lecture hours for you to work on it and ask me and your colleagues for help. At the end of the semester, I will need to download your code and get it running. The less effort this takes me, the better your score. Here is a tentative breakdown:

Code is on github.com and I can pull it: 2 points;
Data and functions to run the pipeline are in a tested, documented package: 3 points;
I don’t need to do anything to load data: 5 points;
I can download and install your pipeline’s dependencies in one command line: 5 points;
I can run your pipeline in one command line: 5 points;
Extra points: pipeline is dockerized and uses GitHub Actions to run: 5 points.
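To give an idea of what "one command line" can look like, here is a sketch of one possible setup. It assumes that your project records its dependencies in an `renv.lock` file (created with `renv::snapshot()`) and defines its pipeline in a `_targets.R` script. The two wrapper functions are hypothetical names used only for this example, and other setups are perfectly acceptable.

```r
# Sketch only: assumes the project contains an renv.lock file and a
# _targets.R script. Neither wrapper is called here.

# Hypothetical wrapper: installs the package versions recorded in renv.lock
install_dependencies <- function() {
  renv::restore(prompt = FALSE)
}

# Hypothetical wrapper: runs every out-of-date step of the pipeline
run_pipeline <- function() {
  targets::tar_make()
}

# From a terminal opened in the project folder, each step is then a
# single command, for example:
#   Rscript -e 'renv::restore(prompt = FALSE)'
#   Rscript -e 'targets::tar_make()'
```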

The way to fail this class is to write an undocumented script that only runs on your machine and expect me to debug it to get it to run.

1.8 Jargon

There’s some jargon that is helpful to know when working with R. Here’s a non-exhaustive list to get you started:

CRAN: the Comprehensive R Archive Network. This is a curated online repository of packages and R installers. When you type install.packages("package_name") in an R console, the package gets downloaded from there;
Library: the collection of R packages installed on your machine;
R console: the program where the R interpreter runs;
Posit/RStudio: Posit (named RStudio in the past) are the makers of the RStudio IDE and of the tidyverse collection of packages;
tidyverse: a collection of packages created by Posit that offer a common language and syntax to perform any task required for data science — from reading in data, to cleaning data, up to machine learning and visualisation;
base R: refers to a vanilla installation (and vanilla capabilities) of R. Often used to contrast with a tidyverse-specific approach to a problem (for example, using base R’s lapply() in contrast to the tidyverse’s purrr::map());
package::function(): Functions can be accessed in several ways in R, either by loading an entire package at the start of a script with library(dplyr) or by using dplyr::select().
Function factory (sometimes adverb): a function that returns a function.
Variable: the variable of a function (as in x in f(x)) or the variable from statistical modeling (synonym of feature);
<- vs =: in practice, you can use <- and = interchangeably for assignment (inside a function call, = names an argument instead). I prefer <-, but feel free to use = if you wish.
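A few of these terms are easier to grasp with a short example. The sketch below uses only base R, and all the object names (`squares`, `make_power`, `cube` and so on) are made up for illustration.

```r
# base R: apply a function to every element of a vector, returning a list
squares <- lapply(1:3, function(x) x^2)

# package::function(): call a function without loading its package first
# (the stats package ships with every R installation)
middle <- stats::median(c(1, 5, 9))

# Function factory: a function that returns a function
make_power <- function(exponent) {
  function(x) x^exponent
}
cube <- make_power(3)
cube(2)  # 8

# <- and = both assign a value at the top level of a script
a <- 10
b = 10
identical(a, b)  # TRUE
```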

1.9 Further reading

1.10 License

This course is licensed under the WTFPL.

Reproducible Analytical Pipelines - Master’s of Data Science