Building fully reproducible data science environments for R and Python with ease using nix, rix, and Docker

Bruno Rodrigues

Intro: Who am I

Bruno Rodrigues, head of statistics at the Ministry of Research and Higher Education in Luxembourg

Slides available online at https://is.gd/repro_basf

Code available at: https://github.com/b-rodrigues/repro_basf

Topics I want to talk about

  • What I mean by reproducibility

  • What Nix is, how it works, and how it complements Docker

  • What I will not discuss (but is very useful!):

    • Functional programming (FP), Git, documenting, testing and packaging code, build automation

What I mean by reproducibility

  • Ability to recover exactly the same results from an analysis
  • Why would you want that?
    • Auditing purposes
    • Updating data (impact should only come from data changes)
    • Reproducibility as a cornerstone of science
    • (Work on an immutable dev environment)

Making our scripts reproducible

We need to answer these questions

  1. How easy would it be for someone else to rerun the analysis?
  2. How easy would it be to update the project?
  3. How easy would it be to reuse this code for another project?
  4. What guarantee do we have that the output is stable over time?

Reproducibility is on a continuum (1/2)

Here are the four main factors that influence an analysis's reproducibility:

  • Version of R used
  • Versions of packages used
  • Operating system
  • Hardware
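
As a rough illustration, the four factors above can be recorded from within an R session using base R only. This is a minimal sketch: the name env_record is made up for this example, and recording this information does not by itself make an analysis reproducible.

```r
# Collect the four factors for the current session (base R only).
env_record <- list(
  # 1. Version of R
  r_version = R.version.string,
  # 2. Versions of the packages currently loaded
  packages = vapply(
    loadedNamespaces(),
    function(p) as.character(utils::packageVersion(p)),
    character(1)
  ),
  # 3. Operating system
  os = paste(Sys.info()[["sysname"]], Sys.info()[["release"]]),
  # 4. Hardware (CPU architecture only)
  hardware = Sys.info()[["machine"]]
)

str(env_record)
```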

Reproducibility is on a continuum (2/2)

Source: Peng, Roger D. 2011. “Reproducible Research in Computational Science.” Science 334 (6060): 1226–27

Package versioning with renv

  • renv is commonly used
  • Run renv::init() to snapshot the project library into an renv.lock file

What an renv.lock file looks like

{
"R": {
  "Version": "4.2.2",
  "Repositories": [
  {
   "Name": "CRAN",
   "URL": "https://packagemanager.rstudio.com/all/latest"
  }
  ]
},
"Packages": {
  "MASS": {
    "Package": "MASS",
    "Version": "7.3-58.1",
    "Source": "Repository",
    "Repository": "CRAN",
    "Hash": "762e1804143a332333c054759f89a706",
    "Requirements": []
  },
  "Matrix": {
    "Package": "Matrix",
    "Version": "1.5-1",
    "Source": "Repository",
    "Repository": "CRAN",
    "Hash": "539dc0c0c05636812f1080f473d2c177",
    "Requirements": [
      "lattice"
    ]

    ***and many more packages***

Restoring a library using an renv.lock file

  • renv.lock file not just a record
  • Can be used to restore as well!
  • Run renv::restore()
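
A typical sequence of calls might look like the sketch below. It assumes {renv} is installed and that the commands are run from the project's root directory; renv::snapshot() is not mentioned on the slides but is the usual way to refresh the lockfile after the library changes.

```r
# In the original project:
renv::init()      # create a project-specific library and write renv.lock
renv::snapshot()  # later: update renv.lock after installing or upgrading packages

# On another machine (or in a fresh clone of the project):
renv::restore()   # reinstall the package versions recorded in renv.lock
```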

{renv} alone is not enough

Shortcomings:

  1. Records, but does not restore the version of R
  2. Installation of old packages can fail (due to missing OS-dependencies)

But:

  1. Generating an renv.lock file is “free”
  2. Provides a blueprint for dockerizing our pipeline
  3. Creates a project-specific library (no interference between projects)
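
To make the first shortcoming concrete: the R version is recorded in the lockfile, so it can at least be compared with the running session. The sketch below is only an illustration. It assumes {jsonlite} is installed, check_r_version() is a made-up helper (not part of {renv}), and the lockfile is a minimal stand-in written to a temporary file.

```r
library(jsonlite)

# Minimal, made-up lockfile so the example is self-contained
lockfile <- tempfile(fileext = ".lock")
writeLines('{"R": {"Version": "4.2.2"}, "Packages": {}}', lockfile)

# Hypothetical helper: warn if the session's R version differs
# from the one recorded in the lockfile
check_r_version <- function(path) {
  recorded <- fromJSON(path)$R$Version
  running <- as.character(getRversion())
  if (recorded != running) {
    warning(
      "renv.lock records R ", recorded,
      " but this session runs R ", running,
      call. = FALSE
    )
  }
  invisible(recorded == running)
}

check_r_version(lockfile)
```

A check like this only detects the mismatch; installing the right version of R is still up to you, which is where Docker or Nix come in.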

Going further with Docker: handling R and system-level dependencies

  • Docker is a containerisation tool
  • Docker allows you to build images and run containers (a container is an instance of an image)
  • Docker images:
    1. contain all the software and code needed for your project
    2. are immutable (cannot be changed at run-time)
    3. can be shared on- and offline

Without Docker

With Docker

Dockerizing a project (1/3)

Dockerizing a project could look like this:

  • At image build-time:
    1. install R and R packages (or start from an image that ships R and restore the packages from renv.lock)
    2. copy all scripts to the image
    3. run the analysis (non-interactively)
  • At container run-time:
    1. copy the outputs of the analysis from the container to your computer
    2. optionally, “log in” to a running container to inspect code and outputs

Dockerizing a project (2/3)

  • Restoring packages with {renv} can be tricky:
#> * installing *source* package ‘ModelMetrics’ ...
#> ** package ‘ModelMetrics’ successfully unpacked and MD5 sums checked
#> ** using staged installation
#> ** libs
#> /usr/bin/clang++ -std=gnu++11 -I"/opt/R-devel/lib64/R/include" -DNDEBUG -I'/home/docker/R/Rcpp/include' -I/usr/local/include -fpic -g -O2 -c RcppExports.cpp -o RcppExports.o
#> /usr/bin/clang++ -std=gnu++11 -I"/opt/R-devel/lib64/R/include" -DNDEBUG -I'/home/docker/R/Rcpp/include' -I/usr/local/include -fpic -g -O2 -c auc_.cpp -o auc_.o
#> auc_.cpp:2:10: fatal error: 'omp.h' file not found
#> #include <omp.h>
#>          ^~~~~~~
#> 1 error generated.
#> make: *** [/opt/R-devel/lib64/R/etc/Makeconf:178: auc_.o] Error 1
#> ERROR: compilation failed for package ‘ModelMetrics’

Dockerizing a project (2/3)

  • The older the renv.lock file, the harder to restore!
  • Gets very complicated if you add Python and/or other tools.

Dockerizing a project (3/3)

  • Also: the image build process is not reproducible per se; only running containers from an already-built image is
  • You need to make sure the build process is reproducible (or store the built images):
    1. The version of R must be fixed
    2. The base image layer becomes unsupported at some point

Docker conclusion

  • Docker is very useful and widely used
  • But the entry cost is high (familiarity with Linux is recommended)
  • Single point of failure (what happens if Docker gets bought, abandoned, etc.? Quite unlikely, though)
  • Not actually dealing with reproducibility per se, we’re “abusing” Docker in a way

The Nix package manager (1/2)

  • Package manager: tool to install and manage packages

  • Package: any piece of software (not just R packages)

  • Example of a popular package manager: the Google Play Store

The Nix package manager (2/2)

  • For total reproducibility: R, R packages and other dependencies must be managed
  • Nix deals with everything, with one single text file (called a Nix expression)!
  • Nix is a package manager actually focused on reproducible builds
  • These Nix expressions always build the exact same output

A basic Nix expression (1/6)

let
  pkgs = import (fetchTarball "https://github.com/NixOS/nixpkgs/archive/976fa3369d722e76f37c77493d99829540d43845.tar.gz") {};
  system_packages = builtins.attrValues {
    inherit (pkgs) R ;
  };
in
  pkgs.mkShell {
    buildInputs = [ system_packages ];
    shellHook = "R --vanilla";
  }

There’s a lot to discuss here!

A basic Nix expression (2/6)

  • Written in the Nix language (not discussed)
  • Defines the repository to use (with a fixed revision)
  • Lists packages to install
  • Defines the output: a development shell

A basic Nix expression (3/6)

  • Software for Nix is defined in nixpkgs, a mono-repository of tens of thousands of expressions hosted on GitHub
  • Because it lives on GitHub, we can pin any commit to fix package versions for reproducibility!
  • For example, pinning the following commit provides R 4.3.1 and the associated packages:
pkgs = import (fetchTarball "https://github.com/NixOS/nixpkgs/archive/976fa3369d722e76f37c77493d99829540d43845.tar.gz") {};

A basic Nix expression (4/6)

  • system_packages: a variable that lists software to install
  • In this case, only R:
system_packages = builtins.attrValues {
  inherit (pkgs) R ;
};

A basic Nix expression (5/6)

  • Finally, we define a shell:
pkgs.mkShell {
  buildInputs = [ system_packages ];
  shellHook = "R --vanilla";
}
  • This shell will come with the software defined in system_packages (buildInputs)
  • And launch R --vanilla when started (shellHook)

A basic Nix expression (6/6)

  • Writing these expressions requires learning a new language
  • While incredibly powerful, if all we want are per-project reproducible dev shells…
  • …then {rix} will help!

Nix expressions

  • Nix expressions can be used to install software
  • But we will use them to build per-project development shells
  • These shells can include R, LaTeX packages, Quarto, Python, Julia, …
  • Nix takes care of installing every dependency down to the compiler!

rix: reproducible development environments with Nix (1/5)

  • {rix} (https://docs.ropensci.org/rix) makes writing Nix expressions easy!
  • Simply use the provided rix() function:
library(rix)

rix(
  date = "2025-02-17",
  r_pkgs = "ggplot2",
  py_conf = list(
    py_version = "3.12",
    py_pkgs = c("polars", "great-tables")
  ),
  project_path = ".", # folder where default.nix is written
  overwrite = TRUE
)

rix: reproducible development environments with Nix (2/5)

  • renv.lock files can also be used as starting points:
library(rix)

renv2nix(
  renv_lock_path = "path/to/original/renv_project/renv.lock",
  project_path = "path/to/rix_project",
  override_r_ver = "4.4.1" # <- optional
)

rix: reproducible development environments with Nix (3/5)

  • List required R version and packages
  • Optionally: more system packages, packages hosted on GitHub, or LaTeX packages
  • Optionally: an IDE (RStudio, Radian, VS Code or “other”)
  • Work interactively in an isolated, project-specific and reproducible environment!
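
The options listed above map onto arguments of rix(). The call below is a sketch rather than a recommended setup: the package choices (quarto, amsmath) are purely illustrative, and it assumes a recent version of {rix} in which the system_pkgs, tex_pkgs, ide and print arguments are available.

```r
library(rix)

rix(
  date = "2025-02-17",            # pins R and all packages to this date
  r_pkgs = c("dplyr", "ggplot2"), # R packages
  system_pkgs = "quarto",         # extra system-level software
  tex_pkgs = "amsmath",           # LaTeX packages
  ide = "none",                   # no IDE managed by Nix
  project_path = ".",             # folder where default.nix is written
  overwrite = TRUE,
  print = TRUE                    # also print the generated expression
)
```

Running nix-build in the project folder then builds the environment, and nix-shell drops you into it.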

rix: reproducible development environments with Nix (4/5)

  • Time for a demonstration, see scripts/nix_expressions/docker/
  • First outside of Docker
  • Then we dockerize

(you’ll find many other examples in the repository)

rix: reproducible development environments with Nix (5/5)

  • Can install specific versions of packages (write "dplyr@1.0.0")
  • Can install packages hosted on GitHub
  • Many vignettes to get you started! See https://docs.ropensci.org/rix

Non-interactive use

  • {rix} makes it easy to run pipelines in the right environment
  • (Little side note: the best tool to build pipelines in R is {targets})
  • See scripts/nix_expressions/nix_targets_pipeline
  • Can also run the pipeline like so:
cd /absolute/path/to/pipeline/ && nix-shell default.nix --run "Rscript -e 'targets::tar_make()'"
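
For context, targets::tar_make() looks for a _targets.R file at the root of the pipeline folder. The file below is a hypothetical, minimal pipeline written for illustration only (it is not the one in the repository); the target names and the use of the built-in mtcars dataset are assumptions.

```r
# _targets.R (hypothetical example)
library(targets)

# Each target is one step; {targets} works out the order from the
# dependencies between them and skips steps that are already up to date.
list(
  tar_target(raw_data, mtcars),
  tar_target(model, lm(mpg ~ wt, data = raw_data)),
  tar_target(model_summary, summary(model))
)
```

Because the command above runs inside nix-shell, every step executes with the R and package versions pinned in default.nix.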

Nix and GitHub Actions: running pipelines

  • A {targets} pipeline can easily be run on GitHub Actions
  • Simply run rix::tar_nix_ga() to generate the required files
  • Commit and push, and watch the actions run!
  • See the link in the online slides.

Nix and GitHub Actions: writing papers

  • Easy collaboration on papers as well
  • See the link in the online slides
  • Just focus on writing!

Conclusion

  • A vast and complex topic!
  • At the very least, generate an renv.lock file
  • With an renv.lock file, a Docker image can always be built later (by you or by someone else!)
  • Consider using {targets}: not only good for reproducibility, but also an amazing tool all around
  • Long-term reproducibility: must use Nix (with or without Docker) or store Docker images
  • Maybe check out my other package (in early development): {rixpress}

The end

Contact me if you have questions:

  • bruno@brodrigues.co
  • Twitter: @brodriguesco
  • Mastodon: @brodriguesco@fosstodon.org
  • Blog: www.brodrigues.co
  • Book: www.raps-with-r.dev
  • rix: https://docs.ropensci.org/rix

Thank you!