Reproducible data science with Nix, {rix} and {rixpress}

Bruno Rodrigues

Intro: Who am I

Bruno Rodrigues, head of the statistics department at the Ministry of Research and Higher Education in Luxembourg

Intro: Luxembourg?

Intro: Luxembourg?

Intro: where to find the code

Slides available online:

https://b-rodrigues.github.io/repro_ukraine

Code available here:

https://github.com/b-rodrigues/repro_ukraine

This workshop in a nutshell

The reproducibility puzzle you know

This workshop in a nutshell

The reproducibility puzzle with Nix

But why bother?

  • Easier to collaborate
  • Easier to reproduce
  • Easier to debug
  • Ultimately, peace of mind!
  • Also, see latest discussions around {ggplot2} 4.0.0

Part 1: Reproducible environments for data science with {rix}

Available solutions for R (1/2)

  • {renv} or {groundhog}: simple to use, but:
    • Doesn’t save the R version
    • Installing old packages may fail (system dependencies)
  • Docker goes further:
    • Manages R and system dependencies
    • Uses immutable and shareable img
    • Containers executable anywhere

Available solutions for R (2/2)

  • Docker limitations:
    • Learning curve (Linux knowledge recommended)
    • Not originally designed for reproducibility
    • See: Rocker

The Nix package manager (1/2)

Package manager: tool for installing and managing packages

Package: any software (not just R packages)

A popular package manager:

Google Play Store

The Nix package manager (2/2)

  • To ensure reproducibility: R, R packages, and other dependencies must be explicitly managed
  • Nix is a package manager truly focused on reproducible builds
  • Nix manages everything using a single text file (called a Nix expression)!
  • These expressions always produce exactly the same result

rix: reproducible development environments with Nix (1/5)

  • {rix} (website) simplifies writing Nix expressions!

  • Just use the provided rix() function:

library(rix)

rix(date = "2025-06-02",
    r_pkgs = c("dplyr", "ggplot2"),
    system_pkgs = NULL,
    git_pkgs = NULL,
    tex_pkgs = NULL,
    ide = "code",
    project_path = ".")

rix: reproducible development environments with Nix (2/5)

  • renv.lock files can also serve as a starting point:
library(rix)

renv2nix(
  renv_lock_path = "path/to/original/renv_project/renv.lock",
  project_path = "path/to/rix_project",
  override_r_ver = "4.4.1" # <- optional
)

rix: reproducible development environments with Nix (3/5)

  • List the R version and required packages
  • Optionally:
    • system packages, GitHub packages, or LaTeX packages
    • an IDE (RStudio, Radian, VS Code, or “other”)
    • a version of Python and Python packages to include
    • a version of Julia and Julia packages to include

rix: reproducible development environments with Nix (4/5)

  • rix::rix() generates a default.nix file
  • Build the expressions with nix-build (in terminal) or rix::nix_build() from R
  • Access the development environment with nix-shell
  • Expressions can be generated even without Nix installed (with some limitations)

rix: reproducible development environments with Nix (5/5)

  • Can install specific versions of packages (write "dplyr@1.0.0")
  • Can install packages hosted on GitHub
  • Many vignettes to get started! See here

Super quickstart

  • Uninstall R and RStudio
  • Install Nix (on win/linux or on macOS)
  • Start a temporary Nix shell to bootstrap your environments:
nix-shell -p R rPackages.rix
  • Source a gen_env.R script that generates a default.nix and build that!

Demo

  • Basics: scripts/nix_expressions/01_rix_intro/
  • renv2nix: scripts/nix_expressions/02_renv2nix/
  • Native Code/Positron on Windows: scripts/nix_expressions/03_native_vscode_example/
  • Nix and {targets}: scripts/nix_expressions/04_nix_targets_pipeline/
  • Nix and Docker: scripts/nix_expressions/05_docker/
  • Nix and {shiny}: scripts/nix_expressions/06_shiny/
  • GitHub Actions: see here

Part 2: Reproducible analytical pipelines with {rixpress}

Introduction (1/3)

  • Nix is actually more than just a mere package manager
  • Nix is complete end-to-end build tool that leverages functional programming principles to ensure reproducible builds
  • Users write Nix expressions which are then translated into Nix derivations
  • Derivation: a specification for running an executable on precisely defined input files to repeatably produce output files at uniquely determined file system paths. (source)

Introduction (2/3)

  • Essentially: a derivation is a recipe with precisely defined inputs, steps, and a fixed output.
  • Given identical inputs and build steps → always produce exact same output
  • All inputs to a derivation must be explicitly declared.
  • Inputs include not just data files, but also software dependencies, configuration flags, and environment variables, essentially anything necessary for the build process.
  • The build process takes place in a hermetic sandbox to ensure the exact same output is always produced.

Introduction (3/3)

  • {rix}: output is a shell that contains required software
  • {rixpress}: output is whatever is the output of your pipeline (cleaned dataset, Quarto/Rmd document, model predictions, model parameters/weights, model itself…)

rixpress

  • {rixpress} allows chaining processing steps in R and Python
  • Uses {rix} to create a reproducible (via Nix) execution environment for the pipeline
  • Each pipeline step is a Nix derivation
  • Data transfer: automatic via reticulate or universal format (CSV, JSON, Parquet…)

An example of a polyglot pipeline

list(
  rxp_py_file(…),    # Read a CSV with Python
  rxp_py(…),         # Filter with Polars
  rxp_py2r(…),       # Python → R transfer
  rxp_r(…),          # Transformation in R
  rxp_r2py(…),       # R → Python transfer
  rxp_py(…),         # Another Python step
  rxp_py2r(…),       # Back to R
  rxp_r(…)           # Final step
) |> rixpress()
  • Each step is named and typed (py, r, r2py, etc.)
  • Ability to add files (functions.R, img…)

Defining a derivation (input data)

rxp_r_file(
  name = mtcars,
  path = 'data/mtcars.csv',
  read_function = \(x) (read.csv(file = x, sep = "|"))
)

Defining a derivation (some computation)

rxp_r(
  name = filtered_mtcars,
  expr = filter(mtcars, am == 1)
)

Typical structure of a rixpress project

.
├── data
│   └── dataset.csv    # input data (can be many files)
├── functions.py       # user-defined Python functions
├── functions.R        # user-defined R function
├── gen-env.R          # rix script to generate execution env
├── gen-pipeline.R     # rixpress script to generate pipeline
├── my_paper           # folder containing Quarto doc
│   ├── section.qmd    # Qmd file
│   ├── img         # Folder containing img for document
│   │   └── graph.png  # Image to add to paper
│   └── main.qmd       # Main Qmd file
└── Readme.md          # Readme

Defining Quarto or Rmd documents

Use rxp_read() to include pipeline outputs:

rixpress::rxp_read("mtcars_head")

All created objects can be dynamically loaded into the document

Possible to include additional files (content.qmd, img…)

Using rixpress (1/3)

  • Start from an empty folder;
  • Drop into a temporary shell:
nix-shell --expr "$(curl -sl https://raw.githubusercontent.com/ropensci/rix/main/inst/extdata/default.nix)"

Using rixpress (2/3)

  • Start project structure using rixpress::rxp_init();
  • Edit gen-env.R and gen-pipeline.R;
  • Write code in gen-pipeline.R and build pipeline using rxp_make();
  • Inspect outputs using rxp_inspect();
  • View outputs using rxp_read() (load them using rxp_load();
  • View DAG of pipeline using rxp_ggdag() or rxp_visnetwork()

Using rixpress (3/3)

  • Build artifacts get saved and re-used across runs;
  • Possible to export and import them using import_nix_archive() and export_nix_archive() (very useful on CI!);
  • To set up GitHub Actions use rxp_ga();
  • Logs of runs get saved, and possible to reload older versions of build artifacts (see vignette).

Transfer with JSON (or other universal format)

  • Advantage: avoids using reticulate
  • Add a Python serialization function:
def serialize_to_json(pl_df, path):
    with open(path, 'w') as f:
        f.write(pl_df.write_json())
  • And on the R side:
rxp_r(
  name = "x",
  expr = my_fun(data),
  unserialize_function = "jsonlite::fromJSON"
)

Interactive demo

See scripts/rixpress_demo

To learn more:

Fin

If you have questions:

Thanks!