Polyglot pipelines and literate programming with Quarto or R Markdown • rixpress

This vignette demonstrates how to build a polyglot pipeline and assumes you’ve read vignette("b-core-functions").

For a video version of this vignette, click here.

You can find all the code of this example here. The built Quarto document can be viewed here (the pipeline in this vignette is a slightly simplified version). For the Rmd version, look here.

For various other examples of polyglot pipelines, check out the folder labeled python_r in this github repository.

Analysing the mtcars dataset using R and Python

rixpress makes it easy to write polyglot (multilingual) data science pipelines with derivations that run R or Python code. This vignette explains how you can easily set up such a pipeline.

Let’s assume that you only have Nix installed on your system, and no R installation (this is the ideal scenario: if you plan to use Nix full-time for your development environments, you shouldn’t have a system-wide installation of R).

Before installing R and R packages for your pipeline, install cachix and configure the rstats-on-nix cache. This way, pre-compiled, binary packages will be used instead of being built from source. Run the following line in a terminal:

nix-env -iA cachix -f https://cachix.org/api/v1/install

then use the cache:

cachix use rstats-on-nix

There might be a message telling you to add your user to a configuration file by executing another command. If so, follow the instructions; you only need to do this once per machine you want to use rixpress on. Many thanks to Cachix for sponsoring the rstats-on-nix cache!

Now that the cache is configured, it’s time to bootstrap your development environment. Run this line:

nix-shell --expr "$(curl -sl https://raw.githubusercontent.com/ropensci/rix/main/inst/extdata/default.nix)"

This will drop you into a temporary shell with R and both rix and rixpress available. Simply start R by typing R, and load rixpress and call rxp_init() which will generate two files, gen-env.R and gen-pipeline.R. You can open gen-env.R in your favourite text editor and define the execution environment there:

library(rix)

rix(
  date = "2025-03-31",
  r_pkgs = c("dplyr", "igraph", "reticulate", "quarto"),
  git_pkgs = list(
    package_name = "rixpress",
    repo_url = "https://github.com/b-rodrigues/rixpress",
    commit = "HEAD"
  ),
  py_conf = list(
    py_version = "3.12",
    py_pkgs = c("pandas", "polars", "pyarrow")
  ),
  ide = "none",
  project_path = ".",
  overwrite = TRUE
)

Notice the py_conf argument to rix(): this will install Python and the listed Python packages in that environment. You’ll notice that we add reticulate to the list of R packages to install as well; this is primarily for converting data between R and Python if you’re not using a universal format like JSON. Python build steps are executed in a standard Python shell and do not require reticulate for Python code execution itself, so if you’re only using JSON to transfer data, reticulate is not required.

Now that you defined the execution environment of the pipeline, you can run the gen-env.R script, still from the temporary Nix shell by running source("gen-env.R"). This will generate the required default.nix. Then, quit R and the temporary shell (CTRL-D or quit() in R, exit in the terminal) and then build the environment defined by the freshly generated default.nix by typing nix-build. This will now build the execution environment of the pipeline. You can use this environment to work on your project interactively as usual. To learn more, check out {rix}.

You can now edit the pipeline script in gen-pipeline.R:

library(rixpress)
library(igraph)

list(
  rxp_py_file(
    name = mtcars_pl,
    path = 'data/mtcars.csv',
    read_function = "lambda x: polars.read_csv(x, separator='|')"
  ),

  rxp_py(
    # reticulate doesn't support polars DFs yet, so need to convert
    # first to pandas DF
    name = mtcars_pl_am,
    py_expr = "mtcars_pl.filter(polars.col('am') == 1).to_pandas()"
  ),

  rxp_py2r(
    name = mtcars_am,
    expr = mtcars_pl_am
  ),

  rxp_r(
    name = mtcars_head,
    expr = my_head(mtcars_am),
    additional_files = "functions.R"
  ),

  rxp_r2py(
    name = mtcars_head_py,
    expr = mtcars_head
  ),

  rxp_py(
    name = mtcars_tail_py,
    py_expr = 'mtcars_head_py.tail()'
  ),

  rxp_py2r(
    name = mtcars_tail,
    expr = mtcars_tail_py
  ),

  rxp_r(
    name = mtcars_mpg,
    expr = dplyr::select(mtcars_tail, mpg)
  ),

  rxp_qmd(
    name = page,
    qmd_file = "my_doc/page.qmd",
    additional_files = c("my_doc/content.qmd", "my_doc/images")
  )
) |>
  rixpress(project_path = ".")

As you can see, it starts by reading in some data using the Python polars package, and then converts it to an R data frame for further manipulation, converts it back to a Python data frame and back to R. You’ll notice that at some point the head of the data is computed using a user-defined function called my_head(). User-defined functions should all go into a script called functions.R or functions.py and derivations that use them need to be aware of them by setting the additional_files argument (if derivations need further files to be available, these can also be specified there. A main difference between rxp_py() and rxp_r() is that Python code should be passed as a string, and not as an expression. Also, you’ll notice that I had to use polars.read_csv() instead of the more common pl.read_csv(). This is because by default Python packages get imported using a simple statement import polars. If you want to change this to import polars as pl (import pandas as pd and so on), then you can use the adjust_import() function. For example:

adjust_import("import polars", "import polars as pl")

adjust_import() is sometimes mandatory, for example if you want to import a package’s submodule:

adjust_import("import pillow", "from PIL import Image")

The package is called pillow, so rixpress will write the statement as import pillow, but this will simply not work. So in this case adjust_import() must be used.

If you want to use JSON to transfer data between derivations, you should use the serialize_function and unserialize_function arguments respectively:

library(rixpress)
library(igraph)

list(
  rxp_py_file(
    name = mtcars_pl,
    path = "data/mtcars.csv",
    read_function = "lambda x: polars.read_csv(x, separator='|')"
  ),

  rxp_py(
    name = mtcars_pl_am,
    py_expr = "mtcars_pl.filter(polars.col('am') == 1)",
    additional_files = "functions.py",
    serialize_function = "serialize_to_json",
  ),

  rxp_r(
    name = mtcars_head,
    expr = my_head(mtcars_pl_am),
    additional_files = "functions.R",
    unserialize_function = "jsonlite::fromJSON"
  ),

  rxp_r(
    name = mtcars_mpg,
    expr = dplyr::select(mtcars_head, mpg)
  )
) |>
  rixpress(project_path = ".", build = FALSE)


# Plot DAG for CI
dag_for_ci()

The Python serialize_to_json function is defined in the functions.py script and looks like this:

def serialize_to_json(pl_df, path):
    with open(path, 'w') as f:
        f.write(pl_df.write_json())

The serialize_function and unserialize_function arguments can be used to serialize objects using any function, for example qs::save() or machine learning-specific functions for specific models, such as those from xgboost.

Building a Quarto or R Markdown document

The last pipeline I want to discuss builds a Quarto document using rxp_qmd() (use rxp_rmd() for an R Markdown document). Here again, the additional_files argument is used to make the derivation aware of required files to build the document. Here is what the source of the document looks like:


---
title: "Loading derivations outputs in a quarto doc"
format:
  html:
    embed-resources: true
    toc: true
---

![Meme](images/meme.png)

Use `rxp_read()` to show object in the document:

```
#| eval: true

rixpress::rxp_read("mtcars_head")
```

```
#| eval: true

rixpress::rxp_read("mtcars_tail")
```

```
#| eval: true

rixpress::rxp_read("mtcars_mpg")
```

{{< include content.qmd >}}

```
#| eval: true

rixpress::rxp_read("mtcars_tail_py")
```

Just like in an interactive session, rxp_read() is used to retrieve the objects from the store. See how I refer to the other document content.qmd and the image meme.png.

If you want to add further arguments to the Quarto command line tool, you can use the args argument:

rxp_qmd(
  name = page,
  qmd_file = "my_doc/page.qmd",
  additional_files = c("my_doc/content.qmd", "my_doc/images"),
  args = "--to typst"
)

and don’t forget to add typst to the list of system packages in the call to rix():

rix(
  date = "2025-03-31",
  r_pkgs = c("dplyr", "igraph", "reticulate", "quarto"),
  system_pkgs = "typst",
  git_pkgs = list(...

In the future, other languages could be added to rixpress, notably Julia.

For more examples, check out rixpress_demos repository. These examples demonstrate additional features of rixpress, including:

and many others! Don’t hesitate to submit more examples as well!