Polyglot pipelines and literate programming with Quarto or R Markdown
Source:vignettes/c-polyglot.Rmd
c-polyglot.Rmd
This vignette demonstrates how to build a polyglot pipeline and
assumes you’ve read vignette("b-basic-usage")
.
For a video version of this vignette, CHECK OUT THIS UPCOMING VIDEO ON YOUTUBE
You can find all the code of this example here. The built Quarto document can be viewed here (the pipeline in this vignette is a slightly simplified version). For the Rmd version, look here.
For various other examples of polyglot pipelines, check out the
folder labeled python_r
in this github
repository.
Analysing the mtcars dataset using R and Python
rixpress makes it easy to write polyglot (multilingual) data science pipelines with derivations that run R or Python code. This vignette explains how you can easily set up such a pipeline.
Let’s assume that you only have Nix installed on your system, and no R installation (this is the ideal scenario: if you plan to use Nix full-time for your development environments, you shouldn’t have a system-wide installation of R).
Before installing R and R packages for your pipeline, install cachix and configure the
rstats-on-nix
cache. This way, pre-compiled, binary,
packages will be used instead of being built from source. Run the
following line in a terminal:
then use the cache:
There might be a message telling you to add your user to a
configuration file by executing another command. Do it; you only need to
do this once per machine you want to use rixpress on.
Many thanks to Cachix for
sponsoring the rstats-on-nix
cache!
Now that the cache is configured, it’s time to bootstrap your development environment. Run this line:
nix-shell --expr "$(curl -sl https://raw.githubusercontent.com/ropensci/rix/main/inst/extdata/default.nix)"
This will drop you into a temporary shell with R and both
rix and rixpress available. Simply start R
by typing R
, and load rixpress and call
rxp_init()
which will generate two files,
gen-env.R
and gen-pipeline.R
. You can open
gen-env.R
in your favourite text editor and define the
execution environment in there:
library(rix)
rix(
date = "2025-03-31",
r_pkgs = c("dplyr", "igraph", "reticulate", "quarto"),
git_pkgs = list(
package_name = "rixpress",
repo_url = "https://github.com/b-rodrigues/rixpress",
commit = "HEAD"
),
py_conf = list(
py_version = "3.12",
py_pkgs = c("pandas", "polars", "pyarrow")
),
ide = "none",
project_path = ".",
overwrite = TRUE
)
Notice the py_conf
argument to rix()
: this
will install Python and the listed Python packages in that environment.
You’ll notice that we add reticulate to the list of R
packages to install as well; this is only needed to convert data between
R and Python, and only if you can’t use JSON to pass objects to and from
Python or R. Python build steps are executed in a standard Python shell
without the need to use reticulate, so if you’re using
JSON to transfer data, there is no need to install
reticulate.
Now that you defined the execution environment of the pipeline, you
can run that script, still from the temporary Nix shell by running
source("gen-env.R")
. This will generate the required
default.nix
. Then, quit R and the temporary shell (CTRL-D
or quit()
in R, exit
in the terminal) and then
build the environment defined by the freshly generated
default.nix
by typing nix-build
. This will now
build the execution environment of the pipeline. You can use this
environment to work on your project interactively as usual as well. To
learn more, check out {rix}
.
You can now edit the pipeline script in
gen-pipeline.R
:
library(rixpress)
library(igraph)
list(
rxp_py_file(
name = mtcars_pl,
path = 'data/mtcars.csv',
read_function = "lambda x: polars.read_csv(x, separator='|')"
),
rxp_py(
# reticulate doesn't support polars DFs yet, so need to convert
# first to pandas DF
name = mtcars_pl_am,
py_expr = "mtcars_pl.filter(polars.col('am') == 1).to_pandas()"
),
rxp_py2r(
name = mtcars_am,
expr = mtcars_pl_am
),
rxp_r(
name = mtcars_head,
expr = my_head(mtcars_am),
additional_files = "functions.R"
),
rxp_r2py(
name = mtcars_head_py,
expr = mtcars_head
),
rxp_py(
name = mtcars_tail_py,
py_expr = 'mtcars_head_py.tail()'
),
rxp_py2r(
name = mtcars_tail,
expr = mtcars_tail_py
),
rxp_r(
name = mtcars_mpg,
expr = dplyr::select(mtcars_tail, mpg)
),
rxp_quarto(
name = page,
qmd_file = "my_doc/page.qmd",
additional_files = c("my_doc/content.qmd", "my_doc/images")
)
) |>
rixpress(project_path = ".")
As you can see, it starts off by reading in some data using the
Python polars package, and then converts it to an R data frame for
further manipulation, converts it back to a Python data frame and back
to R. You’ll notice that at some point the head of the data is
computed using a user-defined function called my_head()
.
User-defined functions should all go into a script called
functions.R
or functions.py
and derivation
that use them need to be aware of them by setting the
additional_files
argument (if derivation need further files
to be available, these can be put there as well. A main difference
between rxp_py()
and rxp_r()
is that Python
code should be passed as a string, and not as an expression. Also,
you’ll notice that I had to use polars.read_csv()
instead
of the more common pl.read_csv()
. This is because by
default Python package get imported using a simple statement
import polars
. If you want to change this to
import polars as pl
(import pandas as pd
and
so on), then you can use the adjust_import()
function. For
example:
adjust_import("import polars", "import polars as pl")
adjust_import()
is sometimes mandatory, for example if
you want to import a package’s submodule:
adjust_import("import pillow", "from PIL import Image")
The package is called pillow
, so rixpress
will import write the statement as import pillow
, but this
will simply not work. So in this case adjust_import()
must
be used.
If you want to use JSON to transfer data between derivations, you
should use the serialize_function
and
unserialize_function
arguments respectively:
library(rixpress)
library(igraph)
list(
rxp_py_file(
name = mtcars_pl,
path = "data/mtcars.csv",
read_function = "lambda x: polars.read_csv(x, separator='|')"
),
rxp_py(
name = mtcars_pl_am,
py_expr = "mtcars_pl.filter(polars.col('am') == 1)",
additional_files = "functions.py",
serialize_function = "serialize_to_json",
),
rxp_r(
name = mtcars_head,
expr = my_head(mtcars_pl_am),
additional_files = "functions.R",
unserialize_function = "jsonlite::fromJSON"
),
rxp_r(
name = mtcars_mpg,
expr = dplyr::select(mtcars_head, mpg)
)
) |>
rixpress(project_path = ".", build = FALSE)
# Plot DAG for CI
dag_for_ci()
The Python serialize_to_json
function is defined in the
functions.py
script and looks like this:
def serialize_to_json(pl_df, path):
with open(path, 'w') as f:
f.write(pl_df.write_json())
The serialize_function
and
unserialize_function
arguments can be used to serialize
objects using any function, for example qs::save()
or
machine learning-specific functions for specific models such as
xgboost
.
Building a Quarto or R Markdown document
The last pipeline I want to discuss builds a Quarto document using
rxp_quarto()
(use rxp_rmd()
for a R Markdown
document). Here again, the additional_files
argument is
used to make the derivation aware of required files to build the
document. Here is what the source of the document looks like:
---
title: "Loading derivations outputs in a quarto doc"
format:
html:
embed-resources: true
toc: true
---

Use `rxp_read()` to show object in the document:
```
#| eval: true
rixpress::rxp_read("mtcars_head")
```
```
#| eval: true
rixpress::rxp_read("mtcars_tail")
```
```
#| eval: true
rixpress::rxp_read("mtcars_mpg")
```
{{< include content.qmd >}}
```
#| eval: true
rixpress::rxp_read("mtcars_tail_py")
```
Just like in an interactive session, rxp_read()
is used
to retrieve the objects from the store. See how I refer to the other
document content.qmd
and the image
meme.png
.
If you want to add further arguments to the Quarto command line tool,
you can use the args
argument:
rxp_quarto(
name = page,
qmd_file = "my_doc/page.qmd",
additional_files = c("my_doc/content.qmd", "my_doc/images"),
args = "--to typst"
)
and don’t forget to add typst
to the list of system
packages in the call to rix()
:
rix(
date = "2025-03-31",
r_pkgs = c("dplyr", "igraph", "reticulate", "quarto"),
system_pkgs = "typst",
git_pkgs = list(...
In the future, other languages could be added to rixpress, notably Julia.
For an examples that use the Python xgboost
library to
train a machine learning model, see this
example.