Encoding, Decoding, and Cross-Language Data Transfer
Source:vignettes/encoding-decoding.Rmd
Introduction
Data pipelines in rixpress often require controlling how objects are stored and restored, especially when dealing with:
- Non-standard R objects (e.g., machine learning models, large tables).
- Multiple file formats (CSV, qs compressed files, etc.).
- Cross-language workflows mixing R and Python.
This vignette focuses on encoding and decoding in R, and on transferring data between R and Python using rxp_py2r() and rxp_r2py().
Custom Encoding and Decoding in R
By default, rixpress uses saveRDS() and readRDS(). You can override this to handle different formats or complex objects:
library(rixpress)

# Encode output as CSV instead of RDS
d2 <- rxp_r(
  mtcars_head,
  my_head(mtcars_am, 100),
  user_functions = "my_head.R",
  nix_env = "default.nix",
  encoder = write.csv
)

# Encode as qs, decode input from CSV
d3 <- rxp_r(
  mtcars_tail,
  my_tail(mtcars_head),
  user_functions = "my_tail.R",
  nix_env = "default2.nix",
  encoder = qs::qsave,
  decoder = read.csv
)

# Decode multiple upstream objects with different decoders
d4 <- rxp_r(
  mtcars_mpg,
  full_join(mtcars_tail, mtcars_head),
  nix_env = "default2.nix",
  decoder = c(
    mtcars_tail = "qs::qread",
    mtcars_head = "read.csv"
  )
)
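One way to picture the named decoder vector used in d4 is as a lookup from upstream object name to reader function. The sketch below is plain base R and is not how rixpress implements decoding internally. The temporary files stand in for upstream outputs, and readRDS replaces qs::qread so that no extra package is needed.

```r
# Two stand-in "upstream outputs" stored in different formats
head_path <- tempfile(fileext = ".csv")
tail_path <- tempfile(fileext = ".rds")
write.csv(head(mtcars), head_path)
saveRDS(tail(mtcars), tail_path)

# Named vector: upstream object name -> decoder name (same shape as in d4)
decoders <- c(mtcars_head = "read.csv", mtcars_tail = "readRDS")
paths <- c(mtcars_head = head_path, mtcars_tail = tail_path)

# Look up each decoder by name and apply it to the matching file
inputs <- lapply(names(decoders), function(nm) {
  decode <- match.fun(decoders[[nm]])
  decode(paths[[nm]])
})
names(inputs) <- names(decoders)

str(inputs, max.level = 1)
```

Each input is read with the function that matches how it was written, which is the mismatch a per-object decoder is meant to prevent.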
Key points:
- encoder controls how this step’s output is stored.
- decoder specifies how to read inputs from upstream derivations.
- You can assign different decoders per upstream object using a named vector.
As shown in the examples above, you can pass a function or a string representation of the function to encoder and decoder.
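A matching encoder and decoder pair could look like the minimal sketch below. It assumes, based on the functions used in the examples above (write.csv, qs::qsave, read.csv), that an encoder is called with the object and a file path and a decoder with a file path. The names write_csv_plain and read_csv_plain are hypothetical. The sketch also shows why a plain write.csv/read.csv pair may not round-trip exactly: write.csv stores row names by default, and read.csv brings them back as an extra column.

```r
# Hypothetical encoder: takes the object and a file path (assumed signature)
write_csv_plain <- function(x, path) {
  utils::write.csv(x, file = path, row.names = FALSE)
}

# Hypothetical decoder: takes a file path and returns the object
read_csv_plain <- function(path) {
  utils::read.csv(path)
}

df <- head(mtcars[, c("mpg", "cyl", "am")])
rownames(df) <- NULL

# Default write.csv() stores row names in an extra, unnamed column,
# which read.csv() brings back as a column called "X"
default_path <- tempfile(fileext = ".csv")
utils::write.csv(df, default_path)
with_rownames <- utils::read.csv(default_path)
names(with_rownames)

# The wrapper pair round-trips the same columns
plain_path <- tempfile(fileext = ".csv")
write_csv_plain(df, plain_path)
roundtrip <- read_csv_plain(plain_path)
names(roundtrip)
```

In a pipeline, such helpers would need to be available to the step, for example through a script passed to user_functions, and could then be passed to encoder and decoder as functions or by name.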
Encoding an object in a cross-language format makes it possible to pass it to another language. For example, you can read a CSV file using Julia, encode it as Arrow, and read it back in R:
library(rixpress)

list(
  rxp_jl_file(
    mtcars,
    # Assume here that mtcars.csv is separated by "|" instead of ","
    path = "data/mtcars.csv",
    read_function = "read_csv",
    user_functions = "functions.jl",
    encoder = "write_arrow"
    # read_csv and write_arrow are both
    # defined in the functions.jl script
    # and look like this:
    # function write_arrow(df::DataFrame, filename::String)
    #   Arrow.write(filename, df)
    # end
    # function read_csv(path::String)
    #   df = CSV.read(path, DataFrame; delim="|")
    #   return df
    # end
  ),
  rxp_r(
    mtcars2,
    select(mtcars, am, cyl, mpg),
    decoder = "read_feather"
  )
) |>
  rxp_populate()
You can find this example here. The same approach works for transferring data to and from Python, and more generally between any of the three supported languages.
Cross-Language Data Transfer: R ↔︎ Python
In the specific case of transferring objects (data, lists, vectors, arrays, etc.) between R and Python, it is also possible to use reticulate’s built-in conversion through rxp_py2r() and rxp_r2py(). These functions convert objects between R and Python directly:
library(rixpress)

# Python step producing pandas DataFrame
d1 <- rxp_py(
  name = mtcars_pl_am,
  expr = "mtcars_pl.filter(polars.col('am') == 1).to_pandas()"
)

# Transfer Python -> R
d2 <- rxp_py2r(
  name = mtcars_am,
  expr = mtcars_pl_am
)

# R step processing the data
d3 <- rxp_r(
  name = mtcars_head,
  expr = my_head(mtcars_am),
  user_functions = "functions.R"
)

# Transfer R -> Python
d3_1 <- rxp_r2py(
  name = mtcars_head_py,
  expr = mtcars_head
)
For this to work, you need to add reticulate to the pipeline’s execution environment.
Summary
- Use encoder/decoder for non-RDS objects (CSV, qs, Keras models) and to pass data to and from different languages.
- Explicitly set decoders per upstream object to avoid mismatches.
- Use rxp_py2r() and rxp_r2py() if you want to reuse reticulate’s built-in conversion (useful for more complex objects).