A crucial first step in any data analysis pipeline is importing data.
The rixpress package provides a flexible set of functions, rxp_r_file,
rxp_py_file, and rxp_jl_file, to handle various data import scenarios in
a reproducible way. This vignette will guide you through the common use
cases.
For more examples, check out the rixpress_demos repository.
Importing a single local file
The most straightforward case is reading a single data file from your
local project directory. You need to provide a name for the resulting R
object, the path to the file, and a read_function to process it.
library(rixpress)
list(
  rxp_r_file(
    name = mtcars,
    path = 'data/mtcars.csv',
    read_function = \(x) (read.csv(file = x, sep = "|"))
  ),
  ...
In this example, rxp_r_file creates a derivation that:
- Copies data/mtcars.csv into a sandboxed build environment.
- Executes the provided anonymous function, \(x) (read.csv(file = x, sep = "|")), where x is the path to the copied file inside the sandbox.
- Saves the resulting data frame as an object named mtcars for subsequent steps in the pipeline.
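The three steps above can be mimicked in a few lines of Python. This is only a conceptual sketch: it is not how rixpress or Nix implement a derivation, and the names run_file_step and read_pipe_csv are hypothetical. It shows the contract that matters to you, namely that the read function receives the path of the copy and not the original path.

```python
import csv
import shutil
import tempfile
from pathlib import Path


def run_file_step(path, read_function):
    """Conceptual stand-in for a file-import step (NOT rixpress internals)."""
    with tempfile.TemporaryDirectory() as sandbox:
        # 1. Copy the input file into an isolated directory.
        copied = Path(sandbox) / Path(path).name
        shutil.copy(path, copied)
        # 2. Call the read function with the path of the copy.
        result = read_function(str(copied))
    # 3. Return the result so that later steps can use it.
    return result


def read_pipe_csv(path):
    """Read a pipe-delimited file into a list of dicts (one per row)."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="|"))
```

Because the read function only ever sees the copied path, it should not rely on anything else that happens to sit next to the original file on your machine.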
Importing a single file from the internet
You can also directly import a file from a URL. Simply provide the URL
as the path. rixpress handles the download and ensures reproducibility
by caching the file using its cryptographic hash.
library(rixpress)
list(
  rxp_r_file(
    name = mtcars,
    path = 'https://raw.githubusercontent.com/b-rodrigues/rixpress_demos/refs/heads/master/basic_r/data/mtcars.csv',
    read_function = \(x) (read.csv(file = x, sep = "|"))
  ),
  ...
Behind the scenes, rixpress uses Nix to fetch the file, ensuring that
the exact same version of the file is used every time the pipeline is
run. This is the only case in which the build sandbox can use a remote
file, and it works only because Nix downloads the file ahead of time. If
you need to access data in real time from an API, download the data
yourself outside of the rixpress pipeline, and then import it into the
pipeline using rxp_r_file().
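For data that comes from a live API, one way to follow this advice is to take a snapshot before building the pipeline. The sketch below is an assumption about how such a pre-step could look, not something rixpress provides: the function name snapshot_remote_data, the example URL, and the destination path are all hypothetical. It saves the response to a local file and returns its SHA-256 digest, so you can record exactly which version of the data the pipeline was built from.

```python
import hashlib
import urllib.request
from pathlib import Path


def snapshot_remote_data(url, destination):
    """Download `url` to `destination` and return the SHA-256 of the saved bytes."""
    destination = Path(destination)
    destination.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response:
        payload = response.read()
    destination.write_bytes(payload)
    return hashlib.sha256(payload).hexdigest()


# Hypothetical usage, run by hand before building the pipeline:
# digest = snapshot_remote_data("https://example.org/api/export.csv",
#                               "data/export.csv")
# print(digest)
```

The saved file can then be imported like any other local file by passing its location as path.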
Importing many files from a directory
Often, you need to import and combine multiple files from a single
directory. To do this, set the path argument to the directory’s path.
Your read_function will then receive the path to this directory inside
the build environment and must contain the logic to handle all the files
within.
Here is an example in R that reads all files in the data directory:
library(rixpress)
list(
  rxp_r_file(
    name = mtcars_r,
    path = 'data',
    read_function = \(x) {
      (readr::read_delim(list.files(x, full.names = TRUE), delim = '|'))
    }
  )
) |>
  rxp_populate(project_path = ".")
And here’s a similar example using Python, which calls a user-defined
function read_many_csvs from an external script:
library(rixpress)
list(
  rxp_py_file(
    name = mtcars_py,
    path = 'data',
    read_function = "read_many_csvs",
    user_functions = "functions.py"
  )
) |>
  rxp_populate(project_path = ".")
Here is what the Python function looks like:
import polars
from pathlib import Path
def read_many_csvs(dir_path):
    folder = Path(dir_path)
    csv_files = folder.glob("*.csv")
    return polars.concat([polars.read_csv(f) for f in csv_files])
In both cases, the entire data directory is copied into the build
sandbox, and the read_function is responsible for listing the files and
reading them.
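Since the read function owns the file listing, it is also where you decide the order in which files are combined and what happens when the directory is empty. The sketch below is a dependency-free variation on that idea, using the standard library csv module as a stand-in for polars. The name read_many_delimited and the default "|" delimiter are assumptions made for illustration.

```python
import csv
from pathlib import Path


def read_many_delimited(dir_path, delimiter="|"):
    """Read every .csv file in dir_path and return all rows as one list of dicts."""
    # Path.glob yields files in no guaranteed order, so sort to get the
    # same row order on every machine.
    csv_files = sorted(Path(dir_path).glob("*.csv"))
    if not csv_files:
        raise FileNotFoundError(f"No .csv files found in {dir_path}")
    rows = []
    for csv_file in csv_files:
        with csv_file.open(newline="") as handle:
            rows.extend(csv.DictReader(handle, delimiter=delimiter))
    return rows
```

Sorting the file list is a small design choice that keeps the combined result stable from one build to the next.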
Importing files with dependencies (e.g., Shapefiles)
Some file formats, like the ESRI Shapefile, consist of multiple
“sidecar” files (e.g., .shp, .shx, .dbf) that must be present together
for the data to be read correctly. Even though you might only point the
read function to the .shp file, the other component files need to be in
the same directory.
rixpress handles this by allowing you to specify a directory as the
path. This ensures all necessary files are copied into the build
environment. However, you must then provide the full path to the main
file inside the build environment within your read_function.
In a rixpress pipeline, local files and directories specified in path
are copied into a sub-directory called input_folder. Therefore, the path
to your data inside the Nix sandbox will be input_folder/YOUR_PATH.
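This mapping is mechanical, so it can be written down as a small helper. The function below is a hypothetical convenience for illustration, not part of rixpress, and it assumes relative paths like the ones used in this vignette.

```python
from pathlib import PurePosixPath


def sandbox_path(path, *parts):
    """Return where `path` (as given to the path argument) ends up in the sandbox."""
    # Everything passed as `path` lands under the input_folder sub-directory.
    return str(PurePosixPath("input_folder", path, *parts))


print(sandbox_path("data", "oceans.shp"))  # prints input_folder/data/oceans.shp
```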
The following example shows how to read a shapefile using Python and
geopandas:
library(rixpress)
list(
  rxp_py_file(
    name = gdf,
    # We provide the directory 'data' to ensure all shapefile components are copied.
    path = 'data',
    # The read_function must use the hardcoded path within the build environment.
    read_function = "lambda x: geopandas.read_file('input_folder/data/oceans.shp', driver='ESRI Shapefile')"
  ),
  rxp_py(
    name = sa,
    expr = "gdf.loc[gdf['Oceans'] == 'South Atlantic Ocean']['geometry'].loc[0]"
  )
) |>
  rxp_populate(project_path = ".")
Here’s what happens:
- The path = 'data' argument tells rixpress to copy the entire data directory into the sandbox.
- Inside the sandbox, the shapefile is located at input_folder/data/oceans.shp.
- The read_function is a lambda function that explicitly calls geopandas.read_file with this hardcoded path, allowing it to find the .shp file and its necessary sidecar files.
A perhaps cleaner alternative is to write a function that takes the
path to the data folder as an input, and then have this function look in
that folder for the shapefile, and pass its path to
geopandas.read_file. For example:
def read_shp(path_folder):
    # Look for files ending with .shp in the given folder
    candidates = glob.glob(os.path.join(path_folder, "*.shp"))
    if not candidates:
        raise FileNotFoundError(f"No .shp file found in {path_folder}")
    shapefile = candidates[0]
    return gpd.read_file(shapefile, driver="ESRI Shapefile")
We can then rewrite the derivation like so:
rxp_py_file(
  name = gdf,
  path = 'data',
  read_function = "read_shp",
  user_functions = "functions.py"
),
(assuming our function is defined in a script called functions.py).
Because our Python function also uses glob and os, we need to import
these modules using add_import(). We can add this just after calling
rxp_populate():
rxp_populate(
  project_path = ".",
  py_imports = c(geopandas = "import geopandas as gpd")
)
# This is needed for the function defined in functions.py
add_import("import os", "default.nix")
add_import("import glob", "default.nix")
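One caveat with read_shp is that it silently takes the first match when the folder contains several .shp files, and it does not check for the sidecar files. A stricter lookup could be sketched as follows. This is an illustrative variation, not part of rixpress or geopandas: the name find_shapefile is hypothetical, and it assumes lowercase extensions and checks only the .shx and .dbf sidecars mentioned earlier.

```python
from pathlib import Path

# Sidecar extensions named earlier in this vignette (assumed lowercase).
REQUIRED_SIDECARS = (".shx", ".dbf")


def find_shapefile(path_folder):
    """Return the path of the single .shp file in path_folder, or raise."""
    candidates = sorted(Path(path_folder).glob("*.shp"))
    if len(candidates) != 1:
        raise FileNotFoundError(
            f"Expected exactly one .shp file in {path_folder}, "
            f"found {len(candidates)}"
        )
    shapefile = candidates[0]
    # Fail early if a component file was not copied alongside the .shp.
    missing = [
        ext for ext in REQUIRED_SIDECARS
        if not shapefile.with_suffix(ext).exists()
    ]
    if missing:
        raise FileNotFoundError(f"{shapefile.name} is missing sidecar files: {missing}")
    return str(shapefile)


# Possible use inside read_shp (needs geopandas imported as gpd):
# return gpd.read_file(find_shapefile(path_folder), driver="ESRI Shapefile")
```

Since this version relies on pathlib instead of glob and os, the imports added to the pipeline would need to change accordingly.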
Conclusion
The rxp_*_file functions in rixpress offer a powerful and consistent
interface for ingesting data into your reproducible pipelines, whether
your data lives locally, on the web, as a single file, or as a
collection of files. By understanding how to specify the path and tailor
the read_function, you can handle a wide variety of data import tasks.