Partitioned Dataset
PartitionedDataset and LazyPartitionedDataset represent a directory of files where each file is treated as a separate partition.
Requires the polars feature.
PartitionedDataset
Eagerly loads every file with the configured extension in a directory into a HashMap<String, D::LoadItem>:
pub struct PartitionedDataset<D: FileDataset> {
    pub path: String,
    pub ext: String,
    pub dataset: D,
}
path: the directory to read from / write to
ext: file extension to filter by (e.g. "csv", "parquet")
dataset: a template dataset that is cloned and pointed at each file
Loading
Returns HashMap<String, D::LoadItem>, where each key is a filename stem (the filename without its extension):
data/partitions/
    january.csv
    february.csv
    march.csv
// loads as HashMap { "january" => DataFrame, "february" => DataFrame, "march" => DataFrame }
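The stem-to-key mapping can be sketched with the standard library alone. This is an illustration, not the library's implementation: `load_partitions` is a hypothetical helper, and it reads each file as a plain String instead of delegating to a template dataset.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper (not part of the library): reads every `*.{ext}` file
/// in `dir` and keys its contents by filename stem.
fn load_partitions(dir: &Path, ext: &str) -> io::Result<HashMap<String, String>> {
    let mut partitions = HashMap::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        // Skip subdirectories and files with a different extension.
        let matches_ext = path.extension().and_then(|e| e.to_str()) == Some(ext);
        if !path.is_file() || !matches_ext {
            continue;
        }
        // "january.csv" becomes the key "january".
        if let Some(stem) = path.file_stem().and_then(|s| s.to_str()) {
            partitions.insert(stem.to_string(), fs::read_to_string(&path)?);
        }
    }
    Ok(partitions)
}
```

Because the result is a HashMap, iteration order over partitions is unspecified; sort the keys first if order matters.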
Saving
Accepts HashMap<String, D::SaveItem> and writes each entry as {name}.{ext}:
Node {
    name: "split_by_month",
    func: |df: DataFrame| -> (HashMap<String, DataFrame>,) {
        // split DataFrame into partitions...
    },
    input: (&cat.all_data,),
    output: (&cat.monthly,), // PartitionedDataset<PolarsCsvDataset>
}
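The {name}.{ext} naming rule can be sketched as follows. `save_partitions` is a hypothetical helper that writes plain strings; creating the directory when it is missing and leaving unrelated files in place are assumptions of this sketch, not documented behaviour of the library.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper (not part of the library): writes each map entry to
/// `{dir}/{name}.{ext}`.
fn save_partitions(
    dir: &Path,
    ext: &str,
    partitions: &HashMap<String, String>,
) -> io::Result<()> {
    // Assumption: create the target directory if it does not exist yet.
    fs::create_dir_all(dir)?;
    for (name, contents) in partitions {
        // The map key becomes the filename stem.
        let file = dir.join(format!("{name}.{ext}"));
        fs::write(file, contents)?;
    }
    Ok(())
}
```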
LazyPartitionedDataset
Works like PartitionedDataset, but loading returns HashMap<String, Lazy<D::LoadItem>>, so each partition is read only when it is requested:
Node {
    name: "process",
    func: |partitions: HashMap<String, Lazy<DataFrame>>| {
        // only load the partitions you need
        let jan = partitions["january"].load().unwrap();
        // ...
    },
    input: (&cat.monthly,),
    output: (&cat.result,),
}
Lazy<T> wraps a closure. Calling .load() on it runs that closure, which in turn calls dataset.load() for the partition's file.
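A deferred loader of this kind can be modelled in a few lines. The type below is an illustrative stand-in that merely shares the name `Lazy`: the real type's fields, constructor, and error type are not shown in this document, so the boxed closure, `new`, and the String error are all assumptions.

```rust
/// Illustrative stand-in for the library's `Lazy<T>`. The field layout and
/// the `String` error type are assumptions made for this sketch.
struct Lazy<T> {
    loader: Box<dyn Fn() -> Result<T, String>>,
}

impl<T> Lazy<T> {
    /// Hypothetical constructor: stores the closure without running it.
    fn new(loader: impl Fn() -> Result<T, String> + 'static) -> Self {
        Lazy { loader: Box::new(loader) }
    }

    /// Runs the wrapped closure; no work happens until this is called.
    fn load(&self) -> Result<T, String> {
        (self.loader)()
    }
}
```

This sketch re-runs the closure on every call to `load`. Whether the library caches the result is not stated here, so keep the loaded value in a local variable if a partition is needed more than once.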
YAML configuration
monthly:
  path: data/partitions
  ext: csv
  dataset:
    separator: ","
    has_header: true
The dataset field configures the template dataset that is cloned for each partition file.
FileDataset requirement
The inner dataset type must implement FileDataset:
pub trait FileDataset: Dataset + Clone {
    fn path(&self) -> &str;
    fn set_path(&mut self, path: &str);
}
Built-in types that implement FileDataset: PolarsCsvDataset, PolarsParquetDataset, TextDataset, JsonDataset.
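To show what a custom implementation might involve, here is a minimal sketch. The `Dataset` trait below is a stand-in, since its real definition is not shown in this document, and `LinesDataset` and `for_partition` are hypothetical names. Only the `FileDataset` signature is taken from the text above.

```rust
use std::fs;
use std::io;

/// Stand-in for the library's `Dataset` trait (assumed shape, load only).
pub trait Dataset {
    type LoadItem;
    fn load(&self) -> io::Result<Self::LoadItem>;
}

/// Same signature as the trait shown above.
pub trait FileDataset: Dataset + Clone {
    fn path(&self) -> &str;
    fn set_path(&mut self, path: &str);
}

/// Hypothetical file-backed dataset that loads a file as a list of lines.
#[derive(Clone)]
pub struct LinesDataset {
    pub path: String,
}

impl Dataset for LinesDataset {
    type LoadItem = Vec<String>;

    fn load(&self) -> io::Result<Vec<String>> {
        let text = fs::read_to_string(&self.path)?;
        Ok(text.lines().map(str::to_string).collect())
    }
}

impl FileDataset for LinesDataset {
    fn path(&self) -> &str {
        &self.path
    }

    fn set_path(&mut self, path: &str) {
        self.path = path.to_string();
    }
}

/// Clones the template and points the copy at one partition file,
/// leaving the template itself untouched.
fn for_partition<D: FileDataset>(template: &D, dir: &str, name: &str, ext: &str) -> D {
    let mut dataset = template.clone();
    dataset.set_path(&format!("{dir}/{name}.{ext}"));
    dataset
}
```

The `Clone` bound is what makes the template approach work: every partition gets its own copy with its own path, while settings such as separators or header flags carry over unchanged.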