Catalog

The catalog is a plain Rust struct that groups all datasets used by a pipeline. Any struct that derives Serialize and Deserialize and contains dataset fields works as a catalog; there is no special trait to implement.

Dataset fields

Each field in the catalog is a dataset. The field names become the dataset names used in logging, visualization, and error messages. The framework discovers them automatically via serde serialization.

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct Catalog {
    readings: PolarsCsvDataset,
    summary: MemoryDataset<f64>,
    report: JsonDataset,
}

Nested catalogs

Catalogs can nest other structs for organization. This is useful when a pipeline has many datasets that fall into logical groups:

#[derive(Serialize, Deserialize)]
struct InputData {
    raw_readings: PolarsCsvDataset,
    reference: YamlDataset,
}

#[derive(Serialize, Deserialize)]
struct Catalog {
    input: InputData,
    output: JsonDataset,
    intermediate: MemoryDataset<f64>,
}

The discovered dataset names use dot-separated paths: input.raw_readings, input.reference, output, and intermediate. These names appear in logs, hooks, and the viz dashboard.
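
The naming scheme above can be illustrated with a minimal, std-only sketch. It does not use the framework's serde-based discovery: `Node`, `dataset_names`, and `example_catalog` are hypothetical names introduced here, and the catalog shape is written out by hand rather than derived from the struct.

```rust
/// Simplified stand-in for a catalog's shape (hypothetical, not a framework type):
/// either a dataset (leaf) or a container struct holding named fields.
enum Node {
    Leaf,
    Container(Vec<(&'static str, Node)>),
}

/// Collect a dot-separated name for every leaf, depth-first in field order.
fn dataset_names(node: &Node, prefix: &str, out: &mut Vec<String>) {
    match node {
        Node::Leaf => out.push(prefix.to_string()),
        Node::Container(fields) => {
            for (name, child) in fields {
                // Top-level fields have no prefix; nested fields are joined with '.'.
                let path = if prefix.is_empty() {
                    name.to_string()
                } else {
                    format!("{}.{}", prefix, name)
                };
                dataset_names(child, &path, out);
            }
        }
    }
}

/// The nested `Catalog` from the example above, described by hand.
fn example_catalog() -> Node {
    Node::Container(vec![
        (
            "input",
            Node::Container(vec![
                ("raw_readings", Node::Leaf),
                ("reference", Node::Leaf),
            ]),
        ),
        ("output", Node::Leaf),
        ("intermediate", Node::Leaf),
    ])
}
```

Calling `dataset_names(&example_catalog(), "", &mut names)` with an empty `Vec<String>` fills it with the four names listed above, in field declaration order.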

The corresponding YAML mirrors the nesting:

input:
  raw_readings:
    path: data/raw.csv
  reference:
    path: data/ref.yml
output:
  path: data/output.json
intermediate: {}

Naming conventions

The framework uses serde struct names to distinguish leaf datasets from container structs:

  • Types whose serde name ends with "Dataset" are treated as leaf datasets: the indexer stops recursing
  • Param is treated as a leaf by name
  • All other struct names are treated as containers: the indexer recurses into their fields

This means custom dataset types should follow the *Dataset naming convention (e.g. TextDataset, MyCustomDataset). Container structs like nested catalogs or parameter groups must not end with "Dataset".
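
The three rules above amount to a simple check on the struct name. The following is a sketch of that check, assuming the rule is applied exactly as stated; `Kind` and `classify` are hypothetical names, not part of the framework's API.

```rust
/// How the indexer would treat a struct, per the naming rules above.
#[derive(Debug, PartialEq)]
enum Kind {
    /// Recursion stops here: the struct is a dataset (or `Param`).
    Leaf,
    /// The indexer descends into the struct's fields.
    Container,
}

/// Classify a struct by its serde name (hypothetical helper).
fn classify(serde_name: &str) -> Kind {
    if serde_name.ends_with("Dataset") || serde_name == "Param" {
        Kind::Leaf
    } else {
        Kind::Container
    }
}
```

Under this reading, `classify("MyCustomDataset")` is a leaf, while `classify("InputData")` and `classify("Catalog")` are containers. The check is a suffix match, so a name such as `DatasetGroup` still counts as a container.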

Catalog overrides

Dataset configuration can be overridden from the CLI using dot notation:

$ my_app run --catalog output.path=/tmp/result.json
$ my_app run --catalog input.raw_readings.separator=";"
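
To make the dot notation concrete, here is a rough sketch of how one override argument could be split into a field path and a value. It is not the framework's parser: `parse_override` is a hypothetical helper, and rejecting empty path segments is an assumption of this sketch.

```rust
/// Split `key.path=value` into its dot-separated path segments and raw value.
/// Returns `None` if there is no '=' or if any path segment is empty.
/// Hypothetical helper for illustration only.
fn parse_override(arg: &str) -> Option<(Vec<&str>, &str)> {
    // Split at the first '=' so the value itself may contain '='.
    let (key, value) = arg.split_once('=')?;
    if key.split('.').any(|segment| segment.is_empty()) {
        return None;
    }
    Some((key.split('.').collect(), value))
}
```

The shell strips the quotes around `";"` before the program sees the argument, so the second command above arrives as `input.raw_readings.separator=;` and would parse to the path `input`, `raw_readings`, `separator` with the value `;`.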

See YAML Configuration for the full details on overrides.