Cache Dataset
CacheDataset<D> wraps any dataset and caches the loaded/saved value in memory. Subsequent loads return the cached value without hitting the underlying dataset.
Requires the std feature.
Definition
pub struct CacheDataset<D: Dataset> {
    pub dataset: D,
    cache: Arc<Mutex<Option<D::LoadItem>>>,
}
Usage
Wrap any dataset to add caching:
#[derive(Serialize, Deserialize)]
struct Catalog {
    readings: CacheDataset<PolarsCsvDataset>,
}
readings:
  dataset:
    path: data/readings.csv
    separator: ","
Behavior
- First load(): delegates to the inner dataset, caches the result, and returns it
- Subsequent load() calls: return the cached value without re-reading the file
- save(): writes to the inner dataset and updates the cache
- html(): delegates to the inner dataset
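The load and save behavior listed above can be sketched with a simplified wrapper. This is a minimal illustration under stated assumptions, not the library's implementation: the Dataset trait shown here is a reduced stand-in for the real one, and CacheSketch, CountingSource, and the two helper functions are hypothetical names introduced only for this example. html() is left out.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

// Hypothetical, simplified stand-in for the library's Dataset trait.
trait Dataset {
    type LoadItem;
    type SaveItem;
    type Error;
    fn load(&self) -> Result<Self::LoadItem, Self::Error>;
    fn save(&self, item: Self::SaveItem) -> Result<(), Self::Error>;
}

// Simplified wrapper with the same two fields as the definition above.
struct CacheSketch<D: Dataset> {
    dataset: D,
    cache: Arc<Mutex<Option<D::LoadItem>>>,
}

impl<D: Dataset> CacheSketch<D>
where
    D::LoadItem: Clone,
    D::SaveItem: Clone + Into<D::LoadItem>,
{
    fn new(dataset: D) -> Self {
        Self { dataset, cache: Arc::new(Mutex::new(None)) }
    }

    // First call reads the inner dataset; later calls clone the cached value.
    fn load(&self) -> Result<D::LoadItem, D::Error> {
        let mut guard = self.cache.lock().unwrap();
        if let Some(value) = guard.as_ref() {
            return Ok(value.clone());
        }
        let value = self.dataset.load()?;
        *guard = Some(value.clone());
        Ok(value)
    }

    // Writes through to the inner dataset, then refreshes the cache.
    fn save(&self, item: D::SaveItem) -> Result<(), D::Error> {
        self.dataset.save(item.clone())?;
        *self.cache.lock().unwrap() = Some(item.into());
        Ok(())
    }
}

// Toy inner dataset that counts how often it is actually read.
struct CountingSource {
    stored: Mutex<Vec<i32>>,
    reads: AtomicUsize,
}

impl Dataset for CountingSource {
    type LoadItem = Vec<i32>;
    type SaveItem = Vec<i32>;
    type Error = String;

    fn load(&self) -> Result<Vec<i32>, String> {
        self.reads.fetch_add(1, Ordering::SeqCst);
        Ok(self.stored.lock().unwrap().clone())
    }

    fn save(&self, item: Vec<i32>) -> Result<(), String> {
        *self.stored.lock().unwrap() = item;
        Ok(())
    }
}

fn new_cached() -> CacheSketch<CountingSource> {
    CacheSketch::new(CountingSource {
        stored: Mutex::new(vec![1, 2, 3]),
        reads: AtomicUsize::new(0),
    })
}

// Loads three times and reports how many reads reached the inner dataset.
fn reads_after_three_loads() -> usize {
    let cached = new_cached();
    for _ in 0..3 {
        cached.load().unwrap();
    }
    cached.dataset.reads.load(Ordering::SeqCst)
}

// Saves, then loads, and reports the loaded value and the inner read count.
fn load_after_save() -> (Vec<i32>, usize) {
    let cached = new_cached();
    cached.save(vec![7, 8]).unwrap();
    let value = cached.load().unwrap();
    (value, cached.dataset.reads.load(Ordering::SeqCst))
}
```

In this sketch the lock is held while the inner dataset loads, so concurrent callers wait for the first read instead of repeating it. Whether the real CacheDataset makes the same choice is not stated here.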
When to use
Use CacheDataset when a dataset is read by multiple nodes and the underlying I/O is expensive:
(
    Node { name: "analyze", input: (&cat.readings,), .. },
    Node { name: "validate", input: (&cat.readings,), .. },
    Node { name: "summarize", input: (&cat.readings,), .. },
)
Without caching, readings would be loaded from disk three times. With CacheDataset<PolarsCsvDataset>, it's loaded once and served from memory for the remaining reads.
Constraints
The inner dataset must satisfy:
- D::LoadItem: Clone, so the cached value can be cloned on each load
- D::SaveItem: Clone + Into<D::LoadItem>, so saves can update the cache
- PondError: From<D::Error>, for error conversion
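To make the role of each bound concrete, here is a small sketch that exercises them in isolation. Everything in it is an assumption made for illustration: SketchError stands in for PondError (whose real definition is not shown here), and save_and_cache, read_cached, parse_reading, and demo_bounds are hypothetical helpers rather than library functions.

```rust
use std::num::ParseIntError;
use std::sync::Mutex;

// Hypothetical stand-in for PondError; the real type is not shown here.
#[derive(Debug)]
enum SketchError {
    Inner(String),
}

// Mirrors `PondError: From<D::Error>`: lets `?` convert the inner error.
impl From<ParseIntError> for SketchError {
    fn from(e: ParseIntError) -> Self {
        SketchError::Inner(e.to_string())
    }
}

// Mirrors `D::SaveItem: Clone + Into<D::LoadItem>`: one copy goes to the
// inner dataset, the other is converted and stored in the cache.
fn save_and_cache<L, S>(cache: &Mutex<Option<L>>, sink: &mut Vec<S>, item: S)
where
    S: Clone + Into<L>,
{
    sink.push(item.clone()); // stands in for the inner dataset's save
    *cache.lock().unwrap() = Some(item.into());
}

// Mirrors `D::LoadItem: Clone`: the cache keeps its copy and hands out clones.
fn read_cached<L: Clone>(cache: &Mutex<Option<L>>) -> Option<L> {
    cache.lock().unwrap().clone()
}

// An inner operation whose ParseIntError is converted to SketchError by `?`.
fn parse_reading(raw: &str) -> Result<i64, SketchError> {
    let value: i64 = raw.trim().parse()?;
    Ok(value)
}

// Saves a &str, reads it back as a String, and checks that a bad parse fails.
fn demo_bounds() -> (Option<String>, usize, bool) {
    let cache: Mutex<Option<String>> = Mutex::new(None);
    let mut sink: Vec<&'static str> = Vec::new();
    save_and_cache(&cache, &mut sink, "42");
    let cached = read_cached(&cache);
    (cached, sink.len(), parse_reading("not a number").is_err())
}
```

Here the saved type (&str) differs from the loaded type (String), which is the case the Into<D::LoadItem> bound exists for. When both types are the same, the conversion is the identity.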