Parallel Runner

The ParallelRunner executes independent pipeline nodes concurrently using scoped threads.

Requires the std feature.

Overview

#[derive(Default)]
pub struct ParallelRunner;

impl Runner for ParallelRunner {
    fn name(&self) -> &'static str { "parallel" }
    // ...
}

How it works

Builds a dependency graph from the pipeline using build_pipeline_graph()
Identifies source datasets — params and external inputs that are available immediately
Schedules nodes as soon as all their input datasets have been produced
Tracks produced datasets — when a node completes, its output datasets become available, potentially unblocking other nodes
Uses std::thread::scope for safe scoped threads — no 'static bounds needed

Usage

$ my_app run --runner parallel

Or programmatically:

use pondrs::runners::ParallelRunner;

App::new(catalog, params)
    .with_runners((SequentialRunner, ParallelRunner))
    .execute(pipeline)?;

Dependency analysis

The parallel runner determines which nodes can run concurrently by analyzing dataset dependencies:

    param
    /    \
  [a]    [b]     ← a and b can run in parallel (both read param)
    \    /
     [c]         ← c waits for both a and b to complete

Only data dependencies matter — the tuple ordering in your pipeline function is irrelevant to the parallel runner.

Error handling

On the first node failure:

on_node_error fires for the failed node
on_pipeline_error fires for all ancestor pipelines
No new nodes are scheduled
Already-running nodes are allowed to complete (drain)
The first error is returned

Pipeline hooks

Pipeline hooks behave differently than in the sequential runner:

before_pipeline_run fires when all of the pipeline’s declared inputs are available
after_pipeline_run fires when all of the pipeline’s declared outputs have been produced

This means pipeline hooks may fire at different points in wall-clock time compared to sequential execution.

Thread safety requirements

Because nodes run on different threads:

Hooks must be Sync (use Mutex or atomics for mutable state)
Datasets used concurrently must support concurrent access. MemoryDataset uses Arc<Mutex<_>> and is safe. CellDataset is not safe for parallel use.

When to use

Use the parallel runner when:

Your pipeline has independent branches that can run concurrently
Nodes involve I/O (file reads, network calls) where parallelism helps
You’re in a std environment with thread support

Keyboard shortcuts

🤔 pondrs