Orchestrator
The orchestrator is the coordinating component of the distributed computing system.
Its role is to take large computational jobs, decompose them into many small, independent tasks, distribute those tasks across unreliable and heterogeneous agents, and reliably reassemble the results.
Unlike traditional centralized schedulers, the orchestrator is designed to operate in an environment where failure is normal and expected.
See also
- Architecture Overview
- Compute Agent
- Protocol
- Network Membership & Discovery
- Data Sources & Data Service
Role in the System
The orchestrator is responsible for global coordination, but not for execution.
It:
- Understands the structure of large jobs
- Splits jobs into tasks
- Assigns tasks to suitable agents
- Tracks task progress
- Handles failures and retries
It does not:
- Execute computation itself
- Assume agents are reliable
- Require global consensus
Design Goals
- Fault tolerance by default: Node failures, disconnects, and slow responses are expected.
- Dynamic scheduling: Tasks are scheduled based on real-time node capabilities and availability.
- Eventual completion: Progress is asynchronous; correctness matters more than speed.
- Statelessness where possible: Orchestrators should be replaceable and restartable.
- Horizontal scalability: Multiple orchestrators may coexist without tight coordination.
Job Model
A job represents a large unit of work submitted by a user or system component.
A job:
- Is immutable once accepted
- Can be decomposed into smaller tasks
- Has clear completion criteria
- May include inputs as explicit tensors or as references to external data (see Data Sources)
Jobs may represent:
- Large numerical computations
- Training steps or epochs
- Evaluation or benchmarking workloads
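For concreteness, a job record with these properties might look like the following sketch. The `Job` and `DataRef` types and all field names are illustrative assumptions, not a published schema:

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class DataRef:
    """Hypothetical handle to external data (see Data Sources)."""
    uri: str

# An input is either an explicit tensor (nested lists here) or a reference.
JobInput = Union[list, DataRef]

@dataclass(frozen=True)  # frozen mirrors "immutable once accepted"
class Job:
    job_id: str
    inputs: dict                # name -> JobInput
    completion_criteria: str    # e.g. "all tasks reported DONE"

job = Job(
    job_id="job-42",
    inputs={"weights": DataRef("dataref://models/base"), "batch": [[1.0, 2.0]]},
    completion_criteria="all tasks reported DONE",
)
```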
Task Model
A task is the smallest schedulable unit of work.
Task properties:
- Bounded in time and memory
- Executable independently
- Retryable without side effects
- Inputs/outputs may be provided as URIs/handles (DataRefs) to minimize payload sizes
Tasks are designed to be:
- Idempotent
- Deterministic
- Stateless
This enables safe retries and redundant execution.
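A minimal sketch of such a task record, assuming hypothetical `Task` and `TaskState` types; the fields mirror the bounds and retry properties above:

```python
from dataclasses import dataclass
from enum import Enum, auto

class TaskState(Enum):
    PENDING = auto()
    RUNNING = auto()
    DONE = auto()
    FAILED = auto()

@dataclass
class Task:
    task_id: str
    job_id: str
    input_refs: tuple           # DataRef URIs/handles keep payloads small
    timeout_s: float            # bounded in time...
    memory_limit_mb: int        # ...and in memory
    state: TaskState = TaskState.PENDING
    attempt: int = 0            # retries are safe: tasks are idempotent
```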
Job Decomposition
The orchestrator is responsible for transforming a job into a task graph.
Characteristics of decomposition:
- Tasks are as small as practical
- Dependencies are explicit
- Partial results are allowed
The decomposition strategy is job-specific and may evolve over time.
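The sketch below shows one way a dependency-aware task graph could track which tasks are ready to schedule; `TaskGraph` and its methods are hypothetical, not part of the real decomposition API:

```python
from collections import defaultdict

class TaskGraph:
    """Hypothetical task graph: nodes are task IDs, edges are dependencies."""

    def __init__(self):
        self.deps = defaultdict(set)   # task -> tasks it depends on
        self.done = set()

    def add_task(self, task_id, depends_on=()):
        self.deps[task_id] |= set(depends_on)   # dependencies are explicit

    def mark_done(self, task_id):
        self.done.add(task_id)

    def ready(self):
        # A task is schedulable once all of its dependencies have completed.
        return [t for t, d in self.deps.items()
                if t not in self.done and d <= self.done]

# Example: t3 aggregates partial results from t1 and t2.
g = TaskGraph()
g.add_task("t1"); g.add_task("t2"); g.add_task("t3", depends_on=["t1", "t2"])
g.mark_done("t1")
assert g.ready() == ["t2"]   # t3 stays blocked until t2 also completes
```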
Scheduling Strategy
Task assignment is capability-aware.
The orchestrator considers:
- Declared agent capabilities
- Current load and availability
- Historical reliability (optional)
- Data locality and access costs when tasks use DataRefs
There is no assumption of fairness or long-term assignment stability.
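As an illustration, capability-aware assignment can be expressed as a scoring function over eligible agents. The sketch below assumes dict-shaped agent and task records with hypothetical keys; the real capability schema is defined by the Compute Agent and Protocol documents:

```python
def score_agent(agent, task):
    """Hypothetical scoring: higher is better; None means ineligible."""
    # Hard constraint: the agent must declare every capability the task needs.
    if not set(task["required_caps"]) <= set(agent["caps"]):
        return None
    score = 0.0
    score -= agent["load"]                    # prefer lightly loaded agents
    score += agent.get("reliability", 0.5)    # optional historical signal
    if task.get("data_locality") in agent.get("nearby_data", ()):
        score += 1.0                          # cheaper access to DataRefs
    return score

def pick_agent(agents, task):
    scored = [(score_agent(a, task), a) for a in agents]
    eligible = [(s, a) for s, a in scored if s is not None]
    return max(eligible, key=lambda sa: sa[0])[1] if eligible else None
```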
Failure Handling
Failures are handled at the task level.
The orchestrator may:
- Retry tasks on the same agent
- Reschedule tasks to different agents
- Execute tasks redundantly
The orchestrator never assumes a failure is permanent.
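One possible policy along these lines, sketched under the assumption that failures are transient until proven otherwise (function names, thresholds, and action labels are illustrative):

```python
import random

def next_action(task, live_agents, max_local_retries=2):
    """Hypothetical retry policy; `task` is a dict with "attempt"/"agent" keys."""
    task["attempt"] += 1
    # First, cheap retries on the same agent (transient errors are common).
    if task["attempt"] <= max_local_retries:
        return ("retry", task["agent"])
    # Then move to a different agent, if any other is available.
    others = [a for a in live_agents if a != task["agent"]]
    if others:
        return ("reschedule", random.choice(others))
    # No alternative right now: keep the task schedulable and try again;
    # a failure is never assumed to be permanent.
    return ("retry", task["agent"])

def schedule_redundantly(task, agents, copies=2):
    """For stragglers or critical tasks, run multiple copies; idempotent,
    deterministic tasks make accepting the first valid result safe."""
    return [("execute", a, task["task_id"]) for a in agents[:copies]]
```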
Result Collection
Results are collected asynchronously.
The orchestrator:
- Validates result structure
- Associates results with tasks
- Aggregates partial results
- Accepts large outputs as references and may stage or materialize them selectively
Out-of-order and duplicate results are expected and handled gracefully.
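A collector that tolerates duplicate and out-of-order arrivals might look like the following sketch; the structural check here stands in for whatever result schema the protocol actually defines:

```python
class ResultCollector:
    """Hypothetical collector: duplicates and out-of-order arrival are normal."""

    def __init__(self, expected_task_ids):
        self.expected = set(expected_task_ids)
        self.results = {}                 # task_id -> accepted result

    def submit(self, task_id, result):
        if task_id not in self.expected:
            return False                  # unknown task: reject
        if task_id in self.results:
            return False                  # duplicate: drop silently
        # Validate result structure before accepting it.
        if not isinstance(result, dict) or "payload" not in result:
            return False
        self.results[task_id] = result
        return True

    def complete(self):
        return set(self.results) == self.expected
```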
Result Validation & Agent Health
To maintain reliability with permissionless participation, the orchestrator applies validation to results from untrusted agents:
- Perform lightweight checks (e.g., invariants, checksums, recompute on a smaller sample, or redundant execution) before accepting a result.
- Mark agents as “bugged” if they fail validation beyond a threshold; quarantine or remove them from scheduling.
- Prefer trusted agents for critical paths or when resources are scarce.
- When inputs are remote, prefer validation methods that fetch only sampled slices rather than full datasets.
Trust is a scheduling optimization, not a correctness assumption; untrusted agents remain usable under validation.
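A sampled spot check along these lines might look as follows; `validate_by_sampling`, the failure threshold, and the bookkeeping are assumptions for illustration:

```python
import random

def validate_by_sampling(result_rows, recompute_row, sample_size=8, tol=1e-6):
    """Hypothetical spot check: recompute a small random sample of output
    rows locally and compare, rather than re-running the whole task or
    fetching full remote inputs."""
    k = min(sample_size, len(result_rows))
    indices = random.sample(range(len(result_rows)), k)
    return all(abs(result_rows[i] - recompute_row(i)) <= tol for i in indices)

FAILURE_THRESHOLD = 3   # illustrative quarantine threshold

def record_validation(agent_stats, agent_id, passed):
    """Track consecutive validation failures; True means mark as "bugged"
    and quarantine the agent from scheduling."""
    failures = agent_stats.setdefault(agent_id, 0)
    agent_stats[agent_id] = 0 if passed else failures + 1
    return agent_stats[agent_id] >= FAILURE_THRESHOLD
```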
State Management
The orchestrator maintains minimal durable state:
- Job definitions
- Task status
- Partial results
State should be:
- Serializable
- Recoverable after restart
- Independent of specific agents
- Independent of specific storage providers; keep only handles/metadata for data references
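For example, the durable state could be checkpointed as plain serializable data, so a replacement orchestrator can recover without contacting any specific agent or storage provider. The layout below is an illustrative sketch, not a defined format:

```python
import json

def checkpoint(state, path):
    """Persist the minimal durable state: job definitions, task status,
    and partial results, keeping only handles/metadata for data references."""
    with open(path, "w") as f:
        json.dump(state, f)

def recover(path):
    """A restarted or replacement orchestrator resumes from the checkpoint
    alone; nothing here names a specific agent or storage provider."""
    with open(path) as f:
        return json.load(f)

state = {
    "jobs": {"job-1": {"completion_criteria": "all_tasks_done"}},
    "tasks": {"t1": "DONE", "t2": "PENDING"},
    "partial_results": {"t1": {"ref": "dataref://results/t1"}},  # handle only
}
checkpoint(state, "/tmp/orchestrator.ckpt")
assert recover("/tmp/orchestrator.ckpt") == state
```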
Relationship to Compute Agents
The orchestrator treats all agents as:
- Ephemeral
- Untrusted
- Replaceable
Agents are never assumed to be:
- Always available
- Correct
- Unique
These assumptions simplify the orchestration logic.
Relationship to Protocol
All interaction with agents happens through the gRPC protocol.
The orchestrator:
- Never bypasses the protocol
- Never relies on out-of-band coordination
- Uses explicit versioned APIs
This keeps coupling low and evolution manageable.
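With gRPC and generated stubs, an interaction might be wrapped as below. This is a sketch only: the module names `orchestrator_pb2`/`orchestrator_pb2_grpc` and the `AgentService.AssignTask` method are hypothetical placeholders for whatever the versioned proto actually defines:

```python
import grpc
# Stubs generated from the versioned proto (hypothetical module names;
# the real package layout is defined by the Protocol document).
from orchestrator_pb2 import AssignTaskRequest
from orchestrator_pb2_grpc import AgentServiceStub

def assign_task(agent_address, task_id):
    # All agent interaction goes through the protocol; no side channels.
    with grpc.insecure_channel(agent_address) as channel:
        stub = AgentServiceStub(channel)
        return stub.AssignTask(AssignTaskRequest(task_id=task_id))
```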
Relationship to Training & Inference
The orchestrator is workload-agnostic.
Training and inference systems express their needs by defining:
- Job structure
- Task decomposition
- Result aggregation logic
The orchestrator provides execution guarantees, not ML semantics.
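One way to express this boundary is a workload plug-in interface that the orchestrator calls into; the `Workload` class below is a hypothetical sketch of that contract:

```python
from abc import ABC, abstractmethod

class Workload(ABC):
    """Hypothetical plug-in contract: training/inference systems supply the
    ML semantics; the orchestrator supplies the execution guarantees."""

    @abstractmethod
    def decompose(self, job):
        """Return the task graph for this job."""

    @abstractmethod
    def aggregate(self, partial_results):
        """Combine partial results into the job's final output."""
```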
What the Orchestrator Is Not
The orchestrator is not:
- A global consensus system
- A blockchain
- A cluster manager in the traditional sense
- A real-time scheduler
It prioritizes robustness and simplicity over optimal utilization.
Summary
The orchestrator is the system’s coordination brain, but not its executor.
By assuming unreliable nodes and embracing asynchrony, it enables large-scale computation to emerge from many small, voluntary contributions.
This design trades peak efficiency for resilience, openness, and scalability through participation.