Orchestrator
The orchestrator is the coordinating component of the distributed computing system.
Its role is to take large computational jobs, decompose them into many small, independent tasks, distribute those tasks across unreliable and heterogeneous agents, and reliably reassemble the results.
Unlike traditional centralized schedulers, the orchestrator is designed to operate in an environment where failure is normal and expected.
See also
- Architecture Overview
- Compute Agent
- Protocol
- Network Membership & Discovery
- Data Sources & Data Service
Role in the System
The orchestrator is responsible for global coordination, but not for execution.
It:
- Understands the structure of large jobs
- Splits jobs into tasks
- Assigns tasks to suitable agents
- Tracks task progress
- Handles failures and retries
It does not:
- Execute computation itself
- Assume agents are reliable
- Require global consensus
Design Goals
- Fault tolerance by default: Node failures, disconnects, and slow responses are expected.
- Dynamic scheduling: Tasks are scheduled based on real-time node capabilities and availability.
- Eventual completion: Progress is asynchronous; correctness matters more than speed.
- Statelessness where possible: Orchestrators should be replaceable and restartable.
- Horizontal scalability: Multiple orchestrators may coexist without tight coordination.
Job Model
A job represents a large unit of work submitted by a user or system component.
A job:
- Is immutable once accepted
- Can be decomposed into smaller tasks
- Has clear completion criteria
- May include inputs as explicit tensors or as references to external data (see Data Sources)
Jobs may represent:
- Large numerical computations
- Training steps or epochs
- Evaluation or benchmarking workloads
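For concreteness, a job record with these properties might look like the following sketch. The `Job` and `DataRef` types and all field names are illustrative assumptions, not a published schema:

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class DataRef:
    """Hypothetical handle to external data (see Data Sources)."""
    uri: str

# An input is either an explicit tensor (nested lists here) or a reference.
JobInput = Union[list, DataRef]

@dataclass(frozen=True)  # frozen mirrors "immutable once accepted"
class Job:
    job_id: str
    inputs: dict                # name -> JobInput
    completion_criteria: str    # e.g. "all tasks reported DONE"

job = Job(
    job_id="job-42",
    inputs={"weights": DataRef("dataref://models/base"), "batch": [[1.0, 2.0]]},
    completion_criteria="all tasks reported DONE",
)
```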
Task Model
A task is the smallest schedulable unit of work.
Task properties:
- Bounded in time and memory
- Executable independently
- Retryable without side effects
- Inputs/outputs may be provided as URIs/handles (DataRefs) to minimize payload sizes
Tasks are designed to be:
- Idempotent
- Deterministic
- Stateless
This enables safe retries and redundant execution.
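A minimal sketch of such a task record, assuming hypothetical `Task` and `TaskState` types; the fields mirror the bounds and retry properties above:

```python
from dataclasses import dataclass
from enum import Enum, auto

class TaskState(Enum):
    PENDING = auto()
    RUNNING = auto()
    DONE = auto()
    FAILED = auto()

@dataclass
class Task:
    task_id: str
    job_id: str
    input_refs: tuple           # DataRef URIs/handles keep payloads small
    timeout_s: float            # bounded in time...
    memory_limit_mb: int        # ...and in memory
    state: TaskState = TaskState.PENDING
    attempt: int = 0            # retries are safe: tasks are idempotent
```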
Job Decomposition
The orchestrator is responsible for transforming a job into a task graph.
Characteristics of decomposition:
- Tasks are as small as practical
- Dependencies are explicit
- Partial results are allowed
The decomposition strategy is job-specific and may evolve over time.
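The sketch below shows one way a dependency-aware task graph could track which tasks are ready to schedule; `TaskGraph` and its methods are hypothetical, not part of the real decomposition API:

```python
from collections import defaultdict

class TaskGraph:
    """Hypothetical task graph: nodes are task IDs, edges are dependencies."""

    def __init__(self):
        self.deps = defaultdict(set)   # task -> tasks it depends on
        self.done = set()

    def add_task(self, task_id, depends_on=()):
        self.deps[task_id] |= set(depends_on)   # dependencies are explicit

    def mark_done(self, task_id):
        self.done.add(task_id)

    def ready(self):
        # A task is schedulable once all of its dependencies have completed.
        return [t for t, d in self.deps.items()
                if t not in self.done and d <= self.done]

# Example: t3 aggregates partial results from t1 and t2.
g = TaskGraph()
g.add_task("t1"); g.add_task("t2"); g.add_task("t3", depends_on=["t1", "t2"])
g.mark_done("t1")
assert g.ready() == ["t2"]   # t3 stays blocked until t2 also completes
```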
Scheduling Strategy
Task assignment is capability-aware.
The orchestrator considers:
- Declared agent capabilities
- Current load and availability
- Historical reliability (optional)
- Data locality and access costs when tasks use DataRefs
There is no assumption of fairness or long-term assignment stability.
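As an illustration, capability-aware assignment can be expressed as a scoring function over eligible agents. The sketch below assumes dict-shaped agent and task records with hypothetical keys; the real capability schema is defined by the Compute Agent and Protocol documents:

```python
def score_agent(agent, task):
    """Hypothetical scoring: higher is better; None means ineligible."""
    # Hard constraint: the agent must declare every capability the task needs.
    if not set(task["required_caps"]) <= set(agent["caps"]):
        return None
    score = 0.0
    score -= agent["load"]                    # prefer lightly loaded agents
    score += agent.get("reliability", 0.5)    # optional historical signal
    if task.get("data_locality") in agent.get("nearby_data", ()):
        score += 1.0                          # cheaper access to DataRefs
    return score

def pick_agent(agents, task):
    scored = [(score_agent(a, task), a) for a in agents]
    eligible = [(s, a) for s, a in scored if s is not None]
    return max(eligible, key=lambda sa: sa[0])[1] if eligible else None
```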
Failure Handling
Failures are handled at the task level.
The orchestrator may:
- Retry tasks on the same agent
- Reschedule tasks to different agents
- Execute tasks redundantly
The orchestrator never assumes a failure is permanent.
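One possible policy along these lines, sketched under the assumption that failures are transient until proven otherwise (function names, thresholds, and action labels are illustrative):

```python
import random

def next_action(task, live_agents, max_local_retries=2):
    """Hypothetical retry policy; `task` is a dict with "attempt"/"agent" keys."""
    task["attempt"] += 1
    # First, cheap retries on the same agent (transient errors are common).
    if task["attempt"] <= max_local_retries:
        return ("retry", task["agent"])
    # Then move to a different agent, if any other is available.
    others = [a for a in live_agents if a != task["agent"]]
    if others:
        return ("reschedule", random.choice(others))
    # No alternative right now: keep the task schedulable and try again;
    # a failure is never assumed to be permanent.
    return ("retry", task["agent"])

def schedule_redundantly(task, agents, copies=2):
    """For stragglers or critical tasks, run multiple copies; idempotent,
    deterministic tasks make accepting the first valid result safe."""
    return [("execute", a, task["task_id"]) for a in agents[:copies]]
```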
Result Collection
Results are collected asynchronously.
The orchestrator:
- Validates result structure
- Associates results with tasks
- Aggregates partial results
- Accepts large outputs as references and may stage or materialize them selectively
Out-of-order and duplicate results are expected and handled gracefully.
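A collector that tolerates duplicate and out-of-order arrivals might look like the following sketch; the structural check here stands in for whatever result schema the protocol actually defines:

```python
class ResultCollector:
    """Hypothetical collector: duplicates and out-of-order arrival are normal."""

    def __init__(self, expected_task_ids):
        self.expected = set(expected_task_ids)
        self.results = {}                 # task_id -> accepted result

    def submit(self, task_id, result):
        if task_id not in self.expected:
            return False                  # unknown task: reject
        if task_id in self.results:
            return False                  # duplicate: drop silently
        # Validate result structure before accepting it.
        if not isinstance(result, dict) or "payload" not in result:
            return False
        self.results[task_id] = result
        return True

    def complete(self):
        return set(self.results) == self.expected
```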
Result Validation & Agent Health
To maintain reliability with permissionless participation, the orchestrator applies validation to results from untrusted agents:
- Perform lightweight checks (e.g., invariants, checksums, recompute on a smaller sample, or redundant execution) before accepting a result.
- Mark agents as “bugged” if they fail validation beyond a threshold; quarantine or remove them from scheduling.
- Prefer trusted agents for critical paths or when resources are scarce.
- When inputs are remote, prefer validation methods that fetch only sampled slices rather than full datasets.
Trust is a scheduling optimization, not a correctness assumption; untrusted agents remain usable under validation.
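A sampled spot check along these lines might look as follows; `validate_by_sampling`, the failure threshold, and the bookkeeping are assumptions for illustration:

```python
import random

def validate_by_sampling(result_rows, recompute_row, sample_size=8, tol=1e-6):
    """Hypothetical spot check: recompute a small random sample of output
    rows locally and compare, rather than re-running the whole task or
    fetching full remote inputs."""
    k = min(sample_size, len(result_rows))
    indices = random.sample(range(len(result_rows)), k)
    return all(abs(result_rows[i] - recompute_row(i)) <= tol for i in indices)

FAILURE_THRESHOLD = 3   # illustrative quarantine threshold

def record_validation(agent_stats, agent_id, passed):
    """Track consecutive validation failures; True means mark as "bugged"
    and quarantine the agent from scheduling."""
    failures = agent_stats.setdefault(agent_id, 0)
    agent_stats[agent_id] = 0 if passed else failures + 1
    return agent_stats[agent_id] >= FAILURE_THRESHOLD
```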
State Management
The orchestrator maintains minimal durable state:
- Job definitions
- Task status
- Partial results
State should be:
- Serializable
- Recoverable after restart
- Independent of specific agents
- Independent of specific storage providers; keep only handles/metadata for data references
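For example, the durable state could be checkpointed as plain serializable data, so a replacement orchestrator can recover without contacting any specific agent or storage provider. The layout below is an illustrative sketch, not a defined format:

```python
import json

def checkpoint(state, path):
    """Persist the minimal durable state: job definitions, task status,
    and partial results, keeping only handles/metadata for data references."""
    with open(path, "w") as f:
        json.dump(state, f)

def recover(path):
    """A restarted or replacement orchestrator resumes from the checkpoint
    alone; nothing here names a specific agent or storage provider."""
    with open(path) as f:
        return json.load(f)

state = {
    "jobs": {"job-1": {"completion_criteria": "all_tasks_done"}},
    "tasks": {"t1": "DONE", "t2": "PENDING"},
    "partial_results": {"t1": {"ref": "dataref://results/t1"}},  # handle only
}
checkpoint(state, "/tmp/orchestrator.ckpt")
assert recover("/tmp/orchestrator.ckpt") == state
```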
Relationship to Compute Agents
The orchestrator treats all agents as:
- Ephemeral
- Untrusted
- Replaceable
Agents are never assumed to be:
- Always available
- Correct
- Unique
These assumptions simplify the orchestration logic.
Relationship to Protocol
All interaction with agents happens through the gRPC protocol.
The orchestrator:
- Never bypasses the protocol
- Never relies on out-of-band coordination
- Uses explicit versioned APIs
This keeps coupling low and evolution manageable.
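With gRPC and generated stubs, an interaction might be wrapped as below. This is a sketch only: the module names `orchestrator_pb2`/`orchestrator_pb2_grpc` and the `AgentService.AssignTask` method are hypothetical placeholders for whatever the versioned proto actually defines:

```python
import grpc
# Stubs generated from the versioned proto (hypothetical module names;
# the real package layout is defined by the Protocol document).
from orchestrator_pb2 import AssignTaskRequest
from orchestrator_pb2_grpc import AgentServiceStub

def assign_task(agent_address, task_id):
    # All agent interaction goes through the protocol; no side channels.
    with grpc.insecure_channel(agent_address) as channel:
        stub = AgentServiceStub(channel)
        return stub.AssignTask(AssignTaskRequest(task_id=task_id))
```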
Relationship to Training & Inference
The orchestrator is workload-agnostic.
Training and inference systems express their needs by defining:
- Job structure
- Task decomposition
- Result aggregation logic
The orchestrator provides execution guarantees, not ML semantics.
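One way to express this boundary is a workload plug-in interface that the orchestrator calls into; the `Workload` class below is a hypothetical sketch of that contract:

```python
from abc import ABC, abstractmethod

class Workload(ABC):
    """Hypothetical plug-in contract: training/inference systems supply the
    ML semantics; the orchestrator supplies the execution guarantees."""

    @abstractmethod
    def decompose(self, job):
        """Return the task graph for this job."""

    @abstractmethod
    def aggregate(self, partial_results):
        """Combine partial results into the job's final output."""
```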
What the Orchestrator Is Not
The orchestrator is not:
- A global consensus system
- A blockchain
- A cluster manager in the traditional sense
- A real-time scheduler
It prioritizes robustness and simplicity over optimal utilization.
Summary
The orchestrator is the system’s coordination brain, but not its executor.
By assuming unreliable nodes and embracing asynchrony, it enables large-scale computation to emerge from many small, voluntary contributions.
This design trades peak efficiency for resilience, openness, and scalability through participation.