Data Sources & Data Service

The distributed network must handle large tensors and datasets efficiently. Shipping big payloads with every task is wasteful and slow. Instead, we separate data into two categories:


Data Model

Explicit Data

Inputs/outputs are embedded directly in the task/result payload. This is suitable for small tensors (e.g., demo vectors, small matrices) and control data.

Implicit Data (References)

Inputs/outputs are passed as references to external locations. Agents and orchestrators fetch only the needed slices when validating or computing. References are expressed as URIs.

Supported URI schemes (initial):

  • mem:// — in‑memory buffers or ephemeral process store
  • file:// — local filesystem paths (sandboxed per policy)
  • http(s):// — generic object fetch via HTTP(S)
  • s3:// — Amazon S3 objects (usually via pre‑signed URLs)
  • gs:// — Google Cloud Storage (future)

Notes:

  • For cloud backends use pre‑signed URLs or scoped tokens; never ship long‑lived credentials to untrusted agents.
  • Large outputs can also be written by agents to a URI provided by the orchestrator, returning only a reference in results.
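
To make the split concrete, a task input might carry either an inline value or a reference. Below is a minimal Python sketch; the field names (uri, dtype, shape, checksum) are illustrative, not a fixed schema.

  from dataclasses import dataclass
  from typing import List, Optional, Union

  @dataclass
  class ExplicitInput:
      # Small tensor shipped inline with the task payload.
      values: List[float]
      shape: List[int]

  @dataclass
  class ImplicitInput:
      # Reference to an external location; agents fetch only the slices they need.
      uri: str                        # e.g. "s3://bucket/tensors/A.npy" or "mem://job42/A"
      dtype: str                      # e.g. "float32"
      shape: List[int]
      checksum: Optional[str] = None

  TaskInput = Union[ExplicitInput, ImplicitInput]

  # A demo vector travels inline; a large matrix travels by reference.
  small = ExplicitInput(values=[1.0, 2.0, 3.0], shape=[3])
  big = ImplicitInput(uri="s3://bucket/tensors/A.npy", dtype="float32", shape=[100_000, 4_096])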

Views vs Materialization

Many operations can be represented as logical VIEWS over a remote dataset:

  • Transpose, slice/window, reshape, type cast (where safe), and some reductions produce derivable views without copying data.

Materialization creates a concrete in‑memory tensor from a reference or a view:

  • Use when subsequent operators require random access or when performance outweighs bandwidth cost.
  • The orchestrator may stage frequently used data locally to reduce egress.
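
A minimal sketch of the view idea, assuming numpy on the agent side: each view records a transformation chain, and only materialize() touches the underlying bytes.

  import numpy as np

  class TensorView:
      """Logical view over a source tensor; no data is copied until materialize()."""

      def __init__(self, fetch, ops=()):
          self._fetch = fetch          # callable that loads the underlying array
          self._ops = list(ops)        # recorded transformations, applied lazily

      def transpose(self):
          return TensorView(self._fetch, self._ops + [lambda a: a.T])

      def slice(self, rows):
          return TensorView(self._fetch, self._ops + [lambda a: a[rows]])

      def materialize(self):
          # Only this call pays the bandwidth/memory cost.
          a = self._fetch()
          for op in self._ops:
              a = op(a)
          return a

  # Usage: chain views cheaply, materialize once random access is needed.
  view = TensorView(lambda: np.arange(12).reshape(3, 4)).transpose().slice(slice(0, 2))
  block = view.materialize()   # shape (2, 3)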

Validation with references:

  • Cheap checks should fetch only the sampled rows/cols/slices needed for their invariants (e.g., Freivalds vectors for matmul, sampled indices for ewise/unary, partition sums for reduce); see the sketch below.
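
For instance, an elementwise result can be spot-checked by fetching only sampled indices from the input and output references. The fetch_slice helper below is hypothetical; it stands in for a ranged read against whatever backend the URI points at, simulated here with an in-memory dict.

  import random

  def fetch_slice(uri, indices, store):
      # Hypothetical ranged read: in practice an HTTP Range / S3 byte-range
      # request resolved from the URI; here it just reads a dict.
      return [store[uri][i] for i in indices]

  def spot_check_ewise(x_uri, y_uri, op, n, store, samples=32, tol=1e-6):
      """Validate that y = op(x) elementwise by fetching only sampled indices."""
      idx = random.sample(range(n), min(samples, n))
      xs = fetch_slice(x_uri, idx, store)
      ys = fetch_slice(y_uri, idx, store)
      return all(abs(op(xv) - yv) <= tol for xv, yv in zip(xs, ys))

  # Usage against a toy in-memory store standing in for mem:// buffers.
  store = {"mem://job/x": [float(i) for i in range(1000)]}
  store["mem://job/y"] = [v * v for v in store["mem://job/x"]]
  assert spot_check_ewise("mem://job/x", "mem://job/y", lambda v: v * v, 1000, store)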

Data Service

The Data Service is an optional component that manages registration and discovery of implicit data sources.

Responsibilities

  • Register datasets and return handles (stable IDs) that resolve to URIs
  • Store metadata: dtype, shape, layout, partitioning/chunking, checksums
  • Provide pre‑signed/short‑lived access URLs for trusted agents upon request
  • Track basic access logs and lifecycles (TTL, ownership)
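
The metadata attached to a registered dataset might look like the records below; the field names are illustrative rather than a fixed schema.

  from dataclasses import dataclass
  from typing import List, Optional

  @dataclass
  class DatasetMeta:
      dtype: str                          # e.g. "float32"
      shape: List[int]                    # logical tensor shape
      layout: str = "row-major"           # physical layout
      chunks: Optional[List[int]] = None  # partitioning/chunking, if any
      checksum: Optional[str] = None      # e.g. sha256 of the serialized bytes

  @dataclass
  class DatasetRecord:
      handle: str                         # stable ID returned to clients
      uri: str                            # where the bytes actually live
      meta: DatasetMeta
      owner: str = "unknown"
      ttl_seconds: int = 86_400           # lifecycle: expire after a day by default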

Non‑Responsibilities

  • It does not execute compute
  • It is not a global source of truth; multiple data services may exist

API sketch (illustrative)

  • POST /data/register → {handle, uri, meta}
  • GET /data/{handle} → {uris[], meta} (may include pre‑signed links for the requester)
  • POST /data/sign → {signed_uri, expires_at} (for backends like S3)
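
A client of such a service might look like the following sketch, assuming the requests library and a hypothetical service address; the endpoint shapes mirror the sketch above and are illustrative only.

  import requests

  BASE = "http://data-service.local"   # hypothetical deployment address

  def register(uri, meta):
      # Register a dataset and get back a stable handle.
      r = requests.post(f"{BASE}/data/register", json={"uri": uri, "meta": meta}, timeout=10)
      r.raise_for_status()
      return r.json()["handle"]

  def resolve(handle):
      # Resolve a handle to its URIs and metadata (possibly pre-signed for us).
      r = requests.get(f"{BASE}/data/{handle}", timeout=10)
      r.raise_for_status()
      return r.json()

  def sign(handle):
      # Ask for a short-lived signed URL (e.g. for an S3-backed dataset).
      r = requests.post(f"{BASE}/data/sign", json={"handle": handle}, timeout=10)
      r.raise_for_status()
      return r.json()["signed_uri"]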

Access & Trust Policies

  • Untrusted agents: fetch via anonymous/limited links; tight size/time limits; read‑only; no credential propagation.
  • Trusted agents: may receive short‑lived signed URLs bound to identity and constrained by IP/time.
  • Orchestrators should prefer locality and minimize cross‑region egress; schedule tasks near the data.
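
For S3-backed data, short-lived links can be minted as standard pre-signed URLs; a sketch assuming boto3 and hypothetical bucket/key names.

  import boto3

  s3 = boto3.client("s3")

  def short_lived_url(bucket, key, expires=900):
      # Read-only, time-bounded link; no long-lived credentials leave the service.
      return s3.generate_presigned_url(
          "get_object",
          Params={"Bucket": bucket, "Key": key},
          ExpiresIn=expires,
      )

  url = short_lived_url("tensors-bucket", "job42/A.npy")  # valid for 15 minutes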

Privacy:

  • Treat task payloads and referenced data as potentially sensitive.
  • Prefer references + sampling for validation to avoid copying entire datasets.

Failure & Consistency Semantics

  • References may go stale; agents must handle 404/403/timeouts and report failures cleanly.
  • Data is immutable within a job; changing inputs under the same handle during a job is forbidden.
  • Checksums in metadata enable integrity verification for materialized tensors.
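
An agent's fetch path that honors these semantics might look like the sketch below, assuming the requests library and sha256 checksums recorded in the metadata.

  import hashlib
  import requests

  class StaleReference(Exception):
      """Raised when a reference can no longer be resolved (404/403/timeout)."""

  def fetch_and_verify(url, expected_sha256, timeout=30):
      # Fetch a materialized tensor's bytes and verify integrity against metadata.
      try:
          r = requests.get(url, timeout=timeout)
      except requests.Timeout as exc:
          raise StaleReference(f"timeout fetching {url}") from exc
      if r.status_code in (403, 404):
          raise StaleReference(f"{r.status_code} fetching {url}")
      r.raise_for_status()
      digest = hashlib.sha256(r.content).hexdigest()
      if digest != expected_sha256:
          raise ValueError(f"checksum mismatch for {url}: got {digest}")
      return r.content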

Caching Strategy

  • Agent local cache: bounded by size and time; keyed by {uri, range, checksum}.
  • Orchestrator staging: optional; cache popular tensors near schedulers.
  • Metrics: cache hit/miss, bytes_in/bytes_out, avg fetch latency.
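
A minimal sketch of such an agent-side cache, bounded by total size and entry age and keyed as above; the limits and eviction policy are illustrative.

  import time
  from collections import OrderedDict

  class SliceCache:
      """LRU cache for fetched byte ranges, bounded by total size and entry age."""

      def __init__(self, max_bytes=256 * 1024 * 1024, max_age_s=600):
          self.max_bytes = max_bytes
          self.max_age_s = max_age_s
          self._entries = OrderedDict()   # (uri, range, checksum) -> (timestamp, bytes)
          self._size = 0
          self.hits = 0                   # basic metrics: cache hit/miss counters
          self.misses = 0

      def get(self, key):
          entry = self._entries.get(key)
          if entry and time.time() - entry[0] <= self.max_age_s:
              self._entries.move_to_end(key)   # mark as recently used
              self.hits += 1
              return entry[1]
          self.misses += 1
          return None

      def put(self, key, data):
          if key in self._entries:                 # replacing: drop the old size first
              self._size -= len(self._entries[key][1])
          self._entries[key] = (time.time(), data)
          self._size += len(data)
          while self._size > self.max_bytes and self._entries:
              _, (_, evicted) = self._entries.popitem(last=False)   # evict least recently used
              self._size -= len(evicted)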
