Observability

Logging, metrics, and tracing for agents and orchestrators to make the demo auditable and debuggable.

Logs

Recommended levels: debug, info, warn, error

Structured fields (examples):

component: agent | orchestrator
task_id, job_id
operator: matmul | ewise_add | …
duration_ms, bytes_in, bytes_out
validation: started | passed | failed; rounds (for Freivalds)

Example agent log lines:

{"lvl":"info","component":"agent","event":"task_start","task_id":"...","operator":"matmul"}
{"lvl":"info","component":"agent","event":"task_done","task_id":"...","duration_ms":12}

Metrics (Prometheus-style)

Counters and histograms (suggested):

compute_tasks_dispatched_total
compute_tasks_completed_total
compute_validation_attempts_total
compute_validation_failures_total
compute_agents_quarantined_total
compute_bytes_transferred_total
compute_task_latency_ms (histogram)
compute_freivalds_rounds_total

Data & caching metrics:

data_bytes_in_total
data_bytes_out_total
data_cache_hits_total
data_cache_misses_total
data_fetch_latency_ms (histogram)

Labels:

operator, dtype, trusted (true/false), result (ok/failed)
source_scheme (mem/file/http/s3/gs), cache (hit/miss)
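
To illustrate how the metric names and labels above fit together, here is a stdlib-only sketch of a labeled counter and a bucketed histogram. It is an assumption-laden stand-in: a real deployment would more likely use a Prometheus client library, and the `Counter` and `Histogram` classes and the bucket bounds here are hypothetical.

```python
from collections import defaultdict


class Counter:
    """Monotonic counter with one series per distinct label set (illustrative)."""

    def __init__(self, name):
        self.name = name
        self.values = defaultdict(float)

    def inc(self, amount=1, **labels):
        if amount < 0:
            raise ValueError("counters only go up")
        # Stringify label values so that series keys sort consistently.
        key = tuple(sorted((k, str(v)) for k, v in labels.items()))
        self.values[key] += amount

    def render(self):
        """Return sample lines in the style of the Prometheus text format."""
        lines = []
        for key, value in sorted(self.values.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in key)
            series = f"{self.name}{{{label_str}}}" if label_str else self.name
            lines.append(f"{series} {value:g}")
        return lines


class Histogram:
    """Bucketed observations, reported cumulatively per upper bound."""

    def __init__(self, name, buckets):
        self.name = name
        self.buckets = sorted(buckets)
        self.counts = [0] * (len(self.buckets) + 1)  # last slot is +Inf
        self.total = 0.0

    def observe(self, value):
        self.total += value
        for i, bound in enumerate(self.buckets):
            if value <= bound:
                self.counts[i] += 1
                return
        self.counts[-1] += 1

    def cumulative(self):
        """Return (upper_bound, cumulative_count) pairs, ending with +Inf."""
        out, running = [], 0
        for bound, count in zip(self.buckets + [float("inf")], self.counts):
            running += count
            out.append((bound, running))
        return out


# Example wiring with metric names from the lists above.
tasks_completed = Counter("compute_tasks_completed_total")
task_latency = Histogram("compute_task_latency_ms", buckets=[5, 10, 50])  # assumed bounds


def record_task(operator, dtype, trusted, ok, duration_ms):
    """Update the counter and histogram once a task finishes."""
    tasks_completed.inc(
        operator=operator,
        dtype=dtype,
        trusted=str(bool(trusted)).lower(),
        result="ok" if ok else "failed",
    )
    task_latency.observe(duration_ms)
```

Label values such as task_id are left out on purpose: every distinct label combination creates a separate series, so labels are best kept to small, fixed sets like the ones listed above.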

Tracing (optional)

Simple spans:

orchestrator.dispatch
agent.execute
orchestrator.validate

Correlate spans by job_id/task_id.
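
Correlation by job_id/task_id can be sketched as follows. This is an illustrative, stdlib-only example rather than the demo's tracing code: the `span` context manager, the in-memory `SPANS` list, and `spans_for_task` are hypothetical names, and a real setup would export spans to a tracing backend instead of a list.

```python
import time
from contextlib import contextmanager

SPANS = []  # in-memory sink, for illustration only


@contextmanager
def span(name, job_id, task_id=None):
    """Record one span that carries the IDs used for correlation."""
    start = time.perf_counter()
    record = {"name": name, "job_id": job_id, "task_id": task_id, "status": "ok"}
    try:
        yield record
    except Exception:
        record["status"] = "error"
        raise
    finally:
        # Runs on success and on failure, so every span gets a duration.
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        SPANS.append(record)


def spans_for_task(task_id):
    """Return the names of recorded spans sharing a task_id, in completion order."""
    return [s["name"] for s in SPANS if s["task_id"] == task_id]


def demo():
    """Emit the three spans listed above for one task of one job."""
    with span("orchestrator.dispatch", job_id="job-1", task_id="task-1"):
        pass  # choose an agent and send the task
    with span("agent.execute", job_id="job-1", task_id="task-1"):
        pass  # run the operator
    with span("orchestrator.validate", job_id="job-1", task_id="task-1"):
        pass  # check the result
    return spans_for_task("task-1")
```

Because each span carries both IDs, the same records can be grouped per task to debug one execution or per job to follow the whole request.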

Dashboards

Task throughput and latencies by operator
Validation failure rate over time
Quarantine events and affected agents
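
As a worked example for the validation failure rate panel: given periodic samples of the cumulative counters compute_validation_attempts_total and compute_validation_failures_total, the rate per interval is the increase in failures divided by the increase in attempts. The sketch below assumes samples arrive as plain tuples. The function name `failure_rate` and the sample layout are hypothetical.

```python
def failure_rate(samples):
    """Compute per-interval validation failure rates.

    samples: list of (timestamp, attempts_total, failures_total) tuples with
    cumulative counter values, ordered by timestamp.
    Returns a list of (timestamp, rate) pairs, one per interval. The rate is
    None when no attempts were recorded in the interval, which also covers a
    counter reset, where the difference would be negative.
    """
    rates = []
    for (_, a0, f0), (t1, a1, f1) in zip(samples, samples[1:]):
        attempts = a1 - a0
        failures = f1 - f0
        rates.append((t1, failures / attempts if attempts > 0 else None))
    return rates
```

Reporting None for intervals without attempts, rather than zero, keeps quiet periods from being read as periods with perfect validation results.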