Observability
Logging, metrics, and tracing for agents and orchestrators to make the demo auditable and debuggable.
Logs
Recommended levels: debug, info, warn, error
Structured fields (examples):
- component: agent/orchestrator; task_id, job_id
- operator: matmul/ewise_add/…; duration_ms, bytes_in, bytes_out
- validation: started/passed/failed; rounds (for Freivalds)
Example agent log lines:
{"lvl":"info","component":"agent","event":"task_start","task_id":"...","operator":"matmul"}
{"lvl":"info","component":"agent","event":"task_done","task_id":"...","duration_ms":12}
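One way to produce lines in this shape is a small JSON formatter on top of Python's standard `logging` module. This is only a sketch: the `JsonLineFormatter` class, the `make_logger` helper, the `fields` key passed through `extra`, and the `t-1` task ID are hypothetical names chosen for illustration, not part of the demo's code.

```python
import io
import json
import logging


class JsonLineFormatter(logging.Formatter):
    """Render each record as one compact JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        # Python names the level "WARNING"; the recommended level name is "warn".
        level = "warn" if record.levelname == "WARNING" else record.levelname.lower()
        payload = {
            "lvl": level,
            "component": record.name,      # logger name doubles as the component
            "event": record.getMessage(),  # log message doubles as the event name
        }
        # Structured fields arrive via extra={"fields": {...}} (assumed convention).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, separators=(",", ":"))


def make_logger(component: str, stream) -> logging.Logger:
    """Build a logger that writes JSON lines for one component to `stream`."""
    logger = logging.getLogger(component)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonLineFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    return logger


buf = io.StringIO()
log = make_logger("agent", buf)
log.info("task_start", extra={"fields": {"task_id": "t-1", "operator": "matmul"}})
log.info("task_done", extra={"fields": {"task_id": "t-1", "duration_ms": 12}})
lines = buf.getvalue().splitlines()
print("\n".join(lines))
```

Keeping every field at the top level of the JSON object makes the lines easy to filter by task_id or operator in most log tools.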
Metrics (Prometheus-style)
Counters and histograms (suggested):
- compute_tasks_dispatched_total
- compute_tasks_completed_total
- compute_validation_attempts_total
- compute_validation_failures_total
- compute_agents_quarantined_total
- compute_bytes_transferred_total
- compute_task_latency_ms (histogram)
- compute_freivalds_rounds_total
Data & caching metrics:
- data_bytes_in_total
- data_bytes_out_total
- data_cache_hits_total
- data_cache_misses_total
- data_fetch_latency_ms (histogram)
Labels:
- operator, dtype, trusted (true/false), result (ok/failed)
- source_scheme (mem/file/http/s3/gs), cache (hit/miss)
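To illustrate how labeled counters behave, here is a minimal in-memory sketch using only the standard library. It is not a Prometheus client: `LabeledCounter` is a hypothetical stand-in, and attaching the `operator` and `trusted` labels to the validation counters is an assumption made for the example. A real deployment would more likely use an existing client library.

```python
from collections import defaultdict


class LabeledCounter:
    """Tiny in-memory stand-in for a monotonically increasing labeled counter."""

    def __init__(self, name: str, label_names: tuple):
        self.name = name
        self.label_names = label_names
        self._values = defaultdict(float)

    def _key(self, labels: dict) -> tuple:
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}, got {tuple(labels)}")
        return tuple(labels[n] for n in self.label_names)

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        if amount < 0:
            raise ValueError("counters only go up")
        self._values[self._key(labels)] += amount

    def value(self, **labels: str) -> float:
        return self._values.get(self._key(labels), 0.0)

    def render(self) -> str:
        """Render samples as name{label="value",...} value lines."""
        lines = [f"# TYPE {self.name} counter"]
        for key, value in sorted(self._values.items()):
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{pairs}}} {value:g}")
        return "\n".join(lines)


attempts = LabeledCounter("compute_validation_attempts_total", ("operator", "trusted"))
failures = LabeledCounter("compute_validation_failures_total", ("operator", "trusted"))

# Simulate four validations of matmul results from an untrusted agent; one fails.
for passed in (True, True, False, True):
    attempts.inc(operator="matmul", trusted="false")
    if not passed:
        failures.inc(operator="matmul", trusted="false")

labels = {"operator": "matmul", "trusted": "false"}
failure_rate = failures.value(**labels) / attempts.value(**labels)
print(attempts.render())
print(f"failure rate: {failure_rate:.2f}")
```

Dividing the failures counter by the attempts counter for the same label set gives the validation failure rate for that operator.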
Tracing (optional)
Simple spans:
- orchestrator.dispatch
- agent.execute
- orchestrator.validate
Correlate spans by job_id/task_id.
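The correlation idea can be sketched with a small context manager that records each span along with its job_id and task_id. This assumes no particular tracing library: `Span`, `span`, `FINISHED_SPANS`, `spans_for_task`, and the `job-1`/`task-1` IDs are hypothetical, and a real setup might use a tracing SDK instead.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class Span:
    name: str
    job_id: str
    task_id: str
    start_s: float
    end_s: float = 0.0

    @property
    def duration_ms(self) -> float:
        return (self.end_s - self.start_s) * 1000.0


FINISHED_SPANS = []  # finished spans, in completion order


@contextmanager
def span(name: str, job_id: str, task_id: str):
    """Time a block of work and record it with its correlation IDs."""
    s = Span(name, job_id, task_id, start_s=time.perf_counter())
    try:
        yield s
    finally:
        s.end_s = time.perf_counter()
        FINISHED_SPANS.append(s)


def spans_for_task(task_id: str) -> list:
    """Return the names of all finished spans that share a task_id."""
    return [s.name for s in FINISHED_SPANS if s.task_id == task_id]


# One task passes through dispatch, execution, and validation.
with span("orchestrator.dispatch", "job-1", "task-1"):
    pass
with span("agent.execute", "job-1", "task-1"):
    pass
with span("orchestrator.validate", "job-1", "task-1"):
    pass
# A second task in the same job only reaches execution.
with span("agent.execute", "job-1", "task-2"):
    pass

trace = spans_for_task("task-1")
print(trace)
```

Filtering by task_id reconstructs the path of a single task, while filtering by job_id would gather every task in the job.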
Dashboards
- Task throughput and latencies by operator
- Validation failure rate over time
- Quarantine events and affected agents
See also: Demo (E2E) Guide