Observability (logs, metrics, tracing)

Observability is the ability to understand a system's internal behavior from its external outputs: logs (discrete events), metrics (numeric measurements aggregated over time), and traces (the path of a request across services). Together they support debugging, alerting, and performance analysis.

Three pillars

flowchart TB
    subgraph Logs["Logs"]
        L1[Structured events]
        L2[Search, filter]
    end
    subgraph Metrics["Metrics"]
        M1[Counters, gauges, histograms]
        M2[Dashboards, alerts]
    end
    subgraph Tracing["Tracing"]
        T1[Trace ID across services]
        T2[Latency breakdown]
    end

Pillar	What it captures	Typical use
Logs	Event records (timestamp, level, message, context)	Debugging, audit
Metrics	Numeric aggregates (QPS, latency p99, error rate)	Dashboards, SLOs, alerting
Tracing	Request path and timing across services (trace/span IDs)	Latency analysis, dependency map
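
As a rough illustration of the Metrics row, the sketch below aggregates request count, error rate, and a latency percentile in memory. The `RequestMetrics` class and its method names are hypothetical, not part of any metrics library. Production systems such as Prometheus typically use bucketed histograms instead of keeping every sample, so treat this only as a way to see what the numbers mean.

```python
import math
from dataclasses import dataclass, field


@dataclass
class RequestMetrics:
    """In-memory aggregates for one endpoint (illustrative, hypothetical class)."""
    requests: int = 0   # counter: only ever increases
    errors: int = 0     # counter: only ever increases
    latencies_ms: list = field(default_factory=list)  # raw samples, kept for simplicity

    def observe(self, latency_ms: float, ok: bool) -> None:
        """Record one finished request."""
        self.requests += 1
        if not ok:
            self.errors += 1
        self.latencies_ms.append(latency_ms)

    def error_rate(self) -> float:
        """Fraction of requests that failed."""
        return self.errors / self.requests if self.requests else 0.0

    def percentile(self, p: float) -> float:
        """Nearest-rank percentile for p in (0, 100]."""
        if not self.latencies_ms:
            raise ValueError("no observations recorded")
        ordered = sorted(self.latencies_ms)
        rank = math.ceil(p * len(ordered) / 100)
        return ordered[max(rank, 1) - 1]


# Synthetic traffic: latencies of 1..100 ms, every 20th request fails.
metrics = RequestMetrics()
for i in range(1, 101):
    metrics.observe(latency_ms=float(i), ok=(i % 20 != 0))

print(metrics.requests, metrics.error_rate(), metrics.percentile(99))
```

Alerting on `percentile(99)` instead of the mean is the usual reason to track latency as a distribution: a small share of slow requests can hide behind a healthy average.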

Distributed trace

sequenceDiagram
    participant G as Gateway
    participant A as Service A
    participant B as Service B
    G->>A: request (trace_id=xyz)
    A->>B: request (same trace_id)
    B-->>A: response
    A-->>G: response
    Note over G,B: One trace_id ties all spans
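
The flow in the diagram can be sketched without any tracing library. In the example below the three services are plain functions, the `headers` dictionaries stand in for HTTP headers, and `Span`, `start_span`, and `SPANS` are hypothetical names invented for illustration. A real deployment would more likely use OpenTelemetry, which propagates context in the W3C `traceparent` header.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str              # shared by every span in one request
    span_id: str               # unique per unit of work
    parent_id: Optional[str]   # links a span to its caller
    name: str
    duration_ms: float = 0.0


SPANS = []  # stand-in for a tracing backend


@contextmanager
def start_span(name, trace_id, parent_id=None):
    """Time a block of work and record it as a span when it finishes."""
    span = Span(trace_id, uuid.uuid4().hex[:16], parent_id, name)
    started = time.perf_counter()
    try:
        yield span
    finally:
        span.duration_ms = (time.perf_counter() - started) * 1000
        SPANS.append(span)


def service_b(headers):
    with start_span("service-b", headers["trace_id"], headers["parent_span_id"]):
        time.sleep(0.01)  # simulated work


def service_a(headers):
    with start_span("service-a", headers["trace_id"], headers["parent_span_id"]) as span:
        # Forward the same trace_id; only the parent span changes.
        service_b({"trace_id": span.trace_id, "parent_span_id": span.span_id})


def gateway():
    trace_id = uuid.uuid4().hex  # minted once, at the edge
    with start_span("gateway", trace_id) as span:
        service_a({"trace_id": trace_id, "parent_span_id": span.span_id})
    return trace_id


trace_id = gateway()
for s in SPANS:
    print(f"{s.name:10} trace={s.trace_id[:8]} parent={s.parent_id} {s.duration_ms:.1f} ms")
```

Because each span records its parent, a backend can rebuild the call tree and attribute latency to the service where the time was actually spent.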

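Emit structured logs (for example JSON) that carry a correlation ID, so that all entries from one request can be joined. Export metrics (for example via Prometheus or StatsD) that cover RED (rate, errors, duration) for request-driven services, or USE (utilization, saturation, errors) for resources such as CPU and disks. Use distributed tracing (for example OpenTelemetry instrumentation with a backend such as Jaeger) and propagate a common trace ID so you can follow a request across services.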
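
A minimal sketch of structured logging with a correlation ID, using only Python's standard `logging` and `json` modules. The `JsonFormatter` class, the `checkout` logger name, and the chosen field names are assumptions made for illustration; the `trace_id` value `xyz` mirrors the diagram above. Output is captured in a `StringIO` buffer so it can be inspected, where a service would normally write to stdout.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            # formatTime uses local time by default
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set via `extra=`; None when the caller did not supply one.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")  # hypothetical service name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the root logger from printing duplicates

logger.info("payment authorized", extra={"trace_id": "xyz"})
logger.error("inventory lookup failed", extra={"trace_id": "xyz"})

# Each line parses back into a dict, so logs can be filtered by field.
records = [json.loads(line) for line in stream.getvalue().splitlines()]
print([r["message"] for r in records if r["trace_id"] == "xyz"])
```

Putting the ID in its own field, not inside the message text, is what lets a log search tool filter on it directly and link a log line to the matching trace.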