Observability

Operating software on cloud platforms requires three kinds of telemetry: metrics, logs, and traces. Together they provide the evidence needed to trigger automated deployment rollbacks and to monitor the quality of AI agents.

Error rate example (PromQL)

sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
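
To make the arithmetic behind the query above concrete, here is a minimal Python sketch that computes the same ratio from two snapshots of a request counter taken at the start and end of a window. It is an illustration only, not how Prometheus evaluates rate() (which also handles counter resets and extrapolation). The function name error_ratio and the dictionary layout (status code mapped to counter value) are assumptions made for this example.

```python
import re
from typing import Dict, Optional


def error_ratio(start: Dict[str, float], end: Dict[str, float]) -> Optional[float]:
    """Fraction of requests in the window that returned a 5xx status.

    `start` and `end` are hypothetical counter snapshots keyed by HTTP
    status code. Both rates share the same window length, so the window
    duration cancels out and only the counter increases matter.
    """
    # Increase per status code over the window (assumes no counter resets).
    increase = {status: end[status] - start.get(status, 0.0) for status in end}
    total = sum(increase.values())
    if total == 0:
        return None  # no traffic in the window, so the ratio is undefined
    # Same selection as the status=~"5.." matcher: a full match on three characters.
    errors = sum(v for status, v in increase.items() if re.fullmatch(r"5..", status))
    return errors / total
```

Returning None for an idle window is a design choice for this sketch: it keeps "no traffic" distinct from "no errors", which matters when the result feeds an alert.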

What to instrument

Signal    Examples
Metrics   Latency, traffic, errors, saturation
Logs      Structured JSON, correlation IDs
Traces    OpenTelemetry across API and agent tools
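
As an illustration of the Logs row, the following sketch emits structured JSON log lines that carry a correlation ID, using only the Python standard library. The names correlation_id, JsonFormatter, build_logger, and handle_request are hypothetical and chosen for this example; a real service would typically take the ID from an incoming request header rather than generate it.

```python
import contextvars
import json
import logging
import uuid

# Hypothetical context variable holding the ID of the request being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        })


def build_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to `stream` (stderr if None)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger, incoming_id=None) -> str:
    """Log two events that share one correlation ID, then restore the context."""
    token = correlation_id.set(incoming_id or uuid.uuid4().hex)
    try:
        logger.info("request started")
        logger.info("request finished")
        return correlation_id.get()
    finally:
        correlation_id.reset(token)
```

Because every line for a request carries the same correlation_id field, the log entries can be joined with each other and, if the same ID is attached to spans, with traces.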

SLO mindset

  • Define SLOs per user journey (docs read, chat response, API call)
  • Alert on error budget burn, not on every blip
  • Build dashboards that cover Kubernetes and serverless workloads alike
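
Alerting on error budget burn can be sketched as follows. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target); a burn rate of 1 spends the budget exactly over the SLO period. The function names, the two-window check, and the default threshold of 14.4 are assumptions for this example, not values taken from this page, and should be tuned per journey.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget = 1.0 - slo_target  # e.g. an SLO of 0.999 leaves a budget of about 0.001
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_ratio: float,
                short_window_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only when both windows burn fast (threshold is an assumed value).

    The long window shows the burn is sustained; the short window shows it
    is still happening. A brief spike fails the long-window check, and a
    problem that has already recovered fails the short-window check.
    """
    return (burn_rate(long_window_ratio, slo_target) >= threshold
            and burn_rate(short_window_ratio, slo_target) >= threshold)
```

For example, with an SLO target of 0.999, a sustained 2% error ratio in both windows gives a burn rate of about 20 and pages, while a short spike to 5% with a long-window ratio of 0.1% does not.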