Observability

Monitoring Basics

Application monitoring is the ongoing process of identifying, measuring, and evaluating how applications perform in real time so teams can proactively detect and resolve issues, optimize user experience, and ensure business continuity.

  • Application monitoring definition: Ensures software performs as intended by tracking performance and health.
  • Why it matters: Enables proactive issue detection, dependency visibility, and user‑experience optimization.
  • Benefits: Higher availability, fewer incidents, better customer experience.
  • Scope: End‑to‑end visibility from UX to infrastructure across on‑prem, hybrid, and cloud‑native environments.
  • Techniques: Dashboards, alerts, anomaly detection, distributed tracing, and dependency mapping.
  • Tool selection factors: Ease of deployment, the metrics reported, and the quality of intelligent alerting.

What is application monitoring?

Application monitoring brings together four closely related concerns:

  1. Application performance: how quickly and reliably the application responds to requests.
  2. Management tools: the platforms used to collect, visualize, and alert on telemetry.
  3. Data collection: gathering metrics, logs, and events from the application and its infrastructure.
  4. User experience: how that performance is perceived by real users.

Monitoring uses

  • Proactively observe application performance.
  • Isolate and fix issues by following linked events and network calls.
  • Correlate user actions to system behavior and logs.
  • Deliver the best user experience.

Why monitoring is important

  • Keeps applications healthy; reduces outages and interruptions.
  • Reveals slow responses; sends actionable alerts.
  • Improves time to detect and time to resolve.

Types of Monitoring

System monitoring

Tracks availability, uptime, performance, server health, infrastructure, and network characteristics with continuous checks and fundamental metrics such as CPU, memory, disk, and network throughput.

Dependency monitoring

Monitors the downstream and upstream services your system relies on to quickly detect failures, pinpoint root causes, and understand blast radius across a distributed system.

Integration monitoring

  • APIs involved: External or third‑party services (e.g., weather, auth, social, payments).
  • Objectives: Monitor availability/uptime, contract behavior, and performance of integrations to protect core flows.

Web performance monitoring

Measures how fast and reliably a web app loads and behaves from a user’s perspective.

  • Key metrics: FCP/LCP, error locations, asset and third‑party load times.
  • Why it matters: Even a one-second delay can increase bounce rates and reduce conversions.
  • Example tools: Lighthouse, PageSpeed Insights, Pingdom, New Relic Browser, WebPageTest, Sentry.

Security monitoring

Detects security attacks, anomalous traffic, and suspicious network behavior; tracks and blocks threats with retained logs for investigations.


Golden Signals of Monitoring

The four essential indicators of service health are latency, traffic, errors, and saturation. They provide a focused, actionable view for proactive monitoring and troubleshooting.

Latency

  • Time from request sent to completion. Track both successes and failures; compare against target SLOs.

Traffic

  • Demand for a service (e.g., transactions/requests per second). Reveals usage patterns and hotspots.

Errors

  • Obvious failures (5xx, exceptions) and subtle logical errors (wrong content with 200 OK). Critical for early issue detection.

Saturation

  • Percentage of resource utilization. Near 100% risks degradation; consistently <50% may indicate over‑provisioning.
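
To make the four signals concrete, here is a minimal instrumentation sketch for a Node.js/TypeScript service, assuming Express and the prom-client library (neither is prescribed above; the metric and route names are illustrative):

```typescript
// Golden-signal instrumentation sketch (assumes express and prom-client are installed).
import express from "express";
import client from "prom-client";

const app = express();

// Latency: request duration histogram, labeled by route and status code.
const httpDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request latency in seconds",
  labelNames: ["method", "route", "status"],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
});

// Traffic and errors: total requests, labeled by status so an error rate can be derived.
const httpRequests = new client.Counter({
  name: "http_requests_total",
  help: "Total HTTP requests",
  labelNames: ["method", "route", "status"],
});

// Saturation: example gauge for an application-level resource (here, in-flight requests).
const inFlight = new client.Gauge({
  name: "http_requests_in_flight",
  help: "Requests currently being handled",
});

app.use((req, res, next) => {
  inFlight.inc();
  const end = httpDuration.startTimer();
  res.on("finish", () => {
    const labels = { method: req.method, route: req.path, status: String(res.statusCode) };
    end(labels);
    httpRequests.inc(labels);
    inFlight.dec();
  });
  next();
});

// Expose all registered metrics for Prometheus to scrape.
app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", client.register.contentType);
  res.send(await client.register.metrics());
});

app.listen(3000);
```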

Monitoring vs Evaluation

  • Monitoring: Ongoing, operational, performed by teams close to the system to ensure expected performance.
  • Evaluation: Periodic, business‑level assessment of value and effectiveness, often by independent parties.

Components of a Monitoring System

  • Metrics: Quantitative signals of resource usage or behavior.
  • Observability: Analyzing metrics/logs/traces/events to understand component behavior and patterns.
  • Alerting: Automated actions/notifications based on metric/log changes so humans aren’t watching dashboards 24/7.

Desired qualities: independent and reliable infra, easy dashboards, historical data retention, good data correlation, flexible alert routing.


Types of Metrics to Track

  • Host‑based: CPU, memory, disk, processes.
  • Application: Success/error rates, restarts, latency, resource usage.
  • Network and connectivity: Availability, latency, bandwidth, error rates/packet loss.
  • Server pool: Capacity, load handling, responsiveness of groups/clusters.
  • External dependencies: Third‑party availability, success/error rates, run rate, cost, resource limits.

What to choose depends on resources, app complexity, deployment environment, usefulness, stability requirements, and service maturity/SLOs.


Importance of Monitoring

  • Incident prevention: Faster detection, reduced downtime and cost.
  • Hardware/infrastructure efficiency: Better utilization, earlier fault detection, timely repair/replace.

What is Observability

Observability is the ability to understand a system's internal state and cause-and-effect relationships from the data it exposes, without changing its code. It is essential for supporting, debugging, and scaling modern distributed systems.

When observability is missing

  • Missing or incomplete metrics and logs
  • No request correlation to follow a single request across services
  • Manual, time‑consuming investigations with limited visibility
  • The system looks “healthy” from the outside (no alerts, few metrics, partial logs)
  • Engineers tail logs for hours and restart services blindly

Why it matters

  • Without observability, the system becomes a black box.
  • Without traces/metrics/logs you cannot spot slow requests, identify who triggered an error, or explain load spikes.

MELT: the four data pillars

  • Metrics: numeric indicators of performance and state
  • Events: discrete occurrences worth recording, such as deployments, configuration changes, and significant user actions
  • Logs: structured records for detailed analysis
  • Traces: request path across services

Practical observability stack and tooling

Data collection models: Push vs Scrape

Push (service pushes data)

  • Used for logs, correlation IDs, and short‑lived jobs (batch/cron)
  • Requires delivery guarantees and retries

Scrape (collector pulls data)

  • Prometheus scrapes stable services over HTTP, typically on /metrics
  • Simple, scalable, resilient to temporary outages

| Characteristic | Push (logs, short-lived tasks) | Scrape (metrics) |
|----------------|--------------------------------|------------------|
| Initiator      | Service                        | Prometheus       |
| Model          | Asynchronous                   | Pull             |
| Requirements   | Agent/exporter                 | HTTP /metrics    |
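
As a sketch of the push model for short-lived jobs, the snippet below assumes prom-client and a Prometheus Pushgateway at a hypothetical address; the job and metric names are illustrative:

```typescript
// Sketch: a short-lived batch job pushes its metrics, since Prometheus cannot scrape
// a process that exits before the next scrape interval.
// Assumes prom-client and a Pushgateway reachable at the (hypothetical) URL below.
import client from "prom-client";

const registry = new client.Registry();

const lastRunDuration = new client.Gauge({
  name: "batch_job_duration_seconds",
  help: "Duration of the last batch run",
  registers: [registry],
});

async function main() {
  const start = Date.now();
  // ... do the actual batch work here ...
  lastRunDuration.set((Date.now() - start) / 1000);

  // Push once at the end of the run; Prometheus later scrapes the Pushgateway.
  const gateway = new client.Pushgateway("http://pushgateway:9091", {}, registry);
  await gateway.pushAdd({ jobName: "nightly_batch" });
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```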

Prometheus exporters and Node Exporter

  • Exporter: adapter that exposes metrics in Prometheus format
  • Node Exporter: OS/hardware metrics on port 9100
  • Other exporters: Blackbox, Postgres, JMX, NGINX, Redis

Prometheus: data model and key types

  • Time series = metric name + labels + value + timestamp
  • Types: Counter, Gauge, Histogram, Summary
  • Key functions: rate, irate, delta, idelta, aggregations, comparisons, absent, sort, sort_desc, timestamp
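
A brief sketch of the four metric types, again assuming a recent prom-client version; all names are illustrative:

```typescript
// The four Prometheus metric types via prom-client (names are illustrative).
import client from "prom-client";

// Counter: monotonically increasing (e.g., total jobs); query with rate()/irate().
const jobsProcessed = new client.Counter({ name: "jobs_processed_total", help: "Jobs processed" });

// Gauge: goes up and down (e.g., queue depth, memory in use).
const queueDepth = new client.Gauge({ name: "queue_depth", help: "Items waiting in the queue" });

// Histogram: observations bucketed client-side (e.g., latency); quantiles estimated in PromQL.
const jobDuration = new client.Histogram({
  name: "job_duration_seconds",
  help: "Job duration",
  buckets: [0.1, 0.5, 1, 5],
});

// Summary: client-side quantiles over a sliding window.
const payloadSize = new client.Summary({
  name: "payload_size_bytes",
  help: "Payload size",
  percentiles: [0.5, 0.9, 0.99],
});

jobsProcessed.inc();
queueDepth.set(3);
jobDuration.observe(0.42);
payloadSize.observe(2048);

// Each exposed sample is one time series: metric name + labels + value (+ scrape timestamp), e.g.
//   job_duration_seconds_bucket{le="0.5"} 1
client.register.metrics().then(console.log);
```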

Grafana: visualization and alerting

  • Unified UI for metrics, logs, and events
  • Dashboards and alert rules with history, mute/silence, and notification policies
  • Typical flow: Grafana queries Prometheus for metrics and Loki for logs; panels can include Uptime Kuma and Kubernetes events

Logs: Loki + Promtail

  • Apps write structured JSON to stdout
  • Promtail discovers, parses, enriches, and ships logs to Loki
  • Loki indexes by labels for fast filtering and correlation
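
A minimal sketch of the first step, assuming pino for structured JSON logging to stdout; the service and field names are illustrative, not prescribed above:

```typescript
// Structured JSON logging to stdout with pino, ready to be collected by Promtail and Loki.
import pino from "pino";

const logger = pino({
  level: "info",
  base: { service: "orders-api", env: "dev" }, // static fields that become useful Loki filters
  timestamp: pino.stdTimeFunctions.isoTime,    // ISO timestamps read more easily in Grafana
});

logger.info({ orderId: "o-123", durationMs: 42 }, "order processed");
logger.error({ orderId: "o-124", reason: "payment declined" }, "order failed");
// Output is one JSON object per line on stdout, e.g.:
// {"level":"info","time":"...","service":"orders-api","orderId":"o-123","msg":"order processed"}
```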

Lightweight request tracing via correlation (Node.js)

  • Assign a unique requestId per inbound request
  • Propagate with AsyncLocalStorage; add to every log line (e.g., pino)
  • Filter by requestId in Grafana/Loki to reconstruct paths; pair with latency/error metrics
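
A sketch of this correlation pattern, assuming Express, pino, and Node's built-in AsyncLocalStorage; the header name and route are illustrative:

```typescript
// Correlation-ID propagation: one requestId per inbound request, attached to every log line.
import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";
import express from "express";
import pino from "pino";

const requestContext = new AsyncLocalStorage<{ requestId: string }>();

// pino's "mixin" adds the current requestId to every log line automatically.
const logger = pino({
  mixin() {
    return { requestId: requestContext.getStore()?.requestId };
  },
});

const app = express();

// Assign (or reuse) a requestId per inbound request and run the handler inside that context.
app.use((req, _res, next) => {
  const requestId = req.header("x-request-id") ?? randomUUID();
  requestContext.run({ requestId }, next);
});

app.get("/orders/:id", (req, res) => {
  logger.info({ orderId: req.params.id }, "fetching order"); // log line carries requestId
  res.json({ id: req.params.id });
});

app.listen(3000);
```

Filtering Loki on the requestId field then reconstructs the path of a single request across services.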

Events: Grafana + Uptime Kuma + Kubernetes

  • External/internal uptime checks and Kubernetes events correlated with metrics/logs

Alerts

  • Rules on PromQL and log queries; notify Slack/Email/etc.; escalate and silence appropriately; track history

Summary & highlights

  • Monitoring is continuous measurement; evaluation is periodic value assessment
  • Golden Signals (latency, traffic, errors, saturation) focus teams on what matters
  • A complete monitoring system blends metrics, observability analysis, and alerting
  • Track host, application, network, server pool, and external dependency metrics
  • Practical stack: Prometheus (metrics), Loki/Promtail (logs), Grafana (viz/alerts), lightweight tracing via correlation IDs, uptime and events context

Future Improvements

Move toward an OpenTelemetry‑native stack with a single, vendor‑neutral collector and richer testing signals.

Grafana Alloy

  • Unify collection: One agent for metrics, logs, and traces. Replace scattered scrape/shipper configs with Alloy pipelines.
  • Deploy model: Run as DaemonSet on Kubernetes (or systemd on VMs).
  • Pipelines:
    • prometheus.scrape → prometheus.remote_write (metrics to Prometheus)
    • otelcol.receiver.otlp → tempo.write (traces) and loki.write (logs)
    • loki.source.file / loki.source.kubernetes → loki.write (replace Promtail)
  • Benefits: Fewer agents, consistent relabeling, batching/retries/backpressure in one place.

OpenTelemetry

  • Standardize context: Use W3C traceparent and baggage; keep x-request-id as a fallback correlation header.
  • Instrument services: Adopt OTel SDK/auto‑instrumentation for HTTP, DB, and queue clients. Export via OTLP to Alloy.
  • Metrics + exemplars: Emit OTel metrics and link to traces; Alloy forwards to Prometheus RW for dashboards/SLOs.
  • Backends: Traces in Tempo; logs in Loki; metrics in Prometheus — all queried via Grafana.
  • Sampling: Start with 1–10% head sampling; enable tail sampling for slow/error spans on critical paths.
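
A sketch of what instrumenting one service could look like, assuming the OTel Node SDK packages named in the comments; the endpoint and service name are placeholders:

```typescript
// OTel auto-instrumentation sketch for a Node.js service, exporting OTLP traces to a local
// collector (e.g., Alloy's otelcol.receiver.otlp). Assumes @opentelemetry/sdk-node,
// @opentelemetry/auto-instrumentations-node, and @opentelemetry/exporter-trace-otlp-http.
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

const sdk = new NodeSDK({
  serviceName: "orders-api", // placeholder
  traceExporter: new OTLPTraceExporter({
    url: "http://localhost:4318/v1/traces", // OTLP/HTTP receiver; adjust to the Alloy endpoint
  }),
  // Auto-instruments HTTP, Express, database and queue clients where supported.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush spans on shutdown so short-lived processes don't lose data.
process.on("SIGTERM", () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```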

Remove Promtail

  • Migration: Move scrape/label/parser logic to Alloy (loki.source.* + loki.write). Decommission the Promtail DaemonSet.
  • Parity: Preserve existing labels (service, env, namespace, pod) to keep dashboards/alerts intact.

k6 integration

  • Workload testing: Add k6 smoke/load tests in CI and scheduled runs.
  • Metrics export: Send k6 metrics to Prometheus (remote_write or xk6‑output‑prometheus‑remote) for Grafana dashboards and burn‑rate alerts.
  • Scenarios: Use k6 browser for UX‑level checks; consider the k6 operator for Kubernetes‑native distributed tests.
  • Correlation: Annotate dashboards with test runs; compare latency/error budgets before/after releases.
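
A sketch of a k6 smoke test that could back such dashboards and burn-rate alerts; the target URL and thresholds are placeholders:

```typescript
// k6 smoke test sketch (k6 scripts are JavaScript/TypeScript executed by the k6 binary).
import http from "k6/http";
import { check, sleep } from "k6";

export const options = {
  vus: 5,            // virtual users
  duration: "1m",
  thresholds: {
    http_req_failed: ["rate<0.01"],   // error budget: <1% failed requests
    http_req_duration: ["p(95)<500"], // latency SLO: p95 under 500 ms
  },
};

export default function () {
  const res = http.get("https://example.internal/healthz"); // placeholder endpoint
  check(res, { "status is 200": (r) => r.status === 200 });
  sleep(1);
}
```

Run with the k6 CLI; when the Prometheus remote-write output mentioned above is enabled, the same thresholds feed the Grafana dashboards and burn-rate alerts.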

Next steps (incremental)

  • PoC Alloy in non‑prod; mirror current Promtail and Prometheus scrape targets.
  • Instrument one service with OTel and ship OTLP to Alloy; verify traces in Tempo and exemplars in Grafana.
  • Migrate log pipelines; remove Promtail; update runbooks.