Observability

Monitoring Basics

Application monitoring is the ongoing process of identifying, measuring, and evaluating how applications perform in real time so teams can proactively detect and resolve issues, optimize user experience, and ensure business continuity.

  • Application monitoring definition: Ensures software performs as intended by tracking performance and health.
  • Why it matters: Enables proactive issue detection, dependency visibility, and user‑experience optimization.
  • Benefits: Higher availability, fewer incidents, better customer experience.
  • Scope: End‑to‑end visibility from UX to infrastructure across on‑prem, hybrid, and cloud‑native environments.
  • Techniques: Dashboards, alerts, anomaly detection, distributed tracing, and dependency mapping.
  • Tool selection factors: Ease of deployment, the metrics reported, and the quality of intelligent alerting.

What is application monitoring?

Application monitoring brings together four closely related concerns:

  1. Application performance: how quickly and reliably the application responds to requests.
  2. Management tools: the platforms used to collect, visualize, and alert on telemetry.
  3. Data collection: gathering metrics, logs, and events from the application and its infrastructure.
  4. User experience: how that performance is perceived by real users.

Monitoring uses

  • Proactively observe application performance.
  • Isolate and fix issues by following linked events and network calls.
  • Correlate user actions to system behavior and logs.
  • Deliver the best user experience.

Why monitoring is important

  • Keeps applications healthy; reduces outages and interruptions.
  • Reveals slow responses; sends actionable alerts.
  • Improves time to detect and time to resolve.

Types of Monitoring

System monitoring

Tracks availability, uptime, performance, server health, infrastructure, and network characteristics with continuous checks and fundamental metrics such as CPU, memory, disk, and network throughput.

Dependency monitoring

Monitors the downstream and upstream services your system relies on to quickly detect failures, pinpoint root causes, and understand blast radius across a distributed system.

Integration monitoring

  • APIs involved: External or third‑party services (e.g., weather, auth, social, payments).
  • Objectives: Monitor availability/uptime, contract behavior, and performance of integrations to protect core flows.

Web performance monitoring

Measures how fast and reliably a web app loads and behaves from a user’s perspective.

  • Key metrics: FCP/LCP, error locations, asset and third‑party load times.
  • Why it matters: Even a one-second delay can increase bounce rates and reduce conversions.
  • Example tools: Lighthouse, PageSpeed Insights, Pingdom, New Relic Browser, WebPageTest, Sentry.

Security monitoring

Detects security attacks, anomalous traffic, and suspicious network behavior; tracks and blocks threats with retained logs for investigations.


Golden Signals of Monitoring

The four essential indicators of service health are latency, traffic, errors, and saturation. They provide a focused, actionable view for proactive monitoring and troubleshooting.

Latency

  • Time from request sent to completion. Track both successes and failures; compare against target SLOs.

Traffic

  • Demand for a service (e.g., transactions/requests per second). Reveals usage patterns and hotspots.

Errors

  • Obvious failures (5xx, exceptions) and subtle logical errors (wrong content with 200 OK). Critical for early issue detection.

Saturation

  • Percentage of resource utilization. Near 100% risks degradation; consistently <50% may indicate over‑provisioning.
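
To make the four signals concrete, here is a minimal instrumentation sketch for a Node.js/TypeScript service, assuming Express and the prom-client library (neither is prescribed above; the metric and route names are illustrative):

```typescript
// Golden-signal instrumentation sketch (assumes express and prom-client are installed).
import express from "express";
import client from "prom-client";

const app = express();

// Latency: request duration histogram, labeled by route and status code.
const httpDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request latency in seconds",
  labelNames: ["method", "route", "status"],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
});

// Traffic and errors: total requests, labeled by status so an error rate can be derived.
const httpRequests = new client.Counter({
  name: "http_requests_total",
  help: "Total HTTP requests",
  labelNames: ["method", "route", "status"],
});

// Saturation: example gauge for an application-level resource (here, in-flight requests).
const inFlight = new client.Gauge({
  name: "http_requests_in_flight",
  help: "Requests currently being handled",
});

app.use((req, res, next) => {
  inFlight.inc();
  const end = httpDuration.startTimer();
  res.on("finish", () => {
    const labels = { method: req.method, route: req.path, status: String(res.statusCode) };
    end(labels);
    httpRequests.inc(labels);
    inFlight.dec();
  });
  next();
});

// Expose all registered metrics for Prometheus to scrape.
app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", client.register.contentType);
  res.send(await client.register.metrics());
});

app.listen(3000);
```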

Monitoring vs Evaluation

  • Monitoring: Ongoing, operational, performed by teams close to the system to ensure expected performance.
  • Evaluation: Periodic, business‑level assessment of value and effectiveness, often by independent parties.

Components of a Monitoring System

  • Metrics: Quantitative signals of resource usage or behavior.
  • Observability: Analyzing metrics/logs/traces/events to understand component behavior and patterns.
  • Alerting: Automated actions/notifications based on metric/log changes so humans aren’t watching dashboards 24/7.

Desired qualities: independent and reliable infra, easy dashboards, historical data retention, good data correlation, flexible alert routing.


Types of Metrics to Track

  • Host‑based: CPU, memory, disk, processes.
  • Application: Success/error rates, restarts, latency, resource usage.
  • Network and connectivity: Availability, latency, bandwidth, error rates/packet loss.
  • Server pool: Capacity, load handling, responsiveness of groups/clusters.
  • External dependencies: Third‑party availability, success/error rates, run rate, cost, resource limits.

What to choose depends on resources, app complexity, deployment environment, usefulness, stability requirements, and service maturity/SLOs.


Importance of Monitoring

  • Incident prevention: Faster detection, reduced downtime and cost.
  • Hardware/infrastructure efficiency: Better utilization, earlier fault detection, timely repair/replace.

What is Observability

Observability is the ability to understand a system's internal state and cause-and-effect relationships from the data it exposes, without changing its code. It is essential for supporting, debugging, and scaling modern distributed systems.

When observability is missing

  • Missing or incomplete metrics and logs
  • No request correlation to follow a single request across services
  • Manual, time‑consuming investigations with limited visibility
  • The system looks “healthy” from the outside (no alerts, few metrics, partial logs)
  • Engineers tail logs for hours and restart services blindly

Why it matters

  • Without observability, the system becomes a black box.
  • Without traces/metrics/logs you cannot spot slow requests, identify who triggered an error, or explain load spikes.

MELT: the four data pillars

  • Metrics: numeric indicators of performance and state
  • Events: discrete occurrences worth recording, such as deployments, configuration changes, and significant user actions
  • Logs: structured records for detailed analysis
  • Traces: request path across services

Practical observability stack and tooling

Data collection models: Push vs Scrape

Push (service pushes data)

  • Used for logs, correlation IDs, and short‑lived jobs (batch/cron)
  • Requires delivery guarantees and retries

Scrape (collector pulls data)

  • Prometheus scrapes stable services over HTTP, typically on /metrics
  • Simple, scalable, resilient to temporary outages

| Characteristic | Push (logs, short-lived tasks) | Scrape (metrics) |
|----------------|--------------------------------|------------------|
| Initiator      | Service                        | Prometheus       |
| Model          | Asynchronous                   | Pull             |
| Requirements   | Agent/exporter                 | HTTP /metrics    |
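
As a sketch of the push model for short-lived jobs, the snippet below assumes prom-client and a Prometheus Pushgateway at a hypothetical address; the job and metric names are illustrative:

```typescript
// Sketch: a short-lived batch job pushes its metrics, since Prometheus cannot scrape
// a process that exits before the next scrape interval.
// Assumes prom-client and a Pushgateway reachable at the (hypothetical) URL below.
import client from "prom-client";

const registry = new client.Registry();

const lastRunDuration = new client.Gauge({
  name: "batch_job_duration_seconds",
  help: "Duration of the last batch run",
  registers: [registry],
});

async function main() {
  const start = Date.now();
  // ... do the actual batch work here ...
  lastRunDuration.set((Date.now() - start) / 1000);

  // Push once at the end of the run; Prometheus later scrapes the Pushgateway.
  const gateway = new client.Pushgateway("http://pushgateway:9091", {}, registry);
  await gateway.pushAdd({ jobName: "nightly_batch" });
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```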

Prometheus exporters and Node Exporter

  • Exporter: adapter that exposes metrics in Prometheus format
  • Node Exporter: OS/hardware metrics on port 9100
  • Other exporters: Blackbox, Postgres, JMX, NGINX, Redis

Prometheus: data model and key types

  • Time series = metric name + labels + value + timestamp
  • Types: Counter, Gauge, Histogram, Summary
  • Key functions: rate, irate, delta, idelta, aggregations, comparisons, absent, sort, sort_desc, timestamp
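
A brief sketch of the four metric types, again assuming a recent prom-client version; all names are illustrative:

```typescript
// The four Prometheus metric types via prom-client (names are illustrative).
import client from "prom-client";

// Counter: monotonically increasing (e.g., total jobs); query with rate()/irate().
const jobsProcessed = new client.Counter({ name: "jobs_processed_total", help: "Jobs processed" });

// Gauge: goes up and down (e.g., queue depth, memory in use).
const queueDepth = new client.Gauge({ name: "queue_depth", help: "Items waiting in the queue" });

// Histogram: observations bucketed client-side (e.g., latency); quantiles estimated in PromQL.
const jobDuration = new client.Histogram({
  name: "job_duration_seconds",
  help: "Job duration",
  buckets: [0.1, 0.5, 1, 5],
});

// Summary: client-side quantiles over a sliding window.
const payloadSize = new client.Summary({
  name: "payload_size_bytes",
  help: "Payload size",
  percentiles: [0.5, 0.9, 0.99],
});

jobsProcessed.inc();
queueDepth.set(3);
jobDuration.observe(0.42);
payloadSize.observe(2048);

// Each exposed sample is one time series: metric name + labels + value (+ scrape timestamp), e.g.
//   job_duration_seconds_bucket{le="0.5"} 1
client.register.metrics().then(console.log);
```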

Grafana: visualization and alerting

  • Unified UI for metrics, logs, and events
  • Dashboards and alert rules with history, mute/silence, and notification policies
  • Typical flow: Grafana queries Prometheus for metrics and Loki for logs; panels can include Uptime Kuma and Kubernetes events

Logs: Loki + Promtail

  • Apps write structured JSON to stdout
  • Promtail discovers, parses, enriches, and ships logs to Loki
  • Loki indexes by labels for fast filtering and correlation
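
A minimal sketch of the first step, assuming pino for structured JSON logging to stdout; the service and field names are illustrative, not prescribed above:

```typescript
// Structured JSON logging to stdout with pino, ready to be collected by Promtail and Loki.
import pino from "pino";

const logger = pino({
  level: "info",
  base: { service: "orders-api", env: "dev" }, // static fields that become useful Loki filters
  timestamp: pino.stdTimeFunctions.isoTime,    // ISO timestamps read more easily in Grafana
});

logger.info({ orderId: "o-123", durationMs: 42 }, "order processed");
logger.error({ orderId: "o-124", reason: "payment declined" }, "order failed");
// Output is one JSON object per line on stdout, e.g.:
// {"level":"info","time":"...","service":"orders-api","orderId":"o-123","msg":"order processed"}
```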

Lightweight request tracing via correlation (Node.js)

  • Assign a unique requestId per inbound request
  • Propagate with AsyncLocalStorage; add to every log line (e.g., pino)
  • Filter by requestId in Grafana/Loki to reconstruct paths; pair with latency/error metrics
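
A sketch of this correlation pattern, assuming Express, pino, and Node's built-in AsyncLocalStorage; the header name and route are illustrative:

```typescript
// Correlation-ID propagation: one requestId per inbound request, attached to every log line.
import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";
import express from "express";
import pino from "pino";

const requestContext = new AsyncLocalStorage<{ requestId: string }>();

// pino's "mixin" adds the current requestId to every log line automatically.
const logger = pino({
  mixin() {
    return { requestId: requestContext.getStore()?.requestId };
  },
});

const app = express();

// Assign (or reuse) a requestId per inbound request and run the handler inside that context.
app.use((req, _res, next) => {
  const requestId = req.header("x-request-id") ?? randomUUID();
  requestContext.run({ requestId }, next);
});

app.get("/orders/:id", (req, res) => {
  logger.info({ orderId: req.params.id }, "fetching order"); // log line carries requestId
  res.json({ id: req.params.id });
});

app.listen(3000);
```

Filtering Loki on the requestId field then reconstructs the path of a single request across services.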

Events: Grafana + Uptime Kuma + Kubernetes

  • External/internal uptime checks and Kubernetes events correlated with metrics/logs

Alerts

  • Rules on PromQL and log queries; notify Slack/Email/etc.; escalate and silence appropriately; track history

Summary & highlights

  • Monitoring is continuous measurement; evaluation is periodic value assessment
  • Golden Signals (latency, traffic, errors, saturation) focus teams on what matters
  • A complete monitoring system blends metrics, observability analysis, and alerting
  • Track host, application, network, server pool, and external dependency metrics
  • Practical stack: Prometheus (metrics), Loki/Promtail (logs), Grafana (viz/alerts), lightweight tracing via correlation IDs, uptime and events context

Future Improvements

Move toward an OpenTelemetry‑native stack with a single, vendor‑neutral collector and richer testing signals.

Grafana Alloy

  • Unify collection: One agent for metrics, logs, and traces. Replace scattered scrape/shipper configs with Alloy pipelines.
  • Deploy model: Run as DaemonSet on Kubernetes (or systemd on VMs).
  • Pipelines:
    • prometheus.scrape → prometheus.remote_write (metrics to Prometheus)
    • otelcol.receiver.otlp → tempo.write (traces) and loki.write (logs)
    • loki.source.file / loki.source.kubernetes → loki.write (replace Promtail)
  • Benefits: Fewer agents, consistent relabeling, batching/retries/backpressure in one place.

OpenTelemetry

  • Standardize context: Use W3C traceparent and baggage; keep x-request-id as a fallback correlation header.
  • Instrument services: Adopt OTel SDK/auto‑instrumentation for HTTP, DB, and queue clients. Export via OTLP to Alloy.
  • Metrics + exemplars: Emit OTel metrics and link to traces; Alloy forwards to Prometheus RW for dashboards/SLOs.
  • Backends: Traces in Tempo; logs in Loki; metrics in Prometheus — all queried via Grafana.
  • Sampling: Start with 1–10% head sampling; enable tail sampling for slow/error spans on critical paths.
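
A sketch of what instrumenting one service could look like, assuming the OTel Node SDK packages named in the comments; the endpoint and service name are placeholders:

```typescript
// OTel auto-instrumentation sketch for a Node.js service, exporting OTLP traces to a local
// collector (e.g., Alloy's otelcol.receiver.otlp). Assumes @opentelemetry/sdk-node,
// @opentelemetry/auto-instrumentations-node, and @opentelemetry/exporter-trace-otlp-http.
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

const sdk = new NodeSDK({
  serviceName: "orders-api", // placeholder
  traceExporter: new OTLPTraceExporter({
    url: "http://localhost:4318/v1/traces", // OTLP/HTTP receiver; adjust to the Alloy endpoint
  }),
  // Auto-instruments HTTP, Express, database and queue clients where supported.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush spans on shutdown so short-lived processes don't lose data.
process.on("SIGTERM", () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```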

Remove Promtail

  • Migration: Move scrape/label/parser logic to Alloy (loki.source.* + loki.write). Decommission the Promtail DaemonSet.
  • Parity: Preserve existing labels (service, env, namespace, pod) to keep dashboards/alerts intact.

k6 integration

  • Workload testing: Add k6 smoke/load tests in CI and scheduled runs.
  • Metrics export: Send k6 metrics to Prometheus (remote_write or xk6‑output‑prometheus‑remote) for Grafana dashboards and burn‑rate alerts.
  • Scenarios: Use k6 browser for UX‑level checks; consider the k6 operator for Kubernetes‑native distributed tests.
  • Correlation: Annotate dashboards with test runs; compare latency/error budgets before/after releases.
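
A sketch of a k6 smoke test that could back such dashboards and burn-rate alerts; the target URL and thresholds are placeholders:

```typescript
// k6 smoke test sketch (k6 scripts are JavaScript/TypeScript executed by the k6 binary).
import http from "k6/http";
import { check, sleep } from "k6";

export const options = {
  vus: 5,            // virtual users
  duration: "1m",
  thresholds: {
    http_req_failed: ["rate<0.01"],   // error budget: <1% failed requests
    http_req_duration: ["p(95)<500"], // latency SLO: p95 under 500 ms
  },
};

export default function () {
  const res = http.get("https://example.internal/healthz"); // placeholder endpoint
  check(res, { "status is 200": (r) => r.status === 200 });
  sleep(1);
}
```

Run with the k6 CLI; when the Prometheus remote-write output mentioned above is enabled, the same thresholds feed the Grafana dashboards and burn-rate alerts.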

Next steps (incremental)

  • PoC Alloy in non‑prod; mirror current Promtail and Prometheus scrape targets.
  • Instrument one service with OTel and ship OTLP to Alloy; verify traces in Tempo and exemplars in Grafana.
  • Migrate log pipelines; remove Promtail; update runbooks.