Monitoring and Observability for Web Applications

How to build effective monitoring and observability for web applications, covering metrics, logs, traces, and alerting strategies.

Monitoring vs. Observability

Monitoring tells you when something is wrong. Observability tells you why. Both are necessary, but many teams invest heavily in monitoring dashboards while neglecting the ability to investigate and diagnose issues quickly.

Monitoring answers: Is the system healthy? Are response times within SLA? Are error rates elevated?

Observability answers: Why did this specific request take 8 seconds? Why did this user's payment fail? What happened between 2:00 AM and 2:15 AM that caused the error spike?

The Three Pillars

Metrics

Metrics are numerical measurements collected over time. They tell you what is happening at a statistical level.

Essential application metrics:

  • Request rate (requests per second, by endpoint)
  • Error rate (4xx and 5xx responses, by endpoint)
  • Response time (p50, p95, p99 percentiles, by endpoint)
  • Queue depth (pending jobs per queue)
  • Queue processing time (per job type)
  • Database query time (average, p95, p99)
  • Cache hit/miss rate (by cache key prefix)
  • Memory usage (per process, per server)
  • CPU utilization (per server)
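The percentile figures above (p50, p95, p99) can be computed with the nearest-rank method. A minimal sketch in plain PHP; the sample durations are made up, and a real metrics backend would do this aggregation for you:

```php
// Nearest-rank percentile over a window of response-time samples (ms).
// Illustrative helper, not tied to any particular metrics library.
function percentile(array $samples, float $p): float
{
    sort($samples);
    $rank = (int) ceil(($p / 100) * count($samples));
    return $samples[max(0, $rank - 1)];
}

$durations = [120, 95, 480, 101, 2300, 87, 110, 99, 105, 150];
printf(
    "p50=%dms p95=%dms p99=%dms\n",
    percentile($durations, 50),
    percentile($durations, 95),
    percentile($durations, 99)
);
```

Note how a single 2,300ms outlier dominates p95 and p99 while leaving p50 untouched. That is exactly why averages alone hide tail latency.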

The RED method for services:

  • Rate: How many requests per second?
  • Errors: How many of those requests are failing?
  • Duration: How long do those requests take?
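A minimal in-memory sketch of a RED collector; the RedMetrics class is hypothetical, and a real deployment would emit these counters to StatsD, Prometheus, or a hosted service instead of holding them in memory:

```php
// Minimal in-memory RED collector: rate, errors, and duration per endpoint.
// Illustrative only; production code would use a metrics client library.
class RedMetrics
{
    private array $counts = [];
    private array $errors = [];
    private array $durations = [];

    public function record(string $endpoint, int $status, float $durationMs): void
    {
        $this->counts[$endpoint] = ($this->counts[$endpoint] ?? 0) + 1;
        if ($status >= 500) {
            $this->errors[$endpoint] = ($this->errors[$endpoint] ?? 0) + 1;
        }
        $this->durations[$endpoint][] = $durationMs;
    }

    public function errorRate(string $endpoint): float
    {
        $total = $this->counts[$endpoint] ?? 0;
        return $total === 0 ? 0.0 : ($this->errors[$endpoint] ?? 0) / $total;
    }
}

$red = new RedMetrics();
$red->record('POST /invoices', 201, 180.0);
$red->record('POST /invoices', 500, 950.0);
echo $red->errorRate('POST /invoices'); // 0.5
```

Recording status and duration per endpoint is enough to derive all three RED signals: rate from the counts over time, errors from the 5xx counter, and duration from the sample list.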

The USE method for infrastructure:

  • Utilization: How much of the resource is being used?
  • Saturation: How much work is waiting?
  • Errors: Are there error conditions?

Logs

Logs are discrete events with context. They tell you what happened in a specific execution path.

Structured logging is non-negotiable. Unstructured log lines ("Error processing invoice") are nearly useless for investigation. Structured logs ({"event": "invoice_processing_failed", "invoice_id": "inv-123", "error": "payment_declined", "customer_id": "cust-456"}) can be searched, filtered, and aggregated.

Log::error('Invoice processing failed', [
    'invoice_id' => $invoice->id,
    'customer_id' => $invoice->customer_id,
    'error' => $exception->getMessage(),
    'error_class' => get_class($exception),
    'trace' => $exception->getTraceAsString(),
]);

Log levels matter:

  • ERROR and CRITICAL: Things that need immediate human attention
  • WARNING: Things that might become problems
  • INFO: Significant business events (user registered, payment processed, invoice sent)
  • DEBUG: Detailed technical information for troubleshooting (disabled in production by default)

Centralize your logs. Logs on individual servers are useless when you have multiple servers. Ship logs to a centralized system (ELK stack, Grafana Loki, Datadog, or a managed service) where you can search across all servers and services.

Traces

Distributed traces follow a single request through all the services and components it touches. They are essential for understanding performance in systems with multiple services, external API calls, and queue processing.

A trace for an invoice creation might show:

[200ms total]
├── [5ms] Authentication middleware
├── [10ms] Request validation
├── [50ms] CreateInvoiceAction
│   ├── [20ms] Database: INSERT invoice
│   ├── [15ms] Database: INSERT invoice_lines (3 rows)
│   └── [10ms] Calculate tax totals
├── [80ms] External API: tax calculation service
├── [3ms] Dispatch InvoiceCreated event
└── [2ms] Return JSON response

This immediately shows that the external tax API call is the bottleneck, not the database.
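Without a full tracing stack, you can approximate spans by timing each step and recording the result. A hand-rolled sketch; the span helper and step names are illustrative, and a production system would use a tracing SDK such as OpenTelemetry instead:

```php
// Hand-rolled span timing: record each step's duration so a trace tree like
// the one above can be reconstructed from logs. A tracing SDK does this
// (plus cross-service propagation) for you.
function span(string $name, callable $work, array &$spans): mixed
{
    $start = hrtime(true);
    $result = $work();
    $spans[] = ['span' => $name, 'duration_ms' => (hrtime(true) - $start) / 1e6];
    return $result;
}

$spans = [];
span('calculate_tax_totals', fn () => usleep(10_000), $spans); // ~10ms of work
span('external_tax_api', fn () => usleep(80_000), $spans);     // ~80ms of work

foreach ($spans as $s) {
    printf("[%.0fms] %s\n", $s['duration_ms'], $s['span']);
}
```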

Correlation IDs tie logs and traces together. Generate a unique ID for each incoming request and include it in every log entry:

// In a global middleware: attach the request ID to every log entry written
// during this request, and return it to the client so support tickets and
// bug reports can reference it
$requestId = Str::uuid()->toString();
Log::shareContext(['request_id' => $requestId]);
$response->headers->set('X-Request-ID', $requestId);

When investigating an issue, filter logs by the request ID to see everything that happened during that request, across all services.
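For the ID to span services, each service must forward it on its outgoing requests. A minimal sketch with a hypothetical withCorrelation helper; in practice you would apply this in your HTTP client's middleware or default headers:

```php
// Carry the correlation ID forward on outgoing requests so downstream
// services log under the same request_id. Helper name is illustrative.
function withCorrelation(array $headers, string $requestId): array
{
    $headers['X-Request-ID'] = $requestId;
    return $headers;
}

$headers = withCorrelation(['Accept' => 'application/json'], 'req-123');
// $headers now carries both 'Accept' and 'X-Request-ID'
```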

Alerting Strategy

Alert on Symptoms, Not Causes

Good alerts (symptoms):

  • Error rate exceeds 5% for 5 minutes
  • p95 response time exceeds 2 seconds for 10 minutes
  • Queue depth exceeds 1,000 jobs for 15 minutes
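As a concrete illustration, the first symptom alert above could be expressed as a Prometheus alerting rule. This is a sketch: the http_requests_total metric, its labels, and the runbook URL are assumptions about your setup:

```yaml
# Hypothetical Prometheus rule: page when more than 5% of requests
# fail for 5 minutes. Assumes an http_requests_total counter with a
# status label.
groups:
  - name: symptoms
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          runbook: https://runbooks.example.com/high-error-rate
```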

Poor alerts (causes):

  • CPU usage above 80% (high CPU is fine if users are not affected)
  • Disk usage above 70% (needs monitoring but not paging at 3 AM)
  • Individual job failure (one failure is not an incident)

Symptom-based alerts reduce false alarms and alert fatigue. If users are not affected, it can wait until morning.

Alert Severity Levels

  • Critical (page someone immediately): Service is down, data loss is occurring, payments are failing
  • Warning (notify during business hours): Performance degraded, error rate elevated but below critical threshold, disk space running low
  • Info (review in daily standup): Unusual patterns, approaching thresholds, non-urgent anomalies

Runbooks

Every alert should link to a runbook that describes:

  1. What the alert means
  2. What to check first
  3. Common causes and their fixes
  4. Escalation path if the on-call engineer cannot resolve it

A runbook turns a 3 AM alert from "figure out what is going on" into "follow these steps."

Dashboard Design

Overview Dashboard

One screen showing the health of the entire system:

  • Traffic (requests per minute)
  • Error rate
  • Response time percentiles
  • Active users
  • Queue health
  • Database performance

This is what you look at first during an incident.

Service-Level Dashboards

Detailed metrics for each service or domain:

  • Endpoint-level response times and error rates
  • Database query performance
  • External dependency latency
  • Business metrics (invoices processed, payments collected)

Business Dashboards

Metrics that matter to non-technical stakeholders:

  • Revenue processed per hour
  • User signups and activation
  • Feature adoption rates
  • SLA compliance

Starting Simple

You do not need a complex observability platform on day one. Start with:

  1. Application error tracking (Flare, Sentry, Bugsnag) for immediate visibility into exceptions
  2. Structured logging to a centralized service
  3. Basic uptime monitoring (external ping service)
  4. Database slow query logging

As your application and team grow, add metrics collection, distributed tracing, and sophisticated alerting. The important thing is to start, not to start perfectly.
