How to build effective monitoring and observability for web applications, covering metrics, logs, traces, and alerting strategies.
Monitoring tells you when something is wrong. Observability tells you why. Both are necessary, but many teams invest heavily in monitoring dashboards while neglecting the ability to investigate and diagnose issues quickly.
Monitoring answers: Is the system healthy? Are response times within SLA? Are error rates elevated?
Observability answers: Why did this specific request take 8 seconds? Why did this user's payment fail? What happened between 2:00 AM and 2:15 AM that caused the error spike?
Metrics are numerical measurements collected over time. They tell you what is happening at a statistical level.
Essential application metrics can be organized with two well-known frameworks:
The RED method for services: Rate (requests per second), Errors (failed requests per second), and Duration (the distribution of response times, typically tracked as p50, p95, and p99 percentiles).
The USE method for infrastructure: Utilization (how busy a resource is), Saturation (how much work is queued waiting for it), and Errors, applied to every resource such as CPU, memory, disk, and network.
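As a sketch, RED instrumentation can live in a single HTTP middleware. The `Metrics` facade below is hypothetical; substitute whatever metrics client you use (a Prometheus or StatsD library, for example):

```php
// Hypothetical middleware recording the three RED signals per route.
public function handle($request, Closure $next)
{
    $start = microtime(true);
    $response = $next($request);
    $duration = microtime(true) - $start;

    $route = $request->route()?->getName() ?? 'unknown';

    Metrics::increment('requests_total', ['route' => $route]);            // Rate
    if ($response->getStatusCode() >= 500) {
        Metrics::increment('requests_errors_total', ['route' => $route]); // Errors
    }
    Metrics::histogram('request_duration_seconds', $duration, ['route' => $route]); // Duration

    return $response;
}
```

Recording duration as a histogram rather than an average is what makes the percentiles above computable later.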
Logs are discrete events with context. They tell you what happened in a specific execution path.
Structured logging is non-negotiable. Unstructured log lines ("Error processing invoice") are nearly useless for investigation. Structured logs ({"event": "invoice_processing_failed", "invoice_id": "inv-123", "error": "payment_declined", "customer_id": "cust-456"}) can be searched, filtered, and aggregated.
Log::error('Invoice processing failed', [
    'invoice_id' => $invoice->id,
    'customer_id' => $invoice->customer_id,
    'error' => $exception->getMessage(),
    'error_class' => get_class($exception),
    'trace' => $exception->getTraceAsString(),
]);
Log levels matter:
ERROR and CRITICAL: things that need immediate human attention.
WARNING: things that might become problems.
INFO: significant business events (user registered, payment processed, invoice sent).
DEBUG: detailed technical information for troubleshooting (disabled in production by default).
Centralize your logs. Logs on individual servers are useless when you have multiple servers. Ship logs to a centralized system (ELK stack, Grafana Loki, Datadog, or a managed service) where you can search across all servers and services.
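Getting structured JSON out of a Laravel application is a small configuration change, assuming the default Monolog setup (channel name and stream are illustrative):

```php
// config/logging.php: emit JSON lines that a log shipper
// (e.g. promtail or Filebeat) can forward to a central system.
'channels' => [
    'stderr' => [
        'driver' => 'monolog',
        'handler' => Monolog\Handler\StreamHandler::class,
        'formatter' => Monolog\Formatter\JsonFormatter::class,
        'with' => ['stream' => 'php://stderr'],
    ],
],
```

Logging JSON to stderr also plays well with containerized deployments, where the platform collects standard output streams.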
Distributed traces follow a single request through all the services and components it touches. They are essential for understanding performance in systems with multiple services, external API calls, and queue processing.
A trace for an invoice creation might show:
[200ms total]
├── [5ms] Authentication middleware
├── [10ms] Request validation
├── [50ms] CreateInvoiceAction
│ ├── [20ms] Database: INSERT invoice
│ ├── [15ms] Database: INSERT invoice_lines (3 rows)
│ └── [10ms] Calculate tax totals
├── [80ms] External API: tax calculation service
├── [3ms] Dispatch InvoiceCreated event
└── [2ms] Return JSON response
This immediately shows that the external tax API call is the bottleneck, not the database.
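To get a span like that external API call into a trace, you wrap the call yourself. A minimal sketch, assuming the open-telemetry/opentelemetry PHP SDK is installed and configured (the span name and `$taxService` call are illustrative):

```php
use OpenTelemetry\API\Globals;

// Wrap the external tax API call in its own span so it appears
// as a distinct bar in the trace.
$tracer = Globals::tracerProvider()->getTracer('invoices');
$span = $tracer->spanBuilder('tax-service.calculate')->startSpan();
$scope = $span->activate();

try {
    $totals = $taxService->calculate($invoice);
} finally {
    $scope->detach();
    $span->end();
}
```

The try/finally guarantees the span is ended even when the call throws, so failed requests still show up in the trace.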
Correlation IDs tie logs and traces together. Generate a unique ID for each incoming request and include it in every log entry:
// Middleware: tag every log entry and the response with a request ID
public function handle($request, Closure $next)
{
    $requestId = $request->header('X-Request-ID') ?? Str::uuid()->toString();
    Log::shareContext(['request_id' => $requestId]);
    $response = $next($request);
    $response->headers->set('X-Request-ID', $requestId);
    return $response;
}
When investigating an issue, filter logs by the request ID to see everything that happened during that request, across all services.
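In Grafana Loki, for example, that filter is a one-line query (the label name and ID value are illustrative):

```logql
{app="api"} | json | request_id="9f8c1a2e-0b4d-4c7e-a1f3-6d2e8b5c9a10"
```

Any centralized log system with structured fields supports the equivalent: one search term, every log line from that request.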
Good alerts (symptoms): error rate above 1% for five minutes, p99 latency exceeding the SLA, checkout failures, a queue backlog growing faster than it drains.
Poor alerts (causes): CPU above 80%, disk above 70%, a single process restarting, memory usage trending upward.
Symptom-based alerts reduce false alarms and alert fatigue. If users are not affected, it can wait until morning.
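A symptom-based alert in Prometheus's rule format might look like this (metric names, thresholds, and the runbook URL are illustrative):

```yaml
groups:
  - name: symptoms
    rules:
      - alert: HighErrorRate
        # Page only when users see errors: >1% of requests failing for 5 minutes.
        expr: |
          sum(rate(requests_errors_total[5m]))
            / sum(rate(requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "More than 1% of requests are failing"
          runbook_url: https://wiki.example.com/runbooks/high-error-rate
```

The `for: 5m` clause is what separates a real symptom from a momentary blip, and the `runbook_url` annotation puts the next section's advice one click away.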
Every alert should link to a runbook that describes what the alert means, how to confirm the problem is real, the most likely causes, the remediation steps, and when to escalate.
A runbook turns a 3 AM alert from "figure out what is going on" into "follow these steps."
Start with a top-level dashboard: one screen showing the health of the entire system, such as overall request rate, error rate, latency percentiles, and queue depth.
This is what you look at first during an incident.
Below that, per-service dashboards: detailed metrics for each service or domain, such as database query times, cache hit rates, external API latency, and background job throughput.
Finally, business dashboards: metrics that matter to non-technical stakeholders, such as signups, orders placed, and payment success rate.
You do not need a complex observability platform on day one. Start with structured logging shipped to a central system, an error tracker, and a basic uptime check that alerts when the site is down.
As your application and team grow, add metrics collection, distributed tracing, and sophisticated alerting. The important thing is to start, not to start perfectly.