Operations

Monitoring fundamentals

Effective monitoring requires:

Availability checks — is the service up?
Performance metrics — response times, resource usage
Error rates — HTTP 4xx/5xx trends
Business metrics — successful transactions, user flows

Alert on symptoms, not causes. Alert on what users experience (slow pages, errors) rather than on low-level metrics (CPU at 70%) that may not affect the service. Low-level metrics remain useful for diagnosis once a symptom has fired.
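As a rough illustration of symptom-based alerting, the sketch below evaluates one window of traffic and raises alerts only on user-visible signals: the share of 5xx responses and the share of slow responses. The `WindowStats` shape, the function name, and both thresholds are assumptions made for this example, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Request counts for one evaluation window (hypothetical shape)."""
    total_requests: int
    server_errors: int  # responses with a 5xx status
    slow_requests: int  # responses slower than the latency target


def symptom_alerts(stats, max_error_rate=0.01, max_slow_rate=0.05):
    """Return alert messages based on what users experience.

    The default thresholds are illustrative assumptions only.
    """
    if stats.total_requests == 0:
        return []  # no traffic in this window; availability checks cover that case
    alerts = []
    error_rate = stats.server_errors / stats.total_requests
    slow_rate = stats.slow_requests / stats.total_requests
    if error_rate > max_error_rate:
        alerts.append(f"5xx error rate {error_rate:.1%} exceeds {max_error_rate:.1%}")
    if slow_rate > max_slow_rate:
        alerts.append(f"slow request rate {slow_rate:.1%} exceeds {max_slow_rate:.1%}")
    return alerts


if __name__ == "__main__":
    # 25 of 1000 requests failed, 10 were slow: only the error-rate alert fires.
    for message in symptom_alerts(WindowStats(1000, 25, 10)):
        print(message)
```

Note that CPU usage never appears in the check: a busy host that still serves fast, successful responses produces no alert.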

What to monitor

Essential metrics

HTTP status codes — track 4xx and 5xx error rates
Response time — p50, p95, p99 percentiles (not just average)
Availability — synthetic monitoring from multiple locations
SSL/TLS expiry — certificates expire, so alert at least 30 days before the expiry date
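To see why percentiles matter more than the average, the sketch below computes both for a made-up set of 100 latency samples in which two requests are very slow. It uses the nearest-rank percentile definition, which is one of several common definitions; the sample data and function name are assumptions for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: mostly fast, with two very slow requests.
latencies_ms = [100] * 94 + [150] * 4 + [4000] * 2

if __name__ == "__main__":
    mean = sum(latencies_ms) / len(latencies_ms)
    print(f"mean={mean}")                        # mean=180.0
    print(f"p50={percentile(latencies_ms, 50)}")  # p50=100
    print(f"p95={percentile(latencies_ms, 95)}")  # p95=150
    print(f"p99={percentile(latencies_ms, 99)}")  # p99=4000
```

The mean (180 ms) matches no real request: the typical request takes 100 ms, while the p99 shows that some users wait four seconds.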

Optional but useful

CDN cache hit ratio — higher is better (less origin load)
DNS resolution time — slow DNS affects everything
Third-party dependencies — APIs you rely on
Disk space — especially for logs and uploads
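A disk space check can be as small as the sketch below, which uses `shutil.disk_usage` from the Python standard library. The watched path and the 85% threshold are placeholders chosen for this example; point it at wherever logs and uploads actually live.

```python
import shutil

# Hypothetical list; replace with the real log and upload directories.
WATCHED_PATHS = ["."]


def used_percent(path):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total * 100


def paths_over_threshold(paths, threshold_percent=85.0):
    """Return (path, used %) for every path at or above the threshold."""
    results = []
    for path in paths:
        pct = used_percent(path)
        if pct >= threshold_percent:
            results.append((path, round(pct, 1)))
    return results


if __name__ == "__main__":
    for path, pct in paths_over_threshold(WATCHED_PATHS):
        print(f"{path} is {pct}% full")
```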

Troubleshooting approach

When something breaks:

Define the problem — what's actually failing?
Gather data — logs, metrics, user reports
Form hypothesis — what could cause this?
Test systematically — rule out possibilities
Implement fix — change one thing at a time
Verify — confirm the issue is resolved
Document — what happened and how you fixed it

Error codes hub

Common HTTP status codes and typical meanings:

403 Forbidden — the server understood the request but refuses to authorize it; presenting the same credentials again will not help (see detailed 403 guide)
404 Not Found — resource doesn't exist at this path
500 Internal Server Error — application crash or misconfiguration
502 Bad Gateway — upstream server unreachable or returned invalid response
503 Service Unavailable — temporary overload or maintenance
504 Gateway Timeout — upstream took too long to respond

Each error type calls for a different diagnostic approach. See the error codes reference for detailed troubleshooting.
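One way to encode the list above is a small lookup that turns a status code into a first diagnostic step. The sketch below uses `http.HTTPStatus` from the Python standard library for the reason phrases; the `FIRST_CHECKS` table and the `triage` function are hypothetical, and the hints only restate the typical meanings listed above.

```python
from http import HTTPStatus

# Hypothetical starting points, derived from the typical meanings listed above.
FIRST_CHECKS = {
    403: "access rules and permissions for the requested resource",
    404: "the requested path and routing rules",
    500: "application logs for crashes or misconfiguration",
    502: "whether the upstream server is reachable and returning valid responses",
    503: "load on the service and any ongoing maintenance",
    504: "upstream response times and timeout settings",
}


def triage(status_code):
    """Describe a status code and suggest where to look first."""
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        phrase = "Unknown"
    if 400 <= status_code <= 499:
        side = "client error"
    elif 500 <= status_code <= 599:
        side = "server error"
    else:
        side = "not an error class"
    hint = FIRST_CHECKS.get(status_code, "the error codes reference")
    return f"{status_code} {phrase} ({side}): check {hint}"


if __name__ == "__main__":
    for code in (403, 502, 504):
        print(triage(code))
```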

Log analysis basics

Server logs show:

Request patterns (URLs, methods, user agents)
Error occurrences (with timestamps for correlation)
IP addresses (for abuse detection)
Response times (performance issues)

Parse logs systematically:

[timestamp] [IP] [method] [path] [status] [bytes] [response_time]
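A minimal parser for this layout might look like the sketch below. It assumes the brackets in the format line are placeholders, that fields are separated by single spaces, that the timestamp contains no spaces, and that the response time is in seconds. Real server log formats differ, so the regular expression would need adjusting; the sample line and all names are invented for illustration.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: timestamp ip method path status bytes response_time
LINE_RE = re.compile(
    r"(?P<timestamp>\S+) (?P<ip>\S+) (?P<method>[A-Z]+) (?P<path>\S+) "
    r"(?P<status>\d{3}) (?P<bytes>\d+) (?P<response_time>\d+(?:\.\d+)?)$"
)


class LogEntry(NamedTuple):
    timestamp: str
    ip: str
    method: str
    path: str
    status: int
    size_bytes: int
    response_time: float  # assumed to be seconds


def parse_line(line: str) -> Optional[LogEntry]:
    """Parse one log line, returning None when it does not match the assumed layout."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return LogEntry(
        timestamp=match["timestamp"],
        ip=match["ip"],
        method=match["method"],
        path=match["path"],
        status=int(match["status"]),
        size_bytes=int(match["bytes"]),
        response_time=float(match["response_time"]),
    )


if __name__ == "__main__":
    # Hypothetical sample line.
    print(parse_line("2024-05-01T12:00:00Z 203.0.113.7 GET /checkout 502 512 1.234"))
```

Returning `None` for unmatched lines keeps malformed entries from stopping a batch run; counting them separately is a cheap way to notice when the log format changes.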

Look for patterns: sudden status code spikes, slow endpoints, unusual user agents (bots), repeated requests from single IPs (attacks or scrapers).
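Two of these patterns, repeated requests from a single IP and a rising share of server errors, can be sketched with `collections.Counter`. The example below works on already-parsed `(ip, status)` pairs; the sample data, function names, and the request threshold are assumptions for illustration.

```python
from collections import Counter


def noisy_ips(entries, min_requests):
    """Return (ip, request count) pairs for IPs with at least min_requests, busiest first."""
    counts = Counter(ip for ip, _status in entries)
    return [(ip, n) for ip, n in counts.most_common() if n >= min_requests]


def server_error_share(entries):
    """Fraction of requests that ended in a 5xx status."""
    if not entries:
        return 0.0
    errors = sum(1 for _ip, status in entries if 500 <= status <= 599)
    return errors / len(entries)


# Hypothetical parsed entries: (ip, status)
entries = (
    [("203.0.113.7", 200)] * 5
    + [("198.51.100.2", 502)] * 2
    + [("192.0.2.9", 200)]
)

if __name__ == "__main__":
    print(noisy_ips(entries, min_requests=5))  # [('203.0.113.7', 5)]
    print(server_error_share(entries))         # 0.25
```

Running the same calculations per time window (per minute, for example) turns the error share into the spike detection described above.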

Related sections

Infrastructure

Security