Monitoring fundamentals
Effective monitoring requires:
- Availability checks — is the service up?
- Performance metrics — response times, resource usage
- Error rates — HTTP 4xx/5xx trends
- Business metrics — successful transactions, user flows
Alert on symptoms, not causes. Monitor what users experience (slow pages, errors) rather than low-level metrics (CPU at 70%) that may not affect service.
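As a minimal sketch of symptom-based alerting, the check below looks only at what users would notice: the share of 5xx responses and the p95 response time in one evaluation window. The `WindowStats` shape, the function name, and both thresholds (1% errors, 1000 ms) are illustrative assumptions, not values from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Aggregated request stats for one evaluation window (hypothetical shape)."""
    total_requests: int
    server_errors: int      # responses with a 5xx status
    p95_latency_ms: float   # 95th percentile response time


def symptom_alerts(stats, max_error_rate=0.01, max_p95_ms=1000.0):
    """Return alert messages for user-visible symptoms only."""
    alerts = []
    if stats.total_requests == 0:
        return alerts  # no traffic in this window, nothing to judge
    error_rate = stats.server_errors / stats.total_requests
    if error_rate > max_error_rate:
        alerts.append(f"5xx error rate {error_rate:.1%} exceeds {max_error_rate:.1%}")
    if stats.p95_latency_ms > max_p95_ms:
        alerts.append(
            f"p95 latency {stats.p95_latency_ms:.0f} ms exceeds {max_p95_ms:.0f} ms"
        )
    return alerts
```

Note that CPU usage does not appear in the check at all; under this approach it would stay on a dashboard for diagnosis rather than trigger an alert.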
What to monitor
Essential metrics
- HTTP status codes — track 4xx and 5xx error rates
- Response time — p50, p95, p99 percentiles (not just average)
- Availability — synthetic monitoring from multiple locations
- SSL/TLS expiry — alert at least 30 days before a certificate expires
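To illustrate why percentiles matter more than the average, here is a small sketch using Python's standard `statistics` module. The sample data is made up for illustration: 90 fast requests and 10 very slow ones.

```python
import statistics


def latency_percentiles(samples_ms):
    """Return p50, p95 and p99 for a list of response times in milliseconds."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Illustrative data only: 90 requests at 100 ms, 10 requests at 3000 ms.
samples = [100.0] * 90 + [3000.0] * 10

print("mean:", statistics.mean(samples))
print(latency_percentiles(samples))
```

With this data the mean is 390 ms, which describes no real request: the median user sees 100 ms, while the p95 and p99 users wait 3000 ms.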
Optional but useful
- CDN cache hit ratio — higher is better (less origin load)
- DNS resolution time — slow DNS affects everything
- Third-party dependencies — APIs you rely on
- Disk space — especially for logs and uploads
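A disk space check is one of the simpler items on this list to automate. The sketch below uses `shutil.disk_usage` from the standard library; the function names and the 80% threshold are assumptions chosen for illustration.

```python
import shutil


def disk_used_percent(path):
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return usage.used / usage.total * 100


def disk_alert(path, warn_percent=80.0):
    """Return a warning string if usage exceeds the threshold, else None."""
    percent = disk_used_percent(path)
    if percent > warn_percent:
        return f"{path}: {percent:.1f}% used (threshold {warn_percent:.0f}%)"
    return None
```

In practice this would be pointed at the directories that grow, such as the log and upload paths, rather than only at the root filesystem.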
Troubleshooting approach
When something breaks:
- Define the problem — what's actually failing?
- Gather data — logs, metrics, user reports
- Form hypothesis — what could cause this?
- Test systematically — rule out possibilities
- Implement fix — change one thing at a time
- Verify — confirm the issue is resolved
- Document — what happened and how you fixed it
Error codes hub
Common HTTP status codes and typical meanings:
- 403 Forbidden — the server understood the request but refuses to fulfill it; any credentials supplied are not sufficient for access (see detailed 403 guide)
- 404 Not Found — resource doesn't exist at this path
- 500 Internal Server Error — application crash or misconfiguration
- 502 Bad Gateway — upstream server unreachable or returned invalid response
- 503 Service Unavailable — temporary overload or maintenance
- 504 Gateway Timeout — upstream took too long to respond
Each error type requires different diagnostic approaches. See the error codes reference for detailed troubleshooting.
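One way to carry the list above into tooling is a small lookup that tells an on-call engineer where to start. This is only a sketch: the `WHERE_TO_LOOK` table and the `triage` function are hypothetical, and the hints are condensed from the meanings listed above. The status phrases come from Python's standard `http.HTTPStatus`.

```python
from http import HTTPStatus

# Starting points condensed from the list above (illustrative, not exhaustive).
WHERE_TO_LOOK = {
    403: "access rules",
    404: "request path",
    500: "application",
    502: "upstream",
    503: "capacity or maintenance",
    504: "upstream",
}


def triage(status):
    """Return a one-line summary: code, standard phrase, class, first place to look."""
    try:
        phrase = HTTPStatus(status).phrase
    except ValueError:
        phrase = "Unknown"  # not a registered status code
    if 400 <= status <= 499:
        kind = "client error"
    elif 500 <= status <= 599:
        kind = "server error"
    else:
        kind = "not an error"
    area = WHERE_TO_LOOK.get(status, "no specific hint")
    return f"{status} {phrase} ({kind}): start with {area}"
```

For example, `triage(502)` and `triage(504)` both point at the upstream server, which matches the distinction the list draws between gateway errors and a 500 raised by the application itself.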
Log analysis basics
Server logs show:
- Request patterns (URLs, methods, user agents)
- Error occurrences (with timestamps for correlation)
- IP addresses (for abuse detection)
- Response times (performance issues)
Parse logs systematically by splitting each line into named fields, for example:
[timestamp] [IP] [method] [path] [status] [bytes] [response_time]
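The layout above could be parsed with a regular expression along these lines. This is a sketch under stated assumptions: fields are separated by single spaces, the timestamp is ISO 8601 with no embedded spaces, and the response time is in seconds. Real server log formats differ, so the pattern would need adjusting to match the actual configuration.

```python
import re
from datetime import datetime

# Assumed layout: timestamp ip method path status bytes response_time
LINE_RE = re.compile(
    r"^(?P<timestamp>\S+) (?P<ip>\S+) (?P<method>[A-Z]+) (?P<path>\S+) "
    r"(?P<status>\d{3}) (?P<bytes>\d+) (?P<response_time>\d+(?:\.\d+)?)$"
)


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    try:
        timestamp = datetime.fromisoformat(fields["timestamp"])
    except ValueError:
        return None  # timestamp is not in the assumed ISO 8601 form
    return {
        "timestamp": timestamp,
        "ip": fields["ip"],
        "method": fields["method"],
        "path": fields["path"],
        "status": int(fields["status"]),
        "bytes": int(fields["bytes"]),
        "response_time": float(fields["response_time"]),
    }


# Made-up sample line for illustration.
sample = "2024-05-01T12:00:00+00:00 203.0.113.7 GET /api/orders 502 512 1.254"
record = parse_line(sample)
```

Returning `None` for lines that do not match, instead of raising, lets a caller count and inspect malformed lines separately without stopping the whole run.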
Look for patterns: sudden status code spikes, slow endpoints, unusual user agents (bots), and repeated requests from a single IP (attacks or scrapers).
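Several of these patterns can be surfaced with simple counting. The sketch below assumes log lines have already been parsed into dicts with `ip`, `path`, `status`, and `response_time` keys (a hypothetical record shape), and summarizes one time window. The one-second slowness threshold is also an assumption. User agents are left out because the field layout shown above does not include them.

```python
from collections import Counter, defaultdict


def summarize(records, slow_seconds=1.0, top_n=3):
    """Summarize one window of parsed log records."""
    status_counts = Counter(r["status"] for r in records)
    ip_counts = Counter(r["ip"] for r in records)

    # Group response times by path to find slow endpoints.
    times_by_path = defaultdict(list)
    for r in records:
        times_by_path[r["path"]].append(r["response_time"])
    mean_by_path = {p: sum(t) / len(t) for p, t in times_by_path.items()}
    slow_paths = {p: m for p, m in mean_by_path.items() if m > slow_seconds}

    return {
        "status_counts": dict(status_counts),
        "top_ips": ip_counts.most_common(top_n),  # heaviest clients first
        "slow_paths": slow_paths,
    }


# Made-up records for illustration.
records = [
    {"ip": "203.0.113.7", "path": "/api/orders", "status": 502, "response_time": 2.0},
    {"ip": "203.0.113.7", "path": "/api/orders", "status": 200, "response_time": 1.5},
    {"ip": "203.0.113.7", "path": "/", "status": 200, "response_time": 0.1},
    {"ip": "198.51.100.2", "path": "/", "status": 200, "response_time": 0.2},
]
summary = summarize(records)
```

A spike is a change over time, so detecting one would mean running this summary per window (for example per minute) and comparing `status_counts` between consecutive windows.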