Monitoring fundamentals

Effective monitoring requires:

  1. Availability checks — is the service up?
  2. Performance metrics — response times, resource usage
  3. Error rates — HTTP 4xx/5xx trends
  4. Business metrics — successful transactions, user flows

Alert on symptoms, not causes. Alert on what users experience (slow pages, errors) rather than on low-level metrics (CPU at 70%) that may not affect the service.
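
A symptom-based alert can be sketched as a check on the user-visible error rate over a time window. This is a minimal illustration, not a recommended policy: the WindowStats type, the 1% threshold, and the 100-request minimum are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Request counts for one evaluation window (for example, the last 5 minutes)."""
    total_requests: int
    server_errors: int  # responses with a 5xx status


def should_alert(stats: WindowStats,
                 max_error_rate: float = 0.01,
                 min_requests: int = 100) -> bool:
    """Return True when the user-visible error rate exceeds the threshold.

    min_requests (an assumed value) avoids alerting on tiny samples,
    where a single failed request would look like a large error rate.
    """
    if stats.total_requests < min_requests:
        return False
    error_rate = stats.server_errors / stats.total_requests
    return error_rate > max_error_rate
```

The check fires on what users see (failed requests) and says nothing about CPU or memory, which stay useful for diagnosis once the alert has fired.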

What to monitor

Essential metrics

  • HTTP status codes — track 4xx and 5xx error rates
  • Response time — p50, p95, p99 percentiles (not just average)
  • Availability — synthetic monitoring from multiple locations
  • SSL/TLS certificate expiry — alert at least 30 days before a certificate expires
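
To see why percentiles matter more than the average, here is a small sketch using the nearest-rank method. The sample response times are made up for illustration, and other percentile definitions (for example, interpolating ones) can give slightly different values.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical response times in milliseconds: 98 fast requests and 2 very slow ones.
response_times_ms = [100.0] * 98 + [5000.0] * 2

mean_ms = sum(response_times_ms) / len(response_times_ms)  # 198.0
p50 = percentile(response_times_ms, 50)  # 100.0
p95 = percentile(response_times_ms, 95)  # 100.0
p99 = percentile(response_times_ms, 99)  # 5000.0
```

The mean (198 ms) matches no request that actually happened, while p99 shows that the slowest requests take 5 seconds.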

Optional but useful

  • CDN cache hit ratio — higher is better (less origin load)
  • DNS resolution time — slow DNS affects everything
  • Third-party dependencies — APIs you rely on
  • Disk space — especially for logs and uploads
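
As one example from this list, a disk space check can be sketched with Python's standard library. The 90% threshold and the function names are assumptions for illustration; a real check would usually run on a schedule and report to whatever monitoring system is in use.

```python
import shutil


def disk_used_percent(path: str = "/") -> float:
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return usage.used / usage.total * 100


def disk_is_low(path: str = "/", max_used_percent: float = 90.0) -> bool:
    """True when usage reaches the warning threshold (90% is an assumed default)."""
    return disk_used_percent(path) >= max_used_percent
```

Pointing `path` at the directories that hold logs and uploads checks the filesystems most likely to fill up.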

Troubleshooting approach

When something breaks:

  1. Define the problem — what's actually failing?
  2. Gather data — logs, metrics, user reports
  3. Form a hypothesis — what could cause this?
  4. Test systematically — rule out possibilities
  5. Implement a fix — change one thing at a time
  6. Verify — confirm the issue is resolved
  7. Document — what happened and how you fixed it

Error codes hub

Common HTTP status codes and typical meanings:

  • 403 Forbidden — the server understood the request but refuses to authorize it; any credentials supplied were not sufficient for this resource (see detailed 403 guide)
  • 404 Not Found — resource doesn't exist at this path
  • 500 Internal Server Error — application crash or misconfiguration
  • 502 Bad Gateway — upstream server unreachable or returned invalid response
  • 503 Service Unavailable — temporary overload or maintenance
  • 504 Gateway Timeout — upstream took too long to respond

Each error type requires different diagnostic approaches. See the error codes reference for detailed troubleshooting.
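
The idea that each code points to a different first check can be sketched as a lookup table. The "first check" hints below are illustrative assumptions, not a substitute for the detailed reference, and the triage function is a hypothetical helper.

```python
# Status names follow the list above; the first-check hints are illustrative assumptions.
STATUS_HINTS: dict[int, tuple[str, str]] = {
    403: ("Forbidden", "permissions and access rules"),
    404: ("Not Found", "the URL path and routing rules"),
    500: ("Internal Server Error", "the application error log and recent changes"),
    502: ("Bad Gateway", "whether the upstream server is running and reachable"),
    503: ("Service Unavailable", "load, capacity limits, and maintenance mode"),
    504: ("Gateway Timeout", "upstream latency and proxy timeout settings"),
}


def triage(status: int) -> str:
    """Return a one-line starting point for diagnosing an HTTP status code."""
    if status in STATUS_HINTS:
        name, first_check = STATUS_HINTS[status]
        return f"{status} {name}: check {first_check}"
    if 400 <= status < 500:
        return f"{status}: client error, check the request itself"
    if 500 <= status < 600:
        return f"{status}: server error, check server-side logs"
    return f"{status}: not an error status"
```

For example, `triage(504)` suggests looking at upstream latency first, while `triage(500)` points at the application's own logs.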

Log analysis basics

Server logs show:

  • Request patterns (URLs, methods, user agents)
  • Error occurrences (with timestamps for correlation)
  • IP addresses (for abuse detection)
  • Response times (performance issues)

Parse logs systematically:

[timestamp] [IP] [method] [path] [status] [bytes] [response_time]
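
A parser for this layout might look like the sketch below. It assumes every field is literally wrapped in square brackets as shown, that status and bytes are integers, and that the response time is a number in seconds; real server log formats differ, so the pattern would need adjusting. LogEntry and parse_line are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumes each field is wrapped in square brackets, as in the layout above.
FIELD_RE = re.compile(r"\[([^\]]*)\]")


@dataclass
class LogEntry:
    timestamp: str
    ip: str
    method: str
    path: str
    status: int
    bytes_sent: int
    response_time: float  # unit assumed to be seconds


def parse_line(line: str) -> Optional[LogEntry]:
    """Parse one log line; return None if it does not match the expected layout."""
    fields = FIELD_RE.findall(line)
    if len(fields) != 7:
        return None
    timestamp, ip, method, path, status, size, duration = fields
    try:
        return LogEntry(timestamp, ip, method, path,
                        int(status), int(size), float(duration))
    except ValueError:
        return None  # non-numeric status, bytes, or response time


# Made-up sample line for illustration.
sample = "[2024-05-01T12:00:00Z] [203.0.113.7] [GET] [/checkout] [502] [312] [1.84]"
entry = parse_line(sample)
```

Returning None for malformed lines lets a caller count and skip them instead of stopping on the first bad record.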

Look for patterns: sudden status code spikes, slow endpoints, unusual user agents (bots), repeated requests from single IPs (attacks or scrapers).
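
These pattern checks can be sketched with simple counting. The entries, thresholds, and function names below are made up for illustration; in practice the rows would come from parsed log lines and the thresholds would be tuned to normal traffic.

```python
from collections import Counter

# Hypothetical parsed entries as (ip, path, status, response_time_seconds) tuples.
entries = [
    ("203.0.113.7", "/login", 200, 0.12),
    ("203.0.113.7", "/login", 401, 0.10),
    ("203.0.113.7", "/login", 401, 0.11),
    ("203.0.113.7", "/login", 401, 0.09),
    ("198.51.100.4", "/search", 200, 2.40),
    ("198.51.100.9", "/search", 200, 3.10),
    ("192.0.2.15", "/", 200, 0.05),
]


def busiest_ips(rows, min_requests=3):
    """IPs with at least min_requests requests, most active first."""
    counts = Counter(ip for ip, _, _, _ in rows)
    return [(ip, n) for ip, n in counts.most_common() if n >= min_requests]


def slow_paths(rows, max_avg_seconds=1.0):
    """Paths whose average response time exceeds max_avg_seconds."""
    totals, hits = Counter(), Counter()
    for _, path, _, seconds in rows:
        totals[path] += seconds
        hits[path] += 1
    return sorted(p for p in totals if totals[p] / hits[p] > max_avg_seconds)


def status_counts(rows):
    """Number of responses per status code."""
    return Counter(status for _, _, status, _ in rows)
```

On this sample, `busiest_ips(entries)` singles out 203.0.113.7 with four requests to /login, and `slow_paths(entries)` flags /search.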