Uptime and health checks
Topic: Monitoring basics
Summary
Monitor service availability with HTTP, TCP, or script-based checks run from one or more locations. Use this approach to detect when a service is down or degraded and to measure uptime and response time.
Intent: How-to
Quick answer
- HTTP check: GET a known endpoint (e.g. /health) from outside your network; expect a 200 response and optionally validate the body or the latency. Run it every 1–5 minutes from multiple regions or probes. Alert on consecutive failures (e.g. 2–3) rather than on a single miss.
- TCP check: connect to a port (e.g. 443); the check succeeds if the connection is established. Use it when the service has no HTTP interface or when you only care about reachability. Combine it with an HTTP check to cover the full stack.
- Script or internal check: run a script that logs in, runs a query, or calls an API; its success or failure drives the alert. Use it for deep health (e.g. DB connectivity from the app host). Document what each check validates and which runbook applies when it fails.
Steps
- Define health endpoint: Expose /health or /ready so that it returns 200 when the service is healthy (e.g. DB connected, cache reachable) and 503 or another non-200 status when it is degraded. Keep the check fast and lightweight.
- Configure external check: Use an uptime service (e.g. UptimeRobot, Pingdom) or your own probe to GET https://yourapp/health every 1–5 minutes. Alert after 2–3 consecutive failures; optionally alert when latency exceeds a threshold.
- TCP and multi-region: Add a TCP check to port 443 or 80 if you want to verify reachability independently of HTTP. Run checks from multiple regions to detect regional outages and to measure latency by region.
- Runbook and SLA: Document what to do when the check fails (e.g. check app logs, DB, load balancer). Track uptime against the SLA; report on availability and MTTR (mean time to recovery).
Summary
Use HTTP or TCP health checks from external probes to monitor availability; alert on consecutive failures and optionally on latency. Use this to measure uptime and to get notified when a service is down.
Prerequisites
None.
Steps
Step 1: Define health endpoint
Expose a /health or /ready endpoint that returns 200 when healthy and 503 when degraded.
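As a rough illustration of this step, the sketch below serves /health and /ready using only the Python standard library. The check_dependencies function, the JSON body shape, and port 8080 are assumptions made for the example, not part of any particular framework; a real service would replace the stub with fast DB and cache pings.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_dependencies() -> dict:
    """Hypothetical stub: replace with quick, real DB and cache pings."""
    return {"db": True, "cache": True}


def health_response(deps: dict) -> tuple:
    """Map dependency states to (status_code, json_body)."""
    healthy = all(deps.values())
    status = 200 if healthy else 503
    body = json.dumps({"status": "ok" if healthy else "degraded", "checks": deps})
    return status, body


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path not in ("/health", "/ready"):
            self.send_error(404)
            return
        status, body = health_response(check_dependencies())
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # keep frequent probe traffic out of the access log


def serve(port: int = 8080) -> None:
    """Block forever, answering health probes on the given port."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling serve() starts the endpoint. Keeping the status decision in health_response, separate from the handler, makes the 200/503 logic easy to test without opening a socket.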
Step 2: Configure external check
Set up an uptime check that hits the endpoint periodically; alert after N consecutive failures.
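If you run your own probe instead of a hosted uptime service, the consecutive-failure rule might look like the sketch below. The names probe, FailureTracker, and monitor are hypothetical, the 60-second interval and threshold of 3 are example values, and the print call stands in for whatever alerting channel you use.

```python
import http.client
import time
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET to url answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Covers non-2xx responses (HTTPError), DNS and connection
        # failures, timeouts, and malformed HTTP replies.
        return False


class FailureTracker:
    """Counts consecutive failures; fires once when the threshold is reached."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.consecutive = 0

    def record(self, ok: bool) -> bool:
        """Record one check result; return True only when the alert should fire."""
        if ok:
            self.consecutive = 0
            return False
        self.consecutive += 1
        return self.consecutive == self.threshold


def monitor(url: str, interval_s: float = 60.0, threshold: int = 3) -> None:
    """Probe url forever, alerting after `threshold` failures in a row."""
    tracker = FailureTracker(threshold)
    while True:
        if tracker.record(probe(url)):
            print(f"ALERT: {url} failed {threshold} checks in a row")
        time.sleep(interval_s)
```

Because record returns True only at the moment the count reaches the threshold, a long outage produces one alert rather than one per check; a successful check resets the counter.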
Step 3: TCP and multi-region
Add TCP checks and run checks from multiple regions for reachability and per-region latency insight.
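A TCP reachability check can be as small as the following sketch, which uses Python's socket.create_connection. The tcp_check name and the 3-second timeout are assumptions for illustration. The code measures connect time from wherever it runs, so per-region latency would come from running the same function on a probe in each region.

```python
import socket
import time


def tcp_check(host: str, port: int, timeout: float = 3.0) -> tuple:
    """Return (ok, latency_seconds); latency is None when the connect fails."""
    start = time.monotonic()
    try:
        # Success means the TCP handshake completed; nothing is sent or read.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution errors.
        return False, None
```

For example, tcp_check("yourapp", 443) reports whether port 443 accepts connections. It says nothing about whether the application behind the port is answering requests correctly, which is why it is paired with the HTTP check.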
Step 4: Runbook and SLA
Document the response to failures; track uptime against the SLA.
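For the tracking part, availability and MTTR can be derived from a list of recorded outages. The sketch below assumes each outage is stored as a (start, end) pair of datetimes, that outages do not overlap, and that MTTR is approximated as the mean outage duration; the function names are hypothetical.

```python
from datetime import datetime, timedelta


def total_downtime(outages: list) -> timedelta:
    """Sum the durations of (start, end) outage pairs (assumed non-overlapping)."""
    return sum((end - start for start, end in outages), timedelta())


def availability_pct(window: timedelta, outages: list) -> float:
    """Percentage of the reporting window during which the service was up."""
    return 100.0 * (1 - total_downtime(outages) / window)


def mttr(outages: list) -> timedelta:
    """Mean outage duration; zero when there were no outages."""
    if not outages:
        return timedelta()
    return total_downtime(outages) / len(outages)


# Example: two outages (30 and 90 minutes) in a 30-day window.
t0 = datetime(2024, 1, 1)
example_outages = [
    (t0, t0 + timedelta(minutes=30)),
    (t0 + timedelta(days=2), t0 + timedelta(days=2, minutes=90)),
]
print(round(availability_pct(timedelta(days=30), example_outages), 3))  # 99.722
print(mttr(example_outages))  # 1:00:00
```

Two hours of downtime in a 720-hour window gives roughly 99.72% availability, which can then be compared with the SLA target.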
Verification
- The check runs on schedule, the alert fires when the service is down, the runbook is followed, and MTTR is tracked.
Troubleshooting
- False positives: Increase the consecutive failure count; check the probe location and network.
- Check passes but app is broken: Deepen the health check (e.g. DB query, cache read) while keeping it fast.
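One way to deepen a check while keeping it fast is to run a trivial query and enforce a time budget, as in the sketch below. SQLite stands in for the real database here, and the deep_check name and the 0.5-second budget are assumptions for the example.

```python
import sqlite3
import time


def deep_check(conn: sqlite3.Connection, budget_s: float = 0.5) -> dict:
    """Run a trivial query and report whether it succeeded within the budget."""
    start = time.monotonic()
    try:
        db_ok = conn.execute("SELECT 1").fetchone() == (1,)
    except sqlite3.Error:
        db_ok = False
    elapsed = time.monotonic() - start
    return {
        "db_ok": db_ok,
        "elapsed_s": elapsed,
        # A slow but successful query still counts as unhealthy.
        "healthy": db_ok and elapsed <= budget_s,
    }
```

The health endpoint can return 200 only when "healthy" is true, so a database that is down or unusually slow is reported as degraded rather than silently passing.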