Uptime and health checks

Topic: Monitoring basics

Summary

Monitor service availability with HTTP, TCP, or script-based checks from one or more locations. Use this when you need to know when a service is down or degraded and to measure uptime and response time.

Intent: How-to

Quick answer

  • HTTP check: GET a known endpoint (e.g. /health) from outside your network; expect a 200 response and, optionally, a specific body or a latency below a threshold. Run it every 1–5 minutes from multiple regions or probes. Alert on consecutive failures (e.g. 2–3) rather than on a single miss.
  • TCP check: connect to a port (e.g. 443); the check succeeds if the connection is established. Use it when the service does not speak HTTP or when you only care about reachability. Combine it with an HTTP check to cover the full stack.
  • Script or internal check: run a script that logs in, runs a query, or calls an API; its success or failure drives the alert. Use it for deep health (e.g. DB connectivity from the app host). Document what each check validates and which runbook applies when it fails.

Steps

  1. Define health endpoint

    Expose a /health or /ready endpoint that returns 200 when the service is healthy (e.g. DB connected, cache reachable). Return 503 (or another non-200 status) when it is degraded. Keep the check fast and lightweight.
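
A minimal sketch of such an endpoint, using only the Python standard library. The names `check_database`, `check_cache`, `health_status`, and `serve`, the port 8080, and the JSON body shape are all assumptions made for illustration; replace the placeholder checks with cheap real ones for your service.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_database():
    """Hypothetical placeholder: replace with a cheap real query."""
    return True


def check_cache():
    """Hypothetical placeholder: replace with a cheap cache read."""
    return True


def health_status(checks):
    """Run each named check and return (http_status, body_dict).

    A check that raises is treated as failed, so one broken dependency
    yields 503 instead of crashing the handler.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    healthy = all(results.values())
    body = {"status": "ok" if healthy else "degraded", "checks": results}
    return (200 if healthy else 503), body


CHECKS = {"database": check_database, "cache": check_cache}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = health_status(CHECKS)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Serve /health until interrupted (port is an arbitrary choice)."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling `serve()` starts the endpoint. Keeping the decision logic in `health_status` makes it testable without opening a socket.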

  2. Configure external check

    Use an uptime service (e.g. UptimeRobot, Pingdom) or your own probe: GET https://yourapp/health every 1–5 min. Alert after 2–3 consecutive failures; optionally, also alert when latency exceeds a threshold.
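
If you run your own probe, the consecutive-failure rule could look roughly like the sketch below. `probe_http`, `FailureTracker`, and `run_probe` are hypothetical names, and the defaults (5 s timeout, threshold of 3, 60 s interval) are illustrative values within the ranges above, not recommendations from any particular tool.

```python
import http.client
import time
import urllib.request


def probe_http(url, timeout=5.0):
    """Return (ok, latency_seconds); ok is True only for an HTTP 200."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError (e.g. 503), and timeouts are all OSError subclasses.
        ok = False
    return ok, time.monotonic() - start


class FailureTracker:
    """Signal an alert once, when failures reach the threshold in a row."""

    def __init__(self, threshold=3, max_latency=None):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.max_latency = max_latency  # seconds; None disables the latency rule
        self.consecutive_failures = 0

    def record(self, ok, latency):
        """Record one probe result; return True when the alert should fire."""
        too_slow = self.max_latency is not None and latency > self.max_latency
        if ok and not too_slow:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        # Fire only on the probe that reaches the threshold, not on every
        # failure after it, so a long outage does not page repeatedly.
        return self.consecutive_failures == self.threshold


def run_probe(url, interval=60.0, threshold=3, notify=print):
    """Probe forever; `notify` stands in for your real alerting hook."""
    tracker = FailureTracker(threshold=threshold)
    while True:
        ok, latency = probe_http(url)
        if tracker.record(ok, latency):
            notify(f"ALERT: {url} failed {threshold} consecutive checks")
        time.sleep(interval)
```

A single success resets the counter, so an isolated network blip never reaches the threshold.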

  3. TCP and multi-region

    Add a TCP check on port 443 or 80 if you want to test reachability independently of HTTP. Run checks from multiple regions to detect regional outages and to measure latency by region.
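
As a rough illustration, a TCP check is a timed connection attempt, and the multi-region view is a comparison of per-region results. `probe_tcp` and `classify_outage` are hypothetical helpers; the sketch assumes each region runs its own probe and reports a pass/fail value to one place.

```python
import socket
import time


def probe_tcp(host, port, timeout=3.0):
    """Return (reachable, connect_seconds) for one TCP connection attempt."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            reachable = True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        reachable = False
    return reachable, time.monotonic() - start


def classify_outage(region_ok):
    """Classify results given as {region_name: bool}.

    All regions passing means healthy, none passing suggests a global
    outage, and a mix suggests a regional or network-path problem.
    """
    if not region_ok:
        raise ValueError("no region results supplied")
    if all(region_ok.values()):
        return "healthy"
    if not any(region_ok.values()):
        return "global outage"
    return "regional outage"
```

For example, `probe_tcp("yourapp", 443)` run from each region gives both the reachability flag and a per-region connect time.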

  4. Runbook and SLA

    Document what to do when the check fails (e.g. check app logs, the DB, the load balancer). Track uptime against your SLA; report on availability and mean time to recovery (MTTR).
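
To make the reporting concrete, here is one way to compute availability and MTTR from a list of outages. The sketch assumes outages are recorded as non-overlapping (start, end) pairs that fall inside the reporting window, and it treats MTTR as the mean outage duration; the function names are illustrative.

```python
from datetime import datetime, timedelta


def availability_percent(window, outages):
    """Percentage of `window` (a timedelta) not covered by outages."""
    downtime = sum((end - start for start, end in outages), timedelta())
    return 100.0 * (1 - downtime / window)


def mean_time_to_recovery(outages):
    """Mean outage duration; zero when there were no outages."""
    if not outages:
        return timedelta()
    total = sum((end - start for start, end in outages), timedelta())
    return total / len(outages)


# Illustrative data: two outages in a 30-day window, 43.2 minutes in total.
example_outages = [
    (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 30)),
    (datetime(2024, 1, 17, 2, 0), datetime(2024, 1, 17, 2, 13, 12)),
]
example_window = timedelta(days=30)
```

With this sample data, 43.2 minutes of downtime in 43,200 minutes works out to 99.9% availability, and the MTTR is 21 minutes 36 seconds.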

Summary

Use HTTP or TCP health checks from external probes to monitor availability; alert on consecutive failures and optionally on latency. Use this to measure uptime and to get notified when a service is down.

Prerequisites

None.

Steps

Step 1: Define health endpoint

Expose a /health or /ready endpoint that returns 200 when healthy and 503 when degraded.

Step 2: Configure external check

Set up an uptime check that hits the endpoint periodically; alert after N consecutive failures.

Step 3: TCP and multi-region

Add TCP checks and run them from multiple regions for reachability and latency insight.

Step 4: Runbook and SLA

Document the response to failures; track uptime against your SLA.

Verification

  • The check runs on schedule.
  • The alert fires when the service is down.
  • The runbook is followed and MTTR is tracked.

Troubleshooting

  • False positives: increase the consecutive-failure count; check the probe's location and network.
  • Check passes but the app is broken: deepen the health check (e.g. DB query, cache read) and keep it fast.
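
One possible shape for a deeper but still fast check is a trivial query with a time budget. The sketch below uses `sqlite3` purely as a stand-in for your real database driver; `deep_db_check` and the 0.5 s budget are assumptions, not values from any specific system.

```python
import sqlite3
import time


def deep_db_check(conn, budget=0.5):
    """Return (healthy, elapsed_seconds) for a trivial round-trip query.

    The check fails if the query errors, returns an unexpected row, or
    takes longer than `budget` seconds.
    """
    start = time.monotonic()
    try:
        row = conn.execute("SELECT 1").fetchone()
        ok = row == (1,)
    except sqlite3.Error:
        ok = False
    elapsed = time.monotonic() - start
    return ok and elapsed <= budget, elapsed
```

Such a function can be plugged into the health endpoint as one of its dependency checks, so a broken database turns the response into a 503 instead of a misleading 200.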

Next steps

Continue to