Monitoring checklist (before go-live)
Topic: Monitoring basics
Summary
Use this checklist before putting a system into production. Confirm that metrics are collected, key alerts are defined, logs are centralized, health checks are in place, runbooks are written, and the on-call team knows how to respond. Completing it ensures you can detect and respond to incidents.
Intent: Checklist
Quick answer
- Metrics: CPU, memory, disk, and app-specific metrics are collected and visible on dashboards.
- Alerts: disk-full, high-CPU, low-memory, and service-down alerts are configured with thresholds and a for clause (how long the condition must hold before the alert fires).
- Logs: key logs are forwarded and searchable.
- Health: an HTTP or TCP health check runs from outside the system; alert on failure.
- Runbooks: each alert type has a runbook (what to check, what to do).
- On-call: the team knows how to receive and acknowledge alerts; the escalation path is clear.
- Test: trigger a test alert and confirm it fires and is received; run through the runbook once.
- Document: record where dashboards and runbooks live; review and update them after incidents.
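The threshold and for clause mentioned under Alerts can be illustrated with a small sketch. It is not tied to any particular monitoring tool: the class name PendingAlert, the 90% threshold, and the 300-second duration are all assumptions chosen for illustration. The idea shown is only that a breach must persist for the full duration before the alert fires, so brief spikes are ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingAlert:
    """Hypothetical alert that fires only after a sustained threshold breach."""
    threshold: float                        # e.g. 90.0 for "disk usage above 90%"
    for_seconds: float                      # how long the breach must persist
    breach_started: Optional[float] = None  # time of the first breaching sample

    def observe(self, value: float, now: float) -> bool:
        """Record one sample taken at time `now`; return True if firing."""
        if value <= self.threshold:
            # Condition cleared: reset the timer so short spikes never fire.
            self.breach_started = None
            return False
        if self.breach_started is None:
            self.breach_started = now
        return now - self.breach_started >= self.for_seconds


# Illustrative samples as (seconds since start, disk usage percent).
alert = PendingAlert(threshold=90.0, for_seconds=300)
samples = [(0, 95.0), (60, 96.0), (120, 80.0), (180, 97.0), (480, 98.0)]
states = [alert.observe(value, now) for now, value in samples]
print(states)
```

In this run the dip at 120 seconds resets the timer, so the alert only fires at the last sample, once the breach that began at 180 seconds has lasted 300 seconds.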
Prerequisites
Steps
- Metrics and alerts
Confirm that metrics are being scraped or sent and that dashboards show the key metrics. Alerts for disk, CPU, memory, and service availability are defined and routed. Check that no critical resource is left unmonitored (e.g. disk).
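One way to look for coverage gaps is to compare the alerts you have defined against the resources this checklist requires. The sketch below assumes a made-up in-memory representation of alerts (dictionaries with name, resource, and receiver keys) and hypothetical alert and receiver names; a real setup would read these from the monitoring system's own configuration.

```python
# Resources this checklist expects to be covered by at least one routed alert.
REQUIRED_RESOURCES = {"disk", "cpu", "memory", "availability"}


def find_alert_gaps(alerts):
    """Return (missing_resources, unrouted_alert_names) for a list of alerts.

    An alert only counts as coverage if it has a non-empty receiver.
    """
    covered = {a["resource"] for a in alerts if a.get("receiver")}
    unrouted = sorted(a["name"] for a in alerts if not a.get("receiver"))
    missing = sorted(REQUIRED_RESOURCES - covered)
    return missing, unrouted


# Hypothetical alert definitions, for illustration only.
example_alerts = [
    {"name": "DiskAlmostFull", "resource": "disk", "receiver": "oncall-pager"},
    {"name": "HighCPU", "resource": "cpu", "receiver": "oncall-pager"},
    {"name": "LowMemory", "resource": "memory", "receiver": ""},
]

missing, unrouted = find_alert_gaps(example_alerts)
print("Resources without a routed alert:", missing)
print("Alerts with no receiver:", unrouted)
```

Here the memory alert exists but is not routed, and no availability alert exists at all, so both resources are reported as gaps.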
- Logs and health
Logs from critical services are centralized and searchable. A health check is configured and alerts on failure; a runbook for 'service down' exists.
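A TCP health check can be as small as attempting a connection with a timeout. The following is a minimal sketch using only the Python standard library; the function name tcp_health_check and the 3-second default timeout are assumptions, and a production check would normally run from a separate host or monitoring service so that it tests the path from outside.

```python
import socket


def tcp_health_check(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        # create_connection resolves the host and opens the socket;
        # the `with` block closes it again immediately.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

For example, calling tcp_health_check with your service's host and port returns False when the port is closed or unreachable, which is the condition the 'service down' alert should be raised on.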
- Runbooks and on-call
Every alert type has a runbook (or a link to the relevant doc). An on-call rotation or primary responder is set, and the escalation path is documented. A test alert is received and acknowledged.
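Runbook coverage is easy to check mechanically if you keep a mapping from alert name to runbook location. The sketch below assumes such a mapping exists as a plain dictionary; the alert names and the runbook path are hypothetical.

```python
def missing_runbooks(alert_names, runbooks):
    """Return the alert names that have no runbook, or only a blank entry."""
    return sorted(
        name for name in alert_names
        if not runbooks.get(name, "").strip()
    )


# Hypothetical alert names and runbook locations, for illustration only.
alert_names = ["DiskAlmostFull", "HighCPU", "ServiceDown"]
runbooks = {
    "DiskAlmostFull": "wiki/runbooks/disk-full",
    "ServiceDown": "   ",  # present but blank, so it counts as missing
}

print(missing_runbooks(alert_names, runbooks))
```

Treating a blank entry as missing is a deliberate choice here: a placeholder link gives the on-call responder nothing to act on.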
- Test and document
Trigger a test alert; verify that it is received and that the runbook can be followed. Document where to find dashboards, runbooks, and the escalation path. Schedule a periodic review (e.g. quarterly) to update thresholds and runbooks.
Summary
This checklist covers metrics, alerts, logs, health checks, runbooks, and on-call readiness before go-live. Use it so that production is observable and incidents can be triaged and resolved.
Steps
Step 1: Metrics and alerts
Verify that metrics and alerts cover disk, CPU, memory, and service availability, and that dashboards are in place.
Step 2: Logs and health
Confirm that logs are centralized and that the health check and its runbook are in place.
Step 3: Runbooks and on-call
Ensure each alert has a runbook, on-call and escalation are set, and test alert delivery works.
Step 4: Test and document
Trigger a test alert; document where to find dashboards and runbooks; schedule reviews.
Verification
- All checklist items are done; a test alert is received and its runbook is followed; the team knows how to respond.
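If you want the verification to be explicit rather than informal, the checklist can be tracked as data and go-live blocked while anything is outstanding. This is only a sketch: the ChecklistItem type, the item wording, and the done values are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    name: str
    done: bool


def outstanding(items):
    """Return the names of checklist items that are not yet done."""
    return [item.name for item in items if not item.done]


# Illustrative status; in practice each flag would be set after manual review.
checklist = [
    ChecklistItem("metrics collected and dashboards visible", True),
    ChecklistItem("alerts defined and routed", True),
    ChecklistItem("logs centralized and searchable", True),
    ChecklistItem("health check alerting", True),
    ChecklistItem("runbook for every alert", False),
    ChecklistItem("test alert received and acknowledged", False),
]

blockers = outstanding(checklist)
ready = not blockers  # go-live only when nothing is outstanding
print("Ready for go-live:", ready)
print("Blockers:", blockers)
```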
Troubleshooting
- Alert not received: check the routing and notification config, then test the channel.
- Runbook missing: add at least a minimal runbook (what to check, who to escalate to) before go-live.