Monitoring checklist (before go-live)

Use this checklist before putting a system into production. It covers six areas: metrics are collected, key alerts are defined, logs are centralized, health checks are in place, runbooks are written, and the on-call team knows how to respond. Completing it ensures you can detect and respond to incidents.

Quick answer

Metrics: CPU, memory, disk, and application-specific metrics are collected and visible.
Alerts: disk-full, high-CPU, low-memory, and service-down alerts are configured, each with a threshold and a for clause (how long the condition must hold before the alert fires).
Logs: key logs are forwarded and searchable.
Health: an HTTP or TCP health check runs from outside the system, and a failure raises an alert.
Runbooks: each alert type has a runbook (what to check, what to do).
On-call: the team knows how to receive and acknowledge alerts, and the escalation path is clear.
Test: trigger a test alert and confirm that it fires and is received; walk through the runbook once.
Document: record where dashboards and runbooks live; review and update them after incidents.
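
The threshold and for clause mentioned under Alerts can be illustrated with a minimal sketch. It is not tied to any particular monitoring tool: the class name `ThresholdAlert`, the 90% threshold, and the 300-second duration are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ThresholdAlert:
    """Fires only after the value has stayed above `threshold` for `for_seconds`."""

    threshold: float
    for_seconds: float
    # Timestamp at which the current breach began (None when not breaching).
    _breach_started: Optional[float] = field(default=None, init=False)

    def observe(self, value: float, now: float) -> bool:
        """Record one sample and return True if the alert should be firing."""
        if value <= self.threshold:
            self._breach_started = None  # condition cleared, reset the timer
            return False
        if self._breach_started is None:
            self._breach_started = now  # first sample over the threshold
        return now - self._breach_started >= self.for_seconds


# Hypothetical rule: disk usage above 90% for at least 300 seconds.
disk_alert = ThresholdAlert(threshold=90.0, for_seconds=300)

# (timestamp in seconds, disk usage in percent); illustrative values only.
samples = [(0, 95.0), (120, 96.0), (240, 85.0), (360, 97.0), (660, 98.0)]
states = [disk_alert.observe(value, now=t) for t, value in samples]
print(states)  # [False, False, False, False, True]
```

The dip at 240 seconds resets the timer, so the alert fires only once the breach has lasted a full 300 seconds. This is why a for clause reduces noise from short spikes.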

Prerequisites

Steps

Metrics and alerts

Confirm that metrics are being scraped or sent and that dashboards show the key ones. Alerts for disk, CPU, memory, and service availability must be defined and routed. There should be no critical gap (for example, a disk that is not monitored).

Logs and health

Logs from critical services are centralized and searchable. A health check is configured and alerts on failure, and a runbook for 'service down' exists.

Runbooks and on-call

Every alert type has a runbook (or a link to a document). An on-call rotation or a primary responder is set, and the escalation path is documented. A test alert is received and acknowledged.

Test and document

Trigger a test alert, then verify that it is received and that the runbook can be followed. Document where to find dashboards, runbooks, and the escalation path. Schedule a periodic review (e.g. quarterly) to update thresholds and runbooks.

Summary

This checklist covers metrics, alerts, logs, health checks, runbooks, and on-call readiness before go-live. Use it so that production is observable and incidents can be triaged and resolved.

Prerequisites

Steps

Step 1: Metrics and alerts

Verify metrics and alerts cover disk, CPU, memory, and service availability; dashboards are in place.
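
One way to make the coverage check in this step repeatable is a small script that compares the configured alerts against the required categories. The sketch below assumes alerts can be exported as a list of dictionaries; the field names `category` and `route` and the alert names are hypothetical, not taken from any specific tool.

```python
# Categories this checklist requires an alert for.
REQUIRED_CATEGORIES = {"disk", "cpu", "memory", "availability"}


def find_alert_gaps(alerts):
    """Return the required categories that have no routed alert, sorted."""
    # An alert only counts as coverage if it is routed somewhere.
    covered = {alert["category"] for alert in alerts if alert.get("route")}
    return sorted(REQUIRED_CATEGORIES - covered)


# Hypothetical export of configured alerts.
configured = [
    {"name": "DiskAlmostFull", "category": "disk", "route": "pager"},
    {"name": "HighCPU", "category": "cpu", "route": "pager"},
    {"name": "LowMemory", "category": "memory", "route": None},  # defined, not routed
]

gaps = find_alert_gaps(configured)
print(gaps)  # ['availability', 'memory']
```

Here the memory alert exists but has no route, so it is reported as a gap alongside the missing availability alert.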

Step 2: Logs and health

Confirm logs are centralized; health check and its runbook are in place.
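
As a rough illustration of the health check in this step, the sketch below probes a service over TCP and HTTP using only the Python standard library. The host, port, and `/healthz` path are placeholders, and to match the checklist the probe should run from a machine outside the system being checked.

```python
import http.client
import socket
import urllib.error
import urllib.request


def tcp_check(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_check(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a 2xx status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection errors, non-2xx responses, and malformed URLs all count as down.
        return False


if __name__ == "__main__":
    # Placeholder targets; replace with the real service address.
    print("tcp ok:", tcp_check("localhost", 8080))
    print("http ok:", http_check("http://localhost:8080/healthz"))
```

A failed check would then feed the 'service down' alert, which is why that alert needs its own runbook.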

Step 3: Runbooks and on-call

Ensure each alert has a runbook; on-call and escalation are set; test alert delivery.
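
The requirement that each alert has a runbook can be checked mechanically. The following sketch assumes a simple mapping from alert name to runbook location; the alert names and file paths are invented for illustration.

```python
def alerts_without_runbook(alert_names, runbooks):
    """Return alert names that have no runbook entry, or only an empty one."""
    return [name for name in alert_names if not runbooks.get(name, "").strip()]


# Hypothetical alert names and runbook locations.
alert_names = ["DiskAlmostFull", "HighCPU", "LowMemory", "ServiceDown"]
runbooks = {
    "DiskAlmostFull": "runbooks/disk-almost-full.txt",
    "HighCPU": "runbooks/high-cpu.txt",
    "ServiceDown": "",  # entry exists but points nowhere
}

missing = alerts_without_runbook(alert_names, runbooks)
print(missing)  # ['LowMemory', 'ServiceDown']
```

Treating an empty entry as missing catches placeholders that were never filled in.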

Step 4: Test and document

Trigger a test alert; document where to find dashboards and runbooks; schedule reviews.

Verification

All checklist items are done, a test alert has been received and its runbook followed, and the team knows how to respond.
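
The verification can be expressed as a simple go-live gate that lists whatever is still open. This is only a sketch: the item names below are shorthand for the checklist areas and are not an established schema.

```python
# Shorthand names for the checklist areas (illustrative).
CHECKLIST_ITEMS = [
    "metrics",
    "alerts",
    "logs",
    "health_check",
    "runbooks",
    "on_call",
    "test_alert",
]


def go_live_blockers(status):
    """Return checklist items that are missing or not marked done, in order."""
    return [item for item in CHECKLIST_ITEMS if not status.get(item, False)]


# Hypothetical status sheet; "test_alert" has not been recorded at all.
status = {
    "metrics": True,
    "alerts": True,
    "logs": True,
    "health_check": True,
    "runbooks": False,
    "on_call": True,
}

blockers = go_live_blockers(status)
if blockers:
    print("Blocked by: " + ", ".join(blockers))
else:
    print("Ready for go-live")
```

An item that is absent from the status sheet is treated as not done, so a forgotten test alert blocks go-live just like an unfinished runbook.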

Troubleshooting

Alert not received: check the routing and notification configuration, then test the notification channel.
Runbook missing: add at least a minimal runbook (what to check, who to escalate to) before go-live.
