Monitoring checklist (before go-live)

Topic: Monitoring basics

Summary

Use this checklist before putting a system into production. Confirm that metrics are collected, key alerts are defined, logs are centralized, health checks are in place, runbooks are written, and the on-call team knows how to respond. Completing it ensures you can detect and respond to incidents.

Intent: Checklist

Quick answer

  • Metrics: CPU, memory, disk, and app-specific metrics are collected and visible.
  • Alerts: disk-full, high-CPU, low-memory, and service-down alerts are configured, each with a threshold and a for-clause (how long the condition must hold before the alert fires).
  • Logs: key logs are forwarded and searchable.
  • Health: an HTTP or TCP health check runs from outside the system; alert on failure.
  • Runbooks: each alert type has a runbook (what to check, what to do).
  • On-call: the team knows how to receive and acknowledge alerts; the escalation path is clear.
  • Test: trigger a test alert and confirm it fires and is received; run through the runbook once.
  • Document: record where dashboards and runbooks live; review and update them after incidents.

Prerequisites

Steps

  1. Metrics and alerts

    Confirm that metrics are scraped or sent and that dashboards show the key metrics. Alerts for disk, CPU, memory, and service availability are defined and routed. Check that there is no critical gap (e.g. a disk that is not monitored).

  2. Logs and health

    Logs from critical services are centralized and searchable. A health check is configured and alerts on failure; a runbook for 'service down' exists.

  3. Runbooks and on-call

    Every alert type has a runbook (or a link to a doc). An on-call rotation or primary responder is set; the escalation path is documented. A test alert is received and acknowledged.

  4. Test and document

    Trigger a test alert; verify that it is received and that the runbook can be followed. Document where to find dashboards, runbooks, and the escalation path. Schedule a periodic review (e.g. quarterly) to update thresholds and runbooks.

Summary

Checklist for metrics, alerts, logs, health checks, runbooks, and on-call before go-live. Use this so production is observable and incidents can be triaged and resolved.

Steps

Step 1: Metrics and alerts

Verify metrics and alerts cover disk, CPU, memory, and service availability; dashboards are in place.
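
As an illustration of a threshold combined with a for-clause, the sketch below fires only once a value has stayed above its threshold for a set duration. The class name, the 90% disk threshold, and the 300-second window are assumptions chosen for the example, not values from this checklist or from any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdAlert:
    """Fires only after the value stays above `threshold` for `for_seconds`."""
    name: str
    threshold: float
    for_seconds: float
    _breach_started: Optional[float] = None

    def observe(self, value: float, now: float) -> bool:
        """Record one sample; return True if the alert is firing."""
        if value <= self.threshold:
            self._breach_started = None  # condition cleared, reset the timer
            return False
        if self._breach_started is None:
            self._breach_started = now  # first sample over the threshold
        return now - self._breach_started >= self.for_seconds


# Hypothetical alert: disk usage above 90% for 5 minutes.
disk_full = ThresholdAlert(name="disk_full", threshold=90.0, for_seconds=300)

# (timestamp in seconds, disk usage in percent)
samples = [(0, 95.0), (120, 96.0), (300, 97.0)]
states = [disk_full.observe(value, now) for now, value in samples]
print(states)  # [False, False, True]
```

The duration guard is what keeps a short spike from paging anyone; a sample back under the threshold resets the timer.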

Step 2: Logs and health

Confirm logs are centralized; health check and its runbook are in place.
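
A minimal sketch of an external health probe, using only the Python standard library. The function names and the 3-second default timeout are assumptions for illustration; a real setup would more likely rely on the probing built into the monitoring system in use.

```python
import http.client
import socket
import urllib.error
import urllib.request


def tcp_check(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def http_check(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a 2xx status within timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # urlopen raises HTTPError (a URLError) for 4xx/5xx responses and
        # ValueError for a malformed URL; all are treated as unhealthy.
        return False
```

Run the probe from a host outside the system being checked, for example on a schedule, and raise the 'service down' alert when a call such as tcp_check("db.internal.example", 5432) returns False (the host name here is hypothetical).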

Step 3: Runbooks and on-call

Ensure each alert has a runbook; on-call and escalation are set; test alert delivery.
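
One way to check runbook coverage is to compare the list of alert names against a mapping of alert name to runbook location. The sketch below assumes both are available as plain Python data; the alert names and file paths are hypothetical.

```python
def alerts_missing_runbooks(alert_names, runbooks):
    """Return the alert names that have no runbook location, sorted."""
    return sorted(
        name for name in alert_names
        if not runbooks.get(name, "").strip()  # absent or blank counts as missing
    )


alerts = ["disk_full", "high_cpu", "low_memory", "service_down"]

# Hypothetical mapping of alert name to runbook location.
runbooks = {
    "disk_full": "runbooks/disk_full.txt",
    "high_cpu": "runbooks/high_cpu.txt",
    "service_down": "",  # entry exists but no runbook has been written yet
}

missing = alerts_missing_runbooks(alerts, runbooks)
print(missing)  # ['low_memory', 'service_down']
```

An empty result means every alert type has a runbook; anything listed should get at least a minimal one before go-live.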

Step 4: Test and document

Trigger a test alert; document where to find dashboards and runbooks; schedule reviews.
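
To keep scheduled reviews from slipping, the last review date of each runbook can be tracked and compared against a maximum age. This is a sketch under assumptions: the 90-day window stands in for a quarterly review, and the dates and names are invented for the example.

```python
from datetime import date, timedelta


def overdue_reviews(last_reviewed, today, max_age_days=90):
    """Return names whose last review is older than max_age_days, sorted."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(
        name for name, reviewed in last_reviewed.items() if reviewed < cutoff
    )


# Hypothetical review log: runbook name -> date of last review.
last_reviewed = {
    "disk_full": date(2024, 1, 10),
    "service_down": date(2024, 5, 2),
}

overdue = overdue_reviews(last_reviewed, today=date(2024, 6, 1))
print(overdue)  # ['disk_full']
```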

Verification

  • All checklist items are complete, a test alert has been received, the runbook has been followed once, and the team knows how to respond.
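
The verification can also be tracked as data, so that open items are listed explicitly instead of being remembered. The sketch below is one possible shape; the item names mirror this checklist, while the done flags are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    name: str
    done: bool


def open_items(items):
    """Return the names of checklist items that are not done yet."""
    return [item.name for item in items if not item.done]


def ready_for_go_live(items):
    """Go-live is allowed only when no checklist item is open."""
    return not open_items(items)


# Example state; the done flags are illustrative only.
checklist = [
    ChecklistItem("metrics collected and visible", True),
    ChecklistItem("alerts defined and routed", True),
    ChecklistItem("logs centralized and searchable", True),
    ChecklistItem("health check alerting", True),
    ChecklistItem("runbook for every alert", False),
    ChecklistItem("test alert received and acknowledged", False),
]

print(ready_for_go_live(checklist))  # False
print(open_items(checklist))
```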

Troubleshooting

  • Alert not received: check routing and notification config; test the channel.
  • Runbook missing: add at least a minimal runbook (what to check, who to escalate to) before go-live.
