How to set up disk, CPU, and memory alerts

Topic: Monitoring basics

Summary

Define alert rules for disk space, CPU usage, and memory (or swap) so you are notified before outages. Use thresholds and hysteresis to avoid flapping. Use this when configuring a monitoring system (e.g. Prometheus and Alertmanager, or cloud monitoring).

Intent: How-to

Quick answer

  • Disk: alert when filesystem usage exceeds 85% (warning) and 95% (critical). Compute usage as 1 - node_filesystem_avail_bytes / node_filesystem_size_bytes, or the equivalent in your monitoring system. Alert on the mount points that matter (e.g. /, /var).
  • CPU: alert when usage stays high for a sustained period (e.g. 5m average >80%) so that brief spikes are ignored.
  • Memory: alert when available memory is low (e.g. <10% of total, using node_memory_MemAvailable_bytes) or when swap usage is growing.
  • Use a for clause (e.g. for: 5m) so an alert fires only after the condition has held for that long; add hysteresis (a lower threshold for resolving) if needed. Route alerts to email, Slack, or PagerDuty, and document runbooks.

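Prerequisites

  • Host metrics are already being collected. The node_* metric names used below are those exported by the Prometheus node_exporter; other monitoring systems expose equivalents.
  • An alert delivery path (Alertmanager or a cloud alerting service) is available.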

Steps

  1. Disk alerts

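    Rule: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes > 0.85, held for 5m (warning). Use 0.95 for critical. Filter with a mountpoint label matcher (e.g. mountpoint=~"/|/var") so that only the filesystems that matter are evaluated. The alert resolves when usage drops back below the threshold.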
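
    The rule above could be written as a Prometheus rule file along the following lines. This is a minimal sketch, not a definitive configuration: the file name, group name, and alert names are hypothetical, and the summary text is taken from the runbook example in step 4.

```yaml
# disk_alerts.yml (hypothetical file name)
groups:
  - name: disk-alerts            # hypothetical group name
    rules:
      # Warning: more than 85% of the filesystem used for 5 minutes.
      - alert: DiskUsageWarning  # hypothetical alert name
        expr: |
          (
            node_filesystem_size_bytes{mountpoint=~"/|/var"}
            - node_filesystem_avail_bytes{mountpoint=~"/|/var"}
          )
          / node_filesystem_size_bytes{mountpoint=~"/|/var"} > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High disk: see runbook disk-full"

      # Critical: same expression with the 95% threshold.
      - alert: DiskUsageCritical # hypothetical alert name
        expr: |
          (
            node_filesystem_size_bytes{mountpoint=~"/|/var"}
            - node_filesystem_avail_bytes{mountpoint=~"/|/var"}
          )
          / node_filesystem_size_bytes{mountpoint=~"/|/var"} > 0.95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High disk: see runbook disk-full"
```

    Prometheus regex matchers are fully anchored, so mountpoint=~"/|/var" matches exactly / and /var and nothing else. Above 95% both alerts fire; the severity label lets the routing layer decide which one pages.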

  2. CPU and memory alerts

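    CPU: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80 for 5m. Keep the by (instance) grouping; without it, avg() averages idle time across every scraped host and produces a single fleet-wide value.
    Memory: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1 for 5m.
    Swap: alert when swap in use (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) is growing or exceeds a threshold you choose.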
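
    As a sketch, the CPU, memory, and swap conditions might be expressed as the rule group below. The group and alert names are hypothetical, and the swap thresholds (50% of swap in use, 100 MiB less free swap over 15 minutes) are illustrative assumptions rather than recommendations from this guide.

```yaml
groups:
  - name: cpu-memory-alerts        # hypothetical group name
    rules:
      # Per-host CPU usage, derived from idle time, above 80% for 5 minutes.
      - alert: HighCpuUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning

      # Less than 10% of total memory available for 5 minutes.
      - alert: LowMemoryAvailable
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
        for: 5m
        labels:
          severity: warning

      # Swap in use above a fixed fraction of total swap (0.5 is an assumed value).
      # Hosts without swap divide by zero, yield NaN, and never match.
      - alert: HighSwapUsage
        expr: |
          (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes)
          / node_memory_SwapTotal_bytes > 0.5
        for: 10m
        labels:
          severity: warning

      # Swap usage growing: free swap fell by more than 100 MiB (104857600 bytes,
      # an assumed value) over the last 15 minutes.
      - alert: SwapUsageGrowing
        expr: delta(node_memory_SwapFree_bytes[15m]) < -104857600
        for: 5m
        labels:
          severity: warning
```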

  3. For and hysteresis

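    Use for: 5m so transient spikes do not alert. Optionally, use tiers (e.g. warning at 80%, critical at 95%) and resolve only when the value falls below a lower threshold (e.g. 75%) to avoid flapping. In Prometheus, a plain threshold rule resolves as soon as its expression stops matching, so a lower resolve threshold has to be written into the rule itself.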
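
    One way to approximate a separate resolve threshold in Prometheus is to let the rule consult the built-in ALERTS series, which records alerts that are currently pending or firing. The sketch below fires above 80% and, once firing, keeps matching until CPU usage falls to 75% or lower. The alert name is hypothetical, and this pattern is a community technique shown here under that assumption, not something this guide prescribes.

```yaml
groups:
  - name: cpu-hysteresis           # hypothetical group name
    rules:
      - alert: HighCpuUsage        # must match the alertname used in ALERTS below
        expr: |
          (
            100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
          )
          or
          (
            (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 75)
            and on (instance)
            ALERTS{alertname="HighCpuUsage", alertstate="firing"}
          )
        for: 5m
        labels:
          severity: warning
```

    The first branch starts the alert; the second branch only matches for instances whose alert is already firing, which creates the 75% to 80% dead band. A simpler, time-based alternative on Prometheus 2.42 or later is the keep_firing_for field (e.g. keep_firing_for: 10m), which holds a firing alert open for a fixed period after the condition clears.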

  4. Route and runbook

    Send alerts to Alertmanager or your cloud alerting service, and route them to a team channel or PagerDuty. Attach a runbook link or a short summary to each alert (e.g. 'High disk: see runbook disk-full'), and document how to fix the underlying problem.
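
    For Alertmanager, the routing could look roughly like the sketch below: everything goes to a team Slack channel by default, and alerts labelled severity="critical" go to PagerDuty. Receiver names, the channel, the webhook URL, and the routing key are placeholders, and the grouping and timing values are assumptions to tune for your team.

```yaml
# alertmanager.yml (sketch; every name and credential below is a placeholder)
route:
  receiver: team-slack             # default receiver for anything not matched below
  group_by: ['alertname', 'instance']
  group_wait: 30s                  # wait briefly to batch alerts that start together
  group_interval: 5m
  repeat_interval: 4h              # re-notify while an alert stays firing
  routes:
    - matchers:
        - severity="critical"
      receiver: oncall-pagerduty

receivers:
  - name: team-slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/WITH/WEBHOOK
        channel: '#ops-alerts'
  - name: oncall-pagerduty
    pagerduty_configs:
      - routing_key: REPLACE_WITH_INTEGRATION_KEY
```

    Routing on the severity label keeps paging limited to critical alerts, while warnings stay visible in the channel.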

Summary

Define disk, CPU, and memory alert rules with thresholds and a for clause; route the alerts and add runbooks. Use this to get notified before resources are exhausted.

Steps

Step 1: Disk alerts

Alert when filesystem usage is above 85% (warning) and 95% (critical) for a sustained period.

Step 2: CPU and memory alerts

Alert on sustained high CPU and low available memory; optionally on swap growth.

Step 3: For and hysteresis

Use a for clause to avoid flapping; use a different resolve threshold if needed.

Step 4: Route and runbook

Route alerts to the right channel; link or describe runbooks for each alert.

Verification

  • Each alert fires when its condition is met and resolves when the condition clears.
  • Every alert links to or names a runbook.
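
Firing behaviour can also be checked offline with promtool's rule unit tests. The sketch below assumes a hypothetical rule file disk_alerts.yml containing a DiskUsageWarning alert with the disk expression from step 1 (mountpoint filter included), for: 5m, a severity: warning label, and a summary annotation of 'High disk: see runbook disk-full'. If your rule differs, the expected labels and annotations must be adjusted to match it exactly.

```yaml
# disk_alerts_test.yml (hypothetical file name)
rule_files:
  - disk_alerts.yml                # hypothetical rule file under test

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Synthetic filesystem: size 100 bytes, 10 bytes free, so 90% used.
      # '100+0x20' means start at 100 and add 0 for 20 further samples.
      - series: 'node_filesystem_size_bytes{instance="host1", mountpoint="/"}'
        values: '100+0x20'
      - series: 'node_filesystem_avail_bytes{instance="host1", mountpoint="/"}'
        values: '10+0x20'
    alert_rule_test:
      # At 3m the condition is true but the 5m for clause has not elapsed,
      # so the alert is still pending and no firing alert is expected.
      - eval_time: 3m
        alertname: DiskUsageWarning
        exp_alerts: []
      # At 10m the condition has held long enough and the alert should fire.
      - eval_time: 10m
        alertname: DiskUsageWarning
        exp_alerts:
          - exp_labels:
              severity: warning
              instance: host1
              mountpoint: "/"
            exp_annotations:
              summary: "High disk: see runbook disk-full"
```

Run it with promtool test rules disk_alerts_test.yml. A similar test with the free space raised above 15% partway through the series can confirm that the alert resolves.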

Troubleshooting

  • Alert storm: increase the for clause duration or the thresholds; add hysteresis.
  • Missing alerts: check metric names and labels; verify that scraping and alert rule evaluation are working.

Next steps

Continue to