Incident response checklist for Linux servers

Topic: Linux servers

Summary

When something is wrong: preserve logs and state, identify scope (one host or many), restore service or isolate, then root-cause and fix. Use this so incidents are handled consistently and evidence is kept for post-mortem.

Intent: Checklist

Quick answer

  • Preserve: copy logs (journalctl, /var/log), process list (ps), network (ss); do not reboot or wipe until you have what you need for diagnosis.
  • Scope: one server or many? one service or whole system? check monitoring and dependent systems; communicate to stakeholders.
  • Restore: fix config, restart service, free disk, or fail over; then root-cause (logs, changes, timeline) and document post-mortem and prevention.

Steps

  1. Preserve evidence

    journalctl -u UNIT -b > unit.log
    cp -a /var/log /backup/logs-$(date +%F)
    ps aux > ps.txt
    ss -tulnp > ss.txt

    Note the time and the last changes.

  2. Determine scope

    Is the app down or the whole server? Check other hosts and dependencies; check monitoring and recent deploys or config changes.

  3. Restore or isolate

    Restart service, free disk, revert config, or fail over to standby; if compromise suspected, isolate the host from the network and escalate.

  4. Root-cause and document

    Review logs and timeline; identify cause (config, resource, bug, external); write post-mortem: what happened, cause, fix, prevention; update runbooks.

Summary

You will follow an incident checklist: preserve logs and state, determine scope, restore or isolate, then root-cause and document. Use this for any outage or suspected compromise so response is consistent and repeatable.

Prerequisites

  • Access to the server (SSH or console); backup and monitoring if available.

Steps

Step 1: Preserve evidence

Save logs, process list, and network state before rebooting or reconfiguring.
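
As a rough sketch of this step, the capture commands can be wrapped in one function so that everything lands in a single timestamped directory. The function name `collect_evidence`, its two arguments, and the output file names are assumptions made for illustration; each tool is skipped if the host does not have it.

```shell
# Sketch only: collect_evidence and its file layout are illustrative, not a standard tool.
collect_evidence() {
    unit="$1"                                   # systemd unit to capture
    dest="$2/evidence-$(date +%Y%m%dT%H%M%S)"   # one directory per capture
    mkdir -p "$dest" || return 1

    # Record the capture time in UTC for the later timeline.
    date -u +%Y-%m-%dT%H:%M:%SZ > "$dest/captured_at.txt"

    # Each tool is optional; a missing or failing tool does not abort the capture.
    if command -v journalctl >/dev/null 2>&1; then
        journalctl -u "$unit" -b --no-pager > "$dest/unit.log" 2>&1 || true
    fi
    if command -v ps >/dev/null 2>&1; then
        ps aux > "$dest/ps.txt" 2>&1 || true
    fi
    if command -v ss >/dev/null 2>&1; then
        ss -tulnp > "$dest/ss.txt" 2>&1 || true
    fi

    # Print the directory so the caller can archive or copy it off the host.
    echo "$dest"
}
```

A call such as `collect_evidence UNIT /backup` would print the directory it created. Running it as root is likely needed for `ss -p` to show the owning process of every socket.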

Step 2: Determine scope

One host or many? One service or system-wide? Check dependencies and recent changes.
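
One way to make the "one host or many" question concrete is to gather one status line per host and summarize it. The sketch below assumes the input is lines of the form `HOST STATUS`, where STATUS is what `systemctl is-active UNIT` printed on that host (for example collected over SSH); `summarize_scope` and the output labels are hypothetical names chosen for this example.

```shell
# Sketch only: summarize_scope is an illustrative helper.
# Reads "HOST STATUS" lines on stdin and prints a one-line scope summary.
summarize_scope() {
    awk '
        NF < 2 { next }                       # ignore blank or malformed lines
        { total++ }
        $2 != "active" { failed++; names = names " " $1 }
        END {
            if (total == 0)       print "no-data"
            else if (failed == 0) print "none-failing"
            else if (failed == 1) print "single-host:" names
            else                  print "multi-host:" names
        }
    '
}
```

A multi-host result suggests looking at shared dependencies and recent deploys first, while a single-host result points at that machine's own config or resources.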

Step 3: Restore or isolate

Fix and restart, or fail over; if compromise suspected, isolate and escalate.
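
Before restarting a service it can help to rule out a full disk, since a restart will not stick if the service cannot write. This is a minimal sketch under that assumption; `usage_pct` and `disk_over` are hypothetical helper names, and the threshold is whatever the operator chooses.

```shell
# Sketch only: usage_pct and disk_over are illustrative helpers.

# Print the usage percentage (digits only) of the filesystem holding path $1.
usage_pct() {
    # df -P uses the POSIX output format: one line per filesystem,
    # with the capacity (such as "42%") in field 5.
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed if the filesystem holding $1 is at or above $2 percent used.
disk_over() {
    pct=$(usage_pct "$1")
    [ -n "$pct" ] || return 2      # could not read usage
    [ "$pct" -ge "$2" ]
}
```

For example, `disk_over /var 90 && echo "free space before restarting"` would flag a nearly full /var; the restart itself would then be `systemctl restart UNIT` once space is available.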

Step 4: Root-cause and document

Analyze logs and timeline; write post-mortem and update runbooks and monitoring.
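
To keep post-mortems consistent, the sections named in this checklist can be stamped out as a skeleton file. The sketch below is one possible layout; `new_postmortem`, its arguments, and the exact headings are assumptions, not a required format.

```shell
# Sketch only: new_postmortem and its headings are illustrative.
# Writes a plain-text post-mortem skeleton titled $1 to the file $2.
new_postmortem() {
    title="$1"
    file="$2"
    cat > "$file" <<EOF
Post-mortem: $title
Date: $(date +%F)

What happened

Timeline

Cause

Fix

Prevention

Runbook and monitoring updates
EOF
}
```

The timeline section is usually filled from the saved evidence, for instance the unit log and the capture time recorded in Step 1.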

Verification

  • Service restored or host isolated; evidence saved; post-mortem written and shared.

Troubleshooting

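Need to reboot to restore: take a snapshot or copy logs off the host first, then reboot and continue diagnosis from the saved data. The process list and socket state are lost on reboot, and so is the journal if journald is using volatile storage.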
