Database disaster recovery basics

Topic: Databases core

Summary

Define a recovery point objective (RPO) and a recovery time objective (RTO) for each database, then use backups and, optionally, replication to meet them. Recover by restoring from backup or by failing over to a replica; document and test the procedure either way. Use this when planning or executing database recovery after a failure or data loss.

Intent: How-to

Quick answer

  • RPO: maximum acceptable data loss, measured as a time window (e.g. 1 hour). It drives backup or replication frequency. RTO: maximum acceptable downtime. It drives restore speed and whether you need a hot standby.
  • Backup and restore: restore from the last good backup and accept the loss of data written since that backup. Replication: promote a replica to fail over; data loss is limited to the replication lag, which is near zero with synchronous replication. Test both: restore to a staging DB and run a failover drill.
  • Document: backup location, restore command, who runs it, and the order of operations (e.g. restore the DB, then start the app). Keep credentials and the runbook where they remain available during an outage (e.g. a second region or offline).

Steps

  1. Define RPO and RTO

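    RPO: how much data loss is acceptable (e.g. 1 hour). RTO: how long until the system must be back (e.g. 4 hours). These determine backup frequency and whether you need replication and a standby. For example, an RPO of 1 hour cannot be met by nightly backups alone, because a failure just before the next backup loses almost a full day of writes.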

  2. Backup and restore procedure

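    Document where backups are stored, the exact restore command (e.g. pg_restore for PostgreSQL, or the mysql client for MySQL dumps), and the order of restore (globals such as roles first, then the database). Estimate the restore time. Before restoring, confirm that the backup predates the incident and is not corrupted.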

  3. Replication and failover

    If you have a replica, document the promotion procedure and how clients are repointed to the new primary. Test failover and measure the data loss, which equals the replication lag at failover time. After failover, re-establish a replica from the new primary.

  4. Test and update

    Run restore and failover tests on a schedule (e.g. quarterly). Update the runbook after every change, and confirm that credentials and access still work when the primary is down.

Summary

Set RPO and RTO; document the backup, restore, and optional failover procedures; test them regularly. Use this to plan and execute database disaster recovery.

Steps

Step 1: Define RPO and RTO

Decide acceptable data loss and downtime; use them to drive backup and replication design.
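
One way to make the objectives actionable is to check a proposed backup plan against them. The sketch below is illustrative only: the RecoveryObjectives class and the plan_gaps function are hypothetical names, and the figures are the example values from this article (RPO 1 hour, RTO 4 hours). It treats the backup interval as the worst-case data loss, which holds when backups are the only protection.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rpo: timedelta  # maximum acceptable data loss, as a time window
    rto: timedelta  # maximum acceptable downtime


def plan_gaps(objectives, backup_interval, estimated_restore_time):
    """Return a list of gaps between the plan and the objectives (empty if none)."""
    gaps = []
    # Worst case: the failure happens just before the next backup would run.
    if backup_interval > objectives.rpo:
        gaps.append(
            f"RPO at risk: backup interval {backup_interval} exceeds RPO {objectives.rpo}"
        )
    # Restore time alone must fit in the RTO; detection and decision time come on top.
    if estimated_restore_time > objectives.rto:
        gaps.append(
            f"RTO at risk: restore estimate {estimated_restore_time} exceeds RTO {objectives.rto}"
        )
    return gaps


objectives = RecoveryObjectives(rpo=timedelta(hours=1), rto=timedelta(hours=4))
for gap in plan_gaps(objectives, timedelta(hours=24), timedelta(hours=2)):
    print(gap)
```

With nightly backups and a 1-hour RPO, the check reports an RPO gap, which is the signal to shorten the backup interval or add replication.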

Step 2: Backup and restore procedure

Document backup location and restore steps; estimate restore time; verify backup integrity.
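
A runbook can capture the restore order as data rather than prose. The sketch below assumes PostgreSQL, a custom-format dump created with pg_dump, and a globals file created with pg_dumpall --globals-only; the paths, the database name, and the helper function names are hypothetical. It only prints the commands unless dry_run is turned off. Note that pg_restore --list reads the archive's table of contents, so it is a basic readability check, not proof that every data block is intact.

```python
import subprocess


def verify_archive_cmd(dump_path):
    """Command that lists the archive's table of contents; it fails on an unreadable file."""
    return ["pg_restore", "--list", dump_path]


def restore_cmds(globals_sql, dump_path, dbname, jobs=4):
    """Commands in restore order: globals (roles, tablespaces) first, then the database."""
    return [
        ["psql", "--dbname", "postgres", "--file", globals_sql],
        # --jobs requires a custom-format or directory-format archive.
        ["pg_restore", "--dbname", dbname, "--jobs", str(jobs), "--exit-on-error", dump_path],
    ]


def run(cmds, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in cmds:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Hypothetical paths and database name; the target database must already exist.
plan = [verify_archive_cmd("/backups/app.dump")] + restore_cmds(
    "/backups/globals.sql", "/backups/app.dump", "app"
)
run(plan, dry_run=True)
```

Keeping the commands in one ordered list makes the restore order explicit and lets a drill run exactly what the runbook documents.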

Step 3: Replication and failover

Document promote and client switchover; test; plan for re-establishing replica after failover.
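
To record the data loss of a failover drill, it helps to compare the time the primary failed with the last transaction the replica had replayed. The sketch below assumes PostgreSQL, where pg_last_xact_replay_timestamp() can be queried on a standby and pg_promote() (PostgreSQL 12 and later) promotes it; the two SQL strings are shown for reference only and are not executed here. The function names are hypothetical. The result is an upper bound: if the primary was idle before it failed, the real loss is smaller.

```python
from datetime import datetime, timedelta, timezone

# Reference SQL for a PostgreSQL standby (not executed in this sketch).
LAST_REPLAY_QUERY = "SELECT pg_last_xact_replay_timestamp()"
PROMOTE_QUERY = "SELECT pg_promote()"


def estimated_data_loss(primary_failed_at, last_replayed_at):
    """Upper bound on the window of writes lost when promoting the replica."""
    return max(primary_failed_at - last_replayed_at, timedelta(0))


def within_rpo(loss, rpo):
    """True if the estimated loss fits inside the recovery point objective."""
    return loss <= rpo


# Example values for a drill; both timestamps are made up for illustration.
failed_at = datetime(2024, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
replayed_at = datetime(2024, 6, 1, 11, 59, 30, tzinfo=timezone.utc)
loss = estimated_data_loss(failed_at, replayed_at)
print(loss, within_rpo(loss, rpo=timedelta(hours=1)))
```

Recording this figure after each drill gives the "expected loss" number that the runbook should state.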

Step 4: Test and update

Run restore and failover tests; update runbooks and fix gaps.
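
A small check can flag drills that have slipped past their schedule. The sketch below is a hypothetical helper, not part of any tool: the drill names are invented, and the 92-day limit is an assumed stand-in for "quarterly".

```python
from datetime import date, timedelta


def overdue_drills(last_run, today, max_age=timedelta(days=92)):
    """Return the names of drills whose last run is older than max_age, sorted."""
    return sorted(name for name, ran in last_run.items() if today - ran > max_age)


# Hypothetical drill log: drill name -> date of the last successful run.
last_run = {
    "restore-to-staging": date(2024, 1, 10),
    "replica-failover": date(2024, 5, 2),
}
print(overdue_drills(last_run, today=date(2024, 6, 1)))  # ['restore-to-staging']
```

Running a check like this from a scheduler turns "test regularly" into something that raises an alert when it stops happening.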

Verification

  • RPO and RTO are documented; restore and failover procedures are tested and current.

Troubleshooting

  • Restore too slow: use parallel restore (pg_restore -j, which requires a custom-format or directory-format archive); consider faster storage or a larger instance.
  • Replica lag at failover: accept the data loss or use synchronous replication; document the expected loss.
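
When a restore is too slow, measuring it is the first step, since the effect of more parallel jobs or faster storage varies by workload. The sketch below is a minimal timing wrapper under the assumption that the restore is a single command; timed_run is a hypothetical helper, and the pg_restore invocation in the comment uses made-up names.

```python
import subprocess
import sys
import time


def timed_run(cmd):
    """Run a command, raise if it fails, and return the elapsed wall-clock seconds."""
    start = time.monotonic()
    subprocess.run(cmd, check=True)
    return time.monotonic() - start


# Stand-in command so the sketch runs anywhere. In a drill, replace it with the
# real restore, e.g. ["pg_restore", "--jobs", "4", "--dbname", "staging", "app.dump"].
elapsed = timed_run([sys.executable, "-c", "pass"])
print(f"restore took {elapsed:.1f} s")
```

Repeating the measurement with different --jobs values on staging hardware that matches production gives a restore estimate that can be compared against the RTO.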
