Disaster Recovery for Cloud Infrastructure:
A Step-by-Step Guide
Disaster recovery planning fails in two predictable ways: teams either skip it entirely ("we'll deal with it if it happens") or over-engineer it beyond their operational capability to maintain. A DR plan that cannot be executed under pressure is not a DR plan — it is a document. This guide covers the three DR tiers, how to choose the right one for each system, and what a working DR programme actually looks like in practice.
RTO and RPO: The Two Numbers That Drive Every Decision
Recovery Time Objective (RTO) is the maximum acceptable downtime — how long your business can survive with a given system offline. Recovery Point Objective (RPO) is the maximum acceptable data loss — how old your most recent backup can be when you restore. These are business decisions, not technical ones. A 4-hour RTO for an internal reporting dashboard is reasonable. A 4-hour RTO for a payment processing system is probably not. Define RTO and RPO for each system separately, because they will be different.
To quantify your RTO requirements, estimate the cost of one hour of downtime for each system: lost revenue (if the system is revenue-generating), support load (extra tickets, calls), SLA penalties you owe to your own customers, and reputational cost (harder to quantify, but real). Compare that cost to the monthly cost differential between DR tiers. This converts a subjective "how important is this system" question into a financial decision.
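This comparison can be sketched as a small calculation. All figures below are hypothetical placeholders, as are the function names; plug in your own estimates per system.

```python
# Hypothetical per-hour outage cost components for one system.
HOURLY_DOWNTIME_COST = {
    "lost_revenue": 8000,    # if the system is revenue-generating
    "support_load": 500,     # extra tickets and calls
    "sla_penalties": 1500,   # owed to your own customers
    "reputation": 1000,      # hard to quantify, but real; rough placeholder
}

def downtime_cost_per_hour(costs: dict) -> float:
    """Sum the per-hour cost components of an outage."""
    return sum(costs.values())

def tier_upgrade_pays_off(rto_reduction_hours: float,
                          expected_incidents_per_year: float,
                          monthly_cost_delta: float,
                          costs: dict) -> bool:
    """Compare expected yearly downtime savings against the extra
    yearly infrastructure spend of the higher DR tier."""
    yearly_savings = (downtime_cost_per_hour(costs)
                      * rto_reduction_hours
                      * expected_incidents_per_year)
    yearly_extra_spend = monthly_cost_delta * 12
    return yearly_savings > yearly_extra_spend

# Example: moving from backup-and-restore (RTO ~6 h) to warm standby
# (RTO ~0.5 h) saves ~5.5 h of downtime per incident.
print(tier_upgrade_pays_off(5.5, 2, 4000, HOURLY_DOWNTIME_COST))
```

The point of the exercise is the structure, not the exact numbers: once the cost components are written down, the tier choice becomes an arithmetic comparison rather than an argument.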
The Three DR Tiers
Tier 1: Backup and Restore
- How it works — Regular backups taken and stored in a separate location. In a disaster, restore from the most recent backup to a new environment.
- RTO — Hours (depends on backup size and restore speed). Not suitable for systems where hours of downtime are unacceptable.
- RPO — Equal to backup frequency. Daily backups = up to 24 hours of data loss. Hourly backups = up to 1 hour.
- Suitable for — Development environments, staging, internal tools, low-traffic applications where hours of downtime are acceptable.
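Because Tier 1 RTO is dominated by restore time, it can be estimated up front rather than discovered during an incident. A minimal sketch, with hypothetical stage durations you should replace with measured values:

```python
def estimated_rto_hours(backup_size_gb: float,
                        restore_speed_gbph: float,
                        detection_hours: float = 0.5,
                        provisioning_hours: float = 1.0,
                        verification_hours: float = 0.5) -> float:
    """Rough backup-and-restore RTO: detect the disaster, stand up a
    new environment, stream the backup back, verify before cutover.
    Stage durations here are illustrative defaults, not benchmarks."""
    restore_hours = backup_size_gb / restore_speed_gbph
    return (detection_hours + provisioning_hours
            + restore_hours + verification_hours)

# Example: a 500 GB backup restoring at 100 GB/h → ~7 h end to end.
print(estimated_rto_hours(500, 100))
```

If the number this produces exceeds the RTO the business assigned to the system, Tier 1 is the wrong tier for it, regardless of how cheap it is.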
Tier 2: Warm Standby
- How it works — A secondary environment runs at reduced capacity in a separate zone or region, with database replication keeping it near-current. Failover involves promoting the standby and redirecting traffic.
- RTO — 15–60 minutes (time to promote standby, verify, update DNS). The most common production tier.
- RPO — Minutes (dependent on replication lag at the time of failure). With synchronous replication, RPO can approach zero.
- Suitable for — Production web applications, databases for customer-facing systems, APIs that generate revenue.
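The promote-verify-redirect sequence above can be sketched as a runbook skeleton. Every function here is a stub standing in for a real cloud-provider or database API call (promote replica, health check, DNS update); the names and the lag check are illustrative, not a specific vendor's interface.

```python
# Stubs: in a real runbook each wraps your provider's API.
def check_replication_lag_seconds() -> float:
    return 3.0  # stub: would query the standby's replication status

def promote_standby() -> None:
    print("standby promoted to primary")  # stub

def verify_health() -> bool:
    return True  # stub: would hit the application's health endpoint

def update_dns(target: str) -> None:
    print(f"DNS now points at {target}")  # stub

def fail_over(max_acceptable_lag_s: float = 30.0) -> bool:
    """Warm-standby failover: confirm replication lag is tolerable
    (this bounds the RPO you accept), promote, verify, then redirect."""
    lag = check_replication_lag_seconds()
    if lag > max_acceptable_lag_s:
        # Decision point for the runbook: accept the data loss,
        # or hold and wait for replication to catch up.
        return False
    promote_standby()
    if not verify_health():
        return False  # do not move traffic onto an unhealthy standby
    update_dns("standby-region")
    return True

print(fail_over())
```

Note the explicit decision points: the runbook for this tier is not just commands but the conditions under which a human aborts or proceeds.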
Tier 3: Hot Standby / Active-Active
- How it works — Full replica running live, serving real traffic. Failure of one site does not cause user-visible downtime — traffic routes automatically to the surviving site.
- RTO — Seconds. Zero for users already connected to the surviving site.
- RPO — Near-zero with synchronous replication. Effectively zero if using multi-master write distribution.
- Suitable for — Tier-1 systems: payment processing, banking core systems, any application where seconds of downtime cause direct financial loss or regulatory breach.
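The "traffic routes automatically" property reduces to a routing decision over healthy sites. A minimal sketch (site names and the pick-first policy are illustrative; real deployments use weighted DNS, anycast, or a global load balancer):

```python
# Map of site -> currently healthy? Hypothetical site names.
SITES = {"us-east": True, "eu-west": True}

def route_request(sites: dict) -> str:
    """Send the request to the first healthy site. With all sites
    serving live traffic, one site failing is invisible to users
    landing on the survivor."""
    healthy = [name for name, ok in sites.items() if ok]
    if not healthy:
        raise RuntimeError("no healthy sites: total outage")
    return healthy[0]

print(route_request(SITES))   # normal operation: us-east serves
SITES["us-east"] = False      # simulate losing an entire site
print(route_request(SITES))   # traffic survives on eu-west
```

The design cost this sketch hides is data: active-active is easy for stateless request routing and hard for writes, which is why the RPO line above depends on synchronous or multi-master replication.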
Building and Testing Your DR Plan
A DR plan has seven components:
1. Define scope — which systems are covered.
2. Map dependencies — what each system needs in order to function.
3. Assign RTO and RPO per system.
4. Select the DR tier that meets those objectives.
5. Write the runbook — step-by-step, with time estimates per step and decision points.
6. Test annually at minimum.
7. Conduct a post-mortem after any real activation.
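Steps 1 through 4 amount to maintaining a per-system register and checking that each chosen tier can actually meet its assigned objective. A minimal sketch, with hypothetical system names, made-up typical-RTO figures for each tier, and an illustrative check:

```python
# Illustrative typical RTOs per tier, in minutes (not vendor guarantees).
TIER_RTO_MINUTES = {"backup_restore": 360, "warm_standby": 60, "hot_standby": 1}

# Hypothetical DR register: scope, dependencies, objectives, chosen tier.
SYSTEMS = [
    {"name": "payments-api", "depends_on": ["payments-db"],
     "rto_minutes": 5, "rpo_minutes": 0, "tier": "hot_standby"},
    {"name": "reporting-dashboard", "depends_on": ["warehouse"],
     "rto_minutes": 480, "rpo_minutes": 1440, "tier": "backup_restore"},
]

def tier_meets_rto(system: dict) -> bool:
    """Step 4 sanity check: does the chosen tier's typical RTO
    satisfy the objective assigned in step 3?"""
    return TIER_RTO_MINUTES[system["tier"]] <= system["rto_minutes"]

for s in SYSTEMS:
    print(s["name"], "OK" if tier_meets_rto(s) else "TIER TOO SLOW")
```

Keeping the register in a checkable form means a tier/objective mismatch surfaces in review, not during the incident.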
Testing is the part most teams skip and the part that matters most. A tabletop exercise — where the team walks through the runbook and identifies gaps — takes half a day and finds problems cheaply. A live DR drill — actually failing over to the standby and confirming recovery — takes a full day and validates that the plan works. Run the tabletop exercise quarterly and the live drill annually. If you cannot run a live DR drill without significant risk, your architecture is too fragile, and that is the real problem to solve.