7.12 Test Disaster Recovery Plans (DRP)

  • Read-through/tabletop
  • Walkthrough
  • Simulation
  • Parallel
  • Full interruption

We should review and test the plan regularly to confirm that it works.  No plan ever failed on paper; most plans fail when we actually put them into action.  We can reduce the risk of failure by practicing often enough that every person knows exactly what to do.

Why do plans fail?

  • People panic because they don’t know their role or what they are supposed to do

  • The necessary people or supplies are not available

  • There is no clear hierarchy for making decisions

  • The plan is unrealistic: it asks people to do things that are physically impossible

  • Tests do not reproduce real stress.  People know that a simulated disaster is simulated, so they are calmer than they would be in a real one.  In a real disaster, especially a natural disaster, responders worry more about their friends, family, home, pets, and personal safety than about the health of the company.

There are many types of tests we can run:

  • Read-Through Test

    • We give everybody a copy of the plan and they read it on their own time

    • Everybody is aware of the plan and can suggest changes

    • This test won’t help much on its own because nobody actually practices the response

  • Structured Walk-Through

    • This is also known as a tabletop exercise

    • We gather all the responders in a room and pretend that a disaster has occurred

    • Ideally, the team doesn’t know anything about the disaster ahead of time

    • Each member follows their role according to the written plan

    • Participants only talk through their actions; they do not actually perform them

    • We can record the walk-through to see how well people respond

  • Simulation Test

    • We present the team with a simulated disaster scenario

    • The team is required to develop a response to the simulation (or multiple responses)

    • We test each response to see how effective it was

    • We might interrupt some actual business activities to see how people react

    • We must be careful not to create a real disaster

  • Parallel Test

    • We move disaster recovery staff to an alternate site and attempt to activate it

    • The staff act as if they are involved in a real disaster

    • The original site stays operational

    • We see if the alternate site can be brought up

  • Full Interruption Test

    • We start up the alternate site and activate it.  In other words, we create a real disaster (but one that we can hopefully reverse).

    • We also shut down the primary site and transfer operations

    • There may be a high risk that the alternate site is not operational after we shut down the primary site

We might not always be able to shut down the primary site without damaging the business, so a full interruption test is not always an option.
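
The five test types form a ladder: each one is more thorough, and more disruptive, than the one before it.  The Python sketch below is one way to picture that ladder and to pick the most thorough test that our constraints allow.  It is only an illustration.  The class DrpTest, its three flags, and the function most_thorough_test are hypothetical names invented here, and marking the simulation test as "hands-on" is our own reading of the bullets above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrpTest:
    """One rung of the test ladder (hypothetical model, not a standard API)."""
    name: str
    hands_on: bool             # responders act, rather than only read or talk
    activates_alternate: bool  # the alternate site is brought up
    primary_shut_down: bool    # the primary site is taken offline


# Ordered from least to most disruptive, following the list above.
TESTS = [
    DrpTest("Read-through", False, False, False),
    DrpTest("Structured walk-through", False, False, False),
    DrpTest("Simulation", True, False, False),   # assumption: treated as hands-on
    DrpTest("Parallel", True, True, False),
    DrpTest("Full interruption", True, True, True),
]


def most_thorough_test(can_activate_alternate: bool, can_stop_primary: bool) -> DrpTest:
    """Return the last test in TESTS that respects both constraints."""
    allowed = [
        t for t in TESTS
        if (can_activate_alternate or not t.activates_alternate)
        and (can_stop_primary or not t.primary_shut_down)
    ]
    # The read-through is always allowed, so the list is never empty.
    return allowed[-1]


# We have an alternate site, but the business cannot tolerate a primary outage.
print(most_thorough_test(can_activate_alternate=True, can_stop_primary=False).name)
```

With these assumptions, an organization that cannot shut down its primary site ends up with the parallel test as its most thorough option.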