Disaster Recovery Testing

Amazing! You just built a nice Disaster Recovery site, and your important virtual machines are being replicated to that location. You even went further and replicated all the important storage volumes as well. Everything looks fantastic. The supplier assures you of a Recovery Point Objective of less than 2 hours and guarantees a Recovery Time Objective of 1 hour. You ask him to fail everything over, and he asks whether you are really sure you want to do this. “Why did I deploy the solution, then?” you ask yourself.

Things you need to know

The Recovery Point Objective (RPO) defines how far back in time you may have to revert after an unplanned failover, that is, the maximum amount of recent data you are prepared to lose. Take it this way: if your RPO is less than or equal to 2 hours, you will lose at most 2 hours of work when a “disaster” occurs at the primary location.

The Recovery Time Objective (RTO) represents the time it takes to do all the magic and get your systems working at the recovery site. It includes the time needed to fail the systems over.

The replication window is the time it takes to perform the replication from your primary site to your secondary or DR site. So, a 1-hour replication window means that it takes 1 hour to copy the data from one site to the other. This window, together with how often replication runs, typically determines the RPO you can achieve.
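To make the link between the replication window and the RPO concrete, here is a minimal sketch under a simplified model that is an assumption, not something a particular product guarantees: replication cycles start at a fixed interval, each cycle copies a snapshot taken at its start, and a cycle finishes before the next one begins. Under that model, the copy at the DR site can be as old as one interval plus one window. The function names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(interval: timedelta, window: timedelta) -> timedelta:
    """Oldest the newest complete DR copy can be, under the assumed model.

    A snapshot taken at time t lands at the DR site at t + window. It stays
    the newest complete copy until the next cycle (started at t + interval)
    finishes at t + interval + window.
    """
    return interval + window


def meets_rpo(interval: timedelta, window: timedelta, rpo: timedelta) -> bool:
    """True if the worst-case data loss stays within the target RPO."""
    return worst_case_data_loss(interval, window) <= rpo


if __name__ == "__main__":
    rpo = timedelta(hours=2)
    # Replicating every hour with a 1-hour window just fits a 2-hour RPO.
    print(meets_rpo(timedelta(hours=1), timedelta(hours=1), rpo))
    # Replicating every 90 minutes with the same window does not.
    print(meets_rpo(timedelta(minutes=90), timedelta(hours=1), rpo))
```

Real replication products may overlap cycles or ship changes continuously, so treat the result as a rough upper bound to check against the supplier's figures.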

Replication

There are many ways you can provide replication for a system to ensure high availability and continuity when a disaster occurs or during planned downtime. Replication can be synchronous (a write is acknowledged only when the secondary location has received it) or asynchronous (the data is transferred intermittently, triggered by a change or a timer). Replication will typically transfer only the delta (the data that has changed) and keep the target 100% consistent with the source; this means that a file corrupted at the primary site will be corrupted at the secondary site when the replication cycle completes.
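The difference between the two modes, and the way corruption follows the data, can be illustrated with a toy in-memory model. This is only a sketch: the `ReplicatedStore` class and its methods are hypothetical and stand in for whatever your hypervisor or storage array actually does.

```python
class ReplicatedStore:
    """Toy model of a primary store replicated to a secondary site."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.primary = {}
        self.secondary = {}
        self.pending = set()  # keys changed since the last cycle (the delta)

    def write(self, key, value):
        self.primary[key] = value
        if self.synchronous:
            # Synchronous: the write only returns once the secondary has it.
            self.secondary[key] = value
        else:
            # Asynchronous: remember the change and ship it later.
            self.pending.add(key)

    def run_replication_cycle(self) -> int:
        """Copy only the changed keys to the secondary; return how many."""
        transferred = len(self.pending)
        for key in self.pending:
            self.secondary[key] = self.primary[key]
        self.pending.clear()
        return transferred


if __name__ == "__main__":
    store = ReplicatedStore(synchronous=False)
    store.write("report.doc", "good contents")
    store.run_replication_cycle()
    store.write("report.doc", "corrupted contents")
    print(store.secondary["report.doc"])  # secondary still has the old copy
    store.run_replication_cycle()
    print(store.secondary["report.doc"])  # corruption has been replicated
```

Replication protects against losing the site, not against bad data: once the cycle runs, the secondary faithfully holds the corrupted file.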

Choosing between synchronous and asynchronous replication depends on your resources and the performance hit you are willing to tolerate. For VMs, you can perform the replication at the VM level or at the storage array level. Software such as Active Directory does not need infrastructure-based replication because it inherently supports synchronization across multiple systems. All you need is another DC at the target location, and the data is replicated at the application level. Always try to use this option when it is available.

Let's Fail Over

  1. Before you perform a planned failover (or DR test), make sure the software supports a planned failover mode that transfers any last-minute data to the target location before the source system is cut off.
  2. What do you plan to do about systems whose IP addresses change? Unless you are using a Layer 2 extension solution that allows you to maintain the same IP addresses across the two sites, the IP addresses of the individual systems will change when you fail them over. You must have a well-documented plan for how you will proceed once the systems are failed over. Let me put this into perspective. A server with IP address 10.10.10.1 speaks to a database at 10.10.10.2. After failover, the IP of the server becomes 10.10.20.1 and the database becomes 10.10.20.2. How is the application going to know that the IP of the database has changed so that it connects to the new IP? You need a clear plan to accommodate those changes. You can take advantage of scripting to make this part less labor-intensive.
  3. You need a way to let users know they will be running from the DR site and a way to get them onto the new system addresses. This can be accomplished by changing the DNS records of the failed-over systems (provided you rely on DNS for system access).
  4. For physical systems, you may have replicated the important volumes. You still need to log into the physical systems to mount those volumes and start/test the applications.
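As an illustration of the scripting mentioned in point 2, the sketch below rewrites the IP addresses found in an application's configuration text so that they point at the DR subnet. It assumes, purely for the example, that the two sites use 10.10.10.0/24 and 10.10.20.0/24 and that each host keeps the same position in its subnet; the subnet masks, function names, and sample configuration are not from any specific product.

```python
import ipaddress
import re

# Matches dotted-quad tokens that are not part of a longer run of digits.
IPV4_PATTERN = re.compile(r"(?<!\d)(?:\d{1,3}\.){3}\d{1,3}(?!\d)")


def remap_ip(address: str, primary_net: str, dr_net: str) -> str:
    """Map an address in the primary subnet to the same host offset at DR."""
    addr = ipaddress.ip_address(address)
    src = ipaddress.ip_network(primary_net)
    dst = ipaddress.ip_network(dr_net)
    if src.prefixlen != dst.prefixlen:
        raise ValueError("subnets must be the same size")
    if addr not in src:
        raise ValueError(f"{address} is not in {primary_net}")
    offset = int(addr) - int(src.network_address)
    return str(dst.network_address + offset)


def rewrite_config(text: str, primary_net: str, dr_net: str) -> str:
    """Replace every primary-site address in the text; leave others alone."""

    def swap(match: re.Match) -> str:
        try:
            return remap_ip(match.group(0), primary_net, dr_net)
        except ValueError:
            return match.group(0)  # not a primary-site address

    return IPV4_PATTERN.sub(swap, text)


if __name__ == "__main__":
    config = "db_host=10.10.10.2:5432\nntp_server=192.168.1.1\n"
    print(rewrite_config(config, "10.10.10.0/24", "10.10.20.0/24"))
```

Where applications already reach each other by DNS name rather than by IP address, updating the DNS records as in point 3 may remove the need to touch configuration files at all.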

Let's Fail Back

We are going to assume you were able to do everything described in the previous section and the test was a success. All selected components are now operating at the DR site, and you are ready to bring them back to the primary site. Well, it is not that easy. You need to replicate in the reverse direction so that the changes made while the systems were running at the DR site are brought back. Once that reverse synchronization is done, you can safely fail the systems back and make sure all the application parameters are brought back to their original state.
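The ordering rule is the same in both directions: never switch sites while the active site holds changes the other site has not received. A runbook script can enforce that with a simple guard. The sketch below is a hypothetical model (the `ProtectedSystem` class is invented for illustration), not the interface of any replication product.

```python
class ProtectedSystem:
    """Toy model of one system that can run at the primary or the DR site."""

    def __init__(self):
        self.active_site = "primary"
        self.pending_changes = 0  # changes not yet copied to the standby site

    def record_change(self):
        self.pending_changes += 1

    def replicate_to_standby(self) -> int:
        """Copy outstanding changes to the standby site; return the count."""
        copied = self.pending_changes
        self.pending_changes = 0
        return copied

    def switch_site(self):
        """Fail over or fail back, refusing if data would be left behind."""
        if self.pending_changes:
            raise RuntimeError(
                f"{self.pending_changes} change(s) not replicated; sync first"
            )
        self.active_site = "dr" if self.active_site == "primary" else "primary"


if __name__ == "__main__":
    system = ProtectedSystem()
    system.replicate_to_standby()
    system.switch_site()           # planned failover to DR
    system.record_change()         # users work at the DR site
    system.replicate_to_standby()  # reverse replication
    system.switch_site()           # failback to primary
    print(system.active_site)
```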

Summary

Phew! That was a lot of work. Yes, it is a lot of work the first time you do it. The key is to document all the steps and use scripting where appropriate. Running a DR test is never as easy as it sounds, but it is key to meeting industry standards and ensuring business continuity. We advise running a DR test at least twice a year.

Next Step

Our team of experienced systems engineers is always available to discuss your DR strategy and recommend the right solution for every system. To discuss your disaster recovery requirements and possible solutions, call us on +233.54.431.5710 or write to sales@apotica.net.

About Apotica

Apotica, headquartered in Accra, Ghana, brings together the best information and communications technologies to help clients grow, compete and serve their customers better.