Disaster Recovery: what happens when it all goes down
The beauty of a well-designed disaster recovery plan and designated solution(s) is that initiation of failover and recovery should be largely automated. By reducing the need for human intervention, the disaster recovery protocol handles failover itself, freeing your in-house IT team to troubleshoot and repair the root cause of the outage.
But what actually happens between the detection of a disaster and the resumption of normal service? The exact details – including timelines – will differ from business to business for infrastructure reasons, but the basic workflow looks like this:
15 seconds after failure
Heartbeat monitoring detects that a critical virtual server/application is no longer responding. The disaster recovery protocol is initiated automatically.
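The detection step can be sketched as a simple polling loop. This is a minimal illustration, not any vendor's implementation: the probe function, the three-miss threshold and the check interval are all assumptions chosen so that a sustained failure is confirmed within roughly 15 seconds.

```python
import time

FAILURE_THRESHOLD = 3  # consecutive missed heartbeats before declaring a disaster
CHECK_INTERVAL = 5     # seconds between probes (3 misses ~= 15 s to detection)

def monitor(probe, on_failure, max_checks=10, sleep=time.sleep):
    """Poll a health probe; initiate the DR protocol after
    FAILURE_THRESHOLD consecutive failed checks."""
    misses = 0
    for _ in range(max_checks):
        if probe():
            misses = 0  # healthy response resets the counter
        else:
            misses += 1
            if misses >= FAILURE_THRESHOLD:
                on_failure()  # hand off to the automated failover
                return True
        sleep(CHECK_INTERVAL)
    return False

# Simulated probe: the server stops responding after the second check.
responses = iter([True, True, False, False, False])
events = []
triggered = monitor(lambda: next(responses, False),
                    lambda: events.append("failover"),
                    max_checks=5, sleep=lambda _: None)
```

Requiring several consecutive misses, rather than reacting to a single failed probe, avoids triggering a full failover on a transient network blip.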
16 seconds after failure
The disaster recovery system automatically updates network settings, reconfiguring IP addresses and DNS records to point to the failover site. All client requests are now redirected to applications and data stores in the remote data centre. If required, the in-house IT team is alerted to carry out any manual operations needed to complete the failover.
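In essence, the redirection step repoints a hostname from the primary site to the failover site. A minimal sketch, using an in-memory zone and made-up addresses in place of a real DNS provider's API (which production systems would call instead):

```python
# Hypothetical in-memory DNS zone; a real deployment would call the
# DNS provider's update API rather than mutate a dict.
dns_zone = {"app.example.com": "10.0.0.10"}  # primary data centre (assumed)

FAILOVER_IP = "203.0.113.50"  # assumed failover-site address

def fail_over(zone, hostname, failover_ip):
    """Repoint the hostname at the failover site; return the old
    address so failback can restore it later."""
    old_ip = zone[hostname]
    zone[hostname] = failover_ip
    return old_ip

primary_ip = fail_over(dns_zone, "app.example.com", FAILOVER_IP)
```

Keeping hold of the original address is what makes the later failback step a simple reversal of the same operation. In practice, short DNS TTLs are also needed so that clients pick up the new address quickly.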
At the same time, replication systems (such as Carbonite DoubleTake or Zerto) in the local data centre pause after capturing any changes made since the outage was detected. The replication system at the remote data centre begins recording changes to data in readiness for the resumption of normal service.
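Conceptually, each side keeps a journal of changes, and the outage flips which journal is live. The sketch below is a simplified stand-in for the vendor replication engines named above, not their actual behaviour:

```python
class ReplicationJournal:
    """Minimal change journal, standing in for a vendor replication
    engine. Records (key, value) changes while active."""

    def __init__(self):
        self.entries = []
        self.active = True

    def record(self, key, value):
        if self.active:
            self.entries.append((key, value))

    def pause(self):
        """Stop capturing further changes (flush assumed complete)."""
        self.active = False

local = ReplicationJournal()   # journal at the failed primary site
remote = ReplicationJournal()  # journal at the failover site

local.record("orders", b"rows 1-100")   # change captured before the outage
local.pause()                           # outage detected: local capture stops
remote.record("orders", b"rows 1-120")  # remote now journals live changes
local.record("orders", b"stale")        # ignored: local capture is paused
```

The remote journal is what later allows the failback process to replay everything that changed during the outage back onto the repaired primary systems.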
60 seconds after failure
The in-house IT team begins troubleshooting the local failure. After verifying that the failover has completed correctly, work begins in earnest to identify and correct the application, server or infrastructure error that caused the local failure.
Because the failover is completely transparent to operations, the IT team can take their time. Rather than rushing to develop a temporary fix or workaround, the failover site provides them with the necessary breathing space to develop a “proper” fix that will prevent a recurrence.
180 minutes after failure
Having identified the cause of the failure, the IT team test on-site systems to ensure that the issue has been fully resolved, and that there are no secondary issues that need to be fixed.
Once testing is complete, the network team manually trigger the failback system. IP and DNS address settings are reverted to their defaults, and traffic is redirected to the on-site data centre.
181 minutes after failure
As soon as the settings change is detected, replication platforms begin synchronising all changes made during the outage. Each system also verifies that the changes are being applied correctly to prevent data loss.
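The resynchronisation step amounts to replaying the changes journaled at the failover site onto the primary store, verifying each write as it lands. A minimal sketch, with SHA-256 checksums standing in for whatever integrity checks a real replication platform performs:

```python
import hashlib

def checksum(data: bytes) -> str:
    """Content fingerprint used to verify each replayed write."""
    return hashlib.sha256(data).hexdigest()

def resync(primary, journal):
    """Apply changes journaled during the outage back to the primary
    store, confirming each write landed intact before moving on."""
    applied = []
    for key, value in journal:
        primary[key] = value
        if checksum(primary[key]) != checksum(value):
            raise RuntimeError(f"verification failed for {key!r}")
        applied.append(key)
    return applied

# Primary store as it stood when the outage began, plus the changes
# journaled at the failover site during the outage (illustrative data).
primary_store = {"orders": b"rows 1-100"}
outage_journal = [("orders", b"rows 1-120"), ("invoices", b"rows 1-15")]
applied = resync(primary_store, outage_journal)
```

Verifying each change as it is applied, rather than after the whole merge, means a corrupted write is caught immediately instead of surfacing as silent data loss later.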
The failback process is transparent to end users, who will not notice any loss of service or availability issues, even as data changes are merged.
Further simplifying the DR process
The most complex aspect of the above scenario is the use of two data centres, which effectively doubles the potential for the kinds of failure that will invoke the DR protocol. Moving key applications into a cloud service such as Microsoft Azure increases resilience and significantly reduces management overhead.
Cloud services are built to maximise availability from the outset: customer data and systems are replicated between cloud data centres automatically. If a service fails in its host data centre, Azure fails over seamlessly to one of its other globally distributed facilities, with backup VM images and data available from a different region. The CTO therefore avoids having to manage the process of bringing systems and applications back online.
The cloud approach can be further enhanced using the same failover technologies described in the scenario above. Instead of replicating between two physical data centres, however, the process can be managed between cloud services, offering maximum control and availability.
To learn more about the DR failover process and how your business could improve systems availability, please get in touch.