Server Outages: How to Recover Faster

Server outages will happen, and they'll happen to the best of us. To believe otherwise is like driving a car with no airbags because the manufacturer promised that its cars never crash.

In 2017, the normally reliable Amazon Web Services (AWS) experienced a four-hour outage that affected many of the businesses that used AWS as a back-end provider. Four hours may not seem like a long time to restore a system of that size. However, for AWS customers like Netflix, whose site is accessed 24/7, those were four very expensive hours.

So how do you safeguard your organization and the customers who rely on its availability? When you are working with an availability solutions vendor, it's important to establish which system will provide the fastest recovery time. Or better yet, which system will ensure that your customers don't even realize the car has crashed when your server goes down.

The Downtime Prevention Buyer's Guide discusses the six questions you should be asking to prevent downtime, including downtime caused by server failures. One of the questions it recommends posing is, "In the event of a server failure, what is the process to restore applications to normal processing operation and how long does it take?" The guide also compares the levels of downtime that can be expected with specific systems:

“If you rely on standalone servers, your recovery time could range from minutes to days given the high level of human interaction required to restore the applications and data from backup — provided you’ve been backing up your system on a regular basis.
With high availability clusters, processing is interrupted during a server outage and recovery can take from minutes to hours depending on how long it takes to check file integrity, roll back databases, and replay transaction logs once availability is restored. If the cluster was sized correctly during the initial planning stages, users should not experience slower application performance while the faulty server is out of operation; they may, however, need to rerun some transactions using a journal file once normal processing resumes.
Fault-tolerant solutions proactively prevent downtime with fully replicated components that eliminate any single point of failure. Some platforms automatically manage their replicated components, executing all processing in lockstep.
Because replicated components perform the same instructions at the same time, there is zero interruption in processing, even if a component fails. This means that, unlike a standalone server or high availability cluster, the fault-tolerant solution keeps on functioning while any issue is being resolved.”
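To make that comparison concrete, here is a rough sketch that turns the guide's recovery ranges into a cost range per outage. Every number in it is an assumption chosen for illustration, not a figure from the guide: "minutes" is taken as 15 minutes, "days" as two days, "hours" as four hours, and the cost of an hour of downtime is a hypothetical value you would replace with your own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryProfile:
    """Recovery-time range for one availability approach (hours)."""
    name: str
    min_hours: float
    max_hours: float


# ASSUMED ranges, loosely mapped from the guide's wording:
# "minutes to days", "minutes to hours", and "zero interruption".
PROFILES = [
    RecoveryProfile("standalone server", 0.25, 48.0),
    RecoveryProfile("high availability cluster", 0.25, 4.0),
    RecoveryProfile("fault-tolerant system", 0.0, 0.0),
]


def cost_range(profile: RecoveryProfile, cost_per_hour: float) -> tuple[float, float]:
    """Return the (best case, worst case) cost of a single outage."""
    return (profile.min_hours * cost_per_hour,
            profile.max_hours * cost_per_hour)


# HYPOTHETICAL cost of one hour of downtime; substitute your own estimate.
COST_PER_HOUR = 10_000.0

for profile in PROFILES:
    low, high = cost_range(profile, COST_PER_HOUR)
    print(f"{profile.name}: {low:,.0f} to {high:,.0f} per outage")
```

The point of the exercise is less the exact totals than the spread: the wider the recovery range, the harder it is to budget for an outage.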

Download the entire Downtime Prevention Buyer’s Guide and discover the remaining five questions you should be asking to prevent downtime.
