What Exactly is High Availability Anyway?

by Jason Andersen June 11, 2015

So, at Stratus we have been the leaders in reliable computing infrastructure for decades. However, like many terms in technology, High Availability (HA) is defined very broadly. Last year we saw a survey from a highly respected analyst firm in which the majority of respondents thought that High Availability meant having a disaster recovery plan. We have also found that the definition shifts depending on which computing platform people come from (for example, mainframe and DevOps practitioners see it very differently). IDC has a set of Availability Levels it has used for years, but they seem a bit broad, since most of the technologies out there fall into the very broad AL3 category.

So, what is high availability? Here are our definitions, grouped by end-user impact.

Significant End User Impact (Generally measured in hours of downtime – IDC calls this AL1 and AL2)

Unprotected – This is probably the easiest to understand. It is a workload that has no special reliability features implemented at the application, hypervisor or infrastructure layer. If it goes down, it’s down.

Backup – This is a workload that is periodically copied (or snapshotted) to a different node or data center. It is a good compliance measure and can help you recover, provided you can tolerate hours or more of downtime.

Disaster Recovery – This is a more robust form of backup that is automated for quicker recovery in the event of a major failure (for example, human error or a major data center outage due to weather).

Minimal End User Impact (Generally measured in seconds to minutes of downtime – IDC calls this AL3)

Automated High Availability – This is very common in the virtualized world. When there is a failure, a new instance of the workload is redeployed to a new node or data center. A common implementation of this is VMware’s HA feature. This approach has minimal infrastructure impact, but user interruption is fairly high and all in-flight data is lost. It is a good solution for load-balanced, scaled-out applications such as web servers.

Instant High Availability – This is the world of clusters on bare metal, or of redundant instances and replicated storage in the virtualized world. The interruption of service is minimal (sub-second in some cases). However, any in-flight data and/or transactions are lost. If your application is stateless but not load balanced, this is a great solution.

Zero End User Impact (No Downtime – IDC calls this AL4)

Fault Tolerance – This is a capability that was once found only in the mainframe and minicomputer world. However, Stratus makes hardware, software and cloud solutions that provide this level of protection to off-the-shelf operating systems and hypervisors at a price point comparable to that of lower protection levels. Fault tolerance is complete redundancy of the workload, including shared in-flight data and application state. This means operation continues uninterrupted even in the event of a failure.

Multi-Site Fault Tolerance – This is the highest level of protection a workload can get. It provides Fault Tolerance, so there is no loss of state or data, but the redundant workloads are hosted at different sites. Naturally, this type of solution carries a higher network cost, but when only the highest level of protection will do, this is the best option.

Hopefully this helps demystify the types of protection available. When evaluating what you need, consider not only what specifically is being protected, but also the recovery time and the infrastructure costs, mainly processing and networking.


Jason Andersen

Jason Andersen is Vice President of Business Line Management and is responsible for setting the product roadmaps and go-to-market strategies for Stratus Products and Services. Jason has a deep understanding of both on-premises and cloud-based infrastructure and has been responsible for the successful market delivery of products and services for almost 20 years. Prior to joining Stratus in 2013, Jason was Director of Product Line Management at Red Hat. In this role, he was responsible for the go-to-market strategy, product introductions and launches, as well as product marketing for the JBoss Application Products. Jason also previously held Product Management positions at Red Hat and IBM Software Group.
