August 18, 2017

Design for Failure

“Design for Failure” describes an approach that assumes a hardware or system failure will occur somewhere, at some time. Instead of architecting for availability through hardware and server clustering, applications are designed so that recovery can be performed quickly.

The previous approach was to place high-quality hardware in a managed and maintained datacentre environment, with multiple layers of redundancy (disks, power supplies and UPS, AC, power and network cabling, redundant and clustered servers) and multi-layer monitoring. The Cloud approach is instead to ensure that applications handle interruptions to service cleanly and recover to continue serving clients.
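As a rough illustration of handling an interruption cleanly and recovering, the sketch below retries a failing call with exponential backoff. The names call_with_retry and FlakyService, the retry count and the delays are all assumptions made for this example, not part of any particular cloud provider's tooling.

```python
import time


def call_with_retry(operation, attempts=4, base_delay=0.1, sleep=time.sleep):
    """Run `operation`, retrying on ConnectionError with exponential backoff.

    All names and defaults here are illustrative assumptions.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            sleep(base_delay * (2 ** attempt))  # delay doubles after each failure


class FlakyService:
    """Hypothetical dependency that fails a fixed number of times, then recovers."""

    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success
        self.calls = 0

    def fetch(self):
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("simulated interruption")
        return "ok"
```

Calling call_with_retry(FlakyService(2).fetch) would absorb the two simulated interruptions and return the result of the third call. The sleep function is passed in as a parameter so that the waiting behaviour can be replaced in tests.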

The Amazon model is the “design for failure” model. Under the “design for failure” model, combinations of your software and management tools take responsibility for application availability. The actual infrastructure availability is entirely irrelevant to your application availability. 100% uptime should be achievable even when your cloud provider has a massive, data-center-wide outage.

The advantage of the “design for failure” model is that the application developer has total control of their availability with only their data model and volume imposing geographical limitations. The downside of the “design for failure” model is that you must “design for failure” up front.

When modernising applications to take advantage of Cloud, the application may need to be re-designed to tolerate failures in the underlying Cloud subsystems. This means:

  • Each application component must be deployed across redundant cloud components, ideally with minimal or no common points of failure
  • Each application component should be partition tolerant: it should be able to survive network latency (or loss of communication) among the nodes that support that component without freezing, crashing or hanging (store/cache and recover)
  • Distribution of components across availability zones will be required to ensure availability of components
  • However, where latency is an issue and components depend on each other, plan to locate those resources close together
  • Instead of building a completely separate architecture for disaster recovery (DR), create a Production system distributed across availability zones (AZs), leveraging the inherent cloud capability where available
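The points above on zone distribution and "store/cache and recover" can be sketched roughly as follows. This is a minimal illustration, not a production pattern: ResilientReader, ZoneUnavailable, the zone names and the healthy/down helpers are all hypothetical, and a real system would also need cache expiry and health checks.

```python
class ZoneUnavailable(Exception):
    """Raised by a replica when its (hypothetical) zone cannot be reached."""


class ResilientReader:
    """Reads a key from replicas in several zones, falling back to a local cache."""

    def __init__(self, replicas):
        # replicas: dict mapping a zone name to a callable fetch(key) -> value
        self.replicas = replicas
        self.cache = {}

    def read(self, key):
        for zone, fetch in self.replicas.items():
            try:
                value = fetch(key)
            except ZoneUnavailable:
                continue  # this zone is down: try the next one
            self.cache[key] = value  # store the last good value for later recovery
            return value, zone
        if key in self.cache:
            return self.cache[key], "cache"  # every zone failed: serve stale data
        raise ZoneUnavailable(f"no zone or cached value available for {key!r}")


def healthy(key):
    """Stand-in for a replica in a working zone."""
    return f"value-of-{key}"


def down(key):
    """Stand-in for a replica in a failed zone."""
    raise ZoneUnavailable("simulated zone outage")
```

With replicas {"zone-a": down, "zone-b": healthy}, a read is served from zone-b. If both zones later fail, a key that was read earlier is still served from the cache, while a key that was never read raises ZoneUnavailable rather than hanging.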

Do you want failure?

A misinterpretation of “design for failure” is that you actually want to fail. The point is rather that you design with the expectation that there could be a failure of any component, at any time, and that you ensure your application can cope with it. You are not gold-plating every component; instead, you are ensuring that there is always a consideration of what will happen when one fails.
