How Systems Complexity Reduces Uptime



Source: Gartner Blog Network


The pace of change in IT is unlike any other discipline. Even dog years are long by comparison. It’s an environment that makes us so primed for what comes next that we often don’t pause to ask why. This is very much the case with cloud computing, where cloud is the sine qua non of any CIO’s strategy and leads to vague approaches such as ‘Cloud First’.

This phenomenon also applies to application architecture, where microservices and composability are all the rage. The underlying rationale makes sense: break complex applications into function-specific components and assemble the pieces you need. After all, this is what software engineers do; they create and implement libraries of functionality that can be assembled in almost limitless ways. The result is an integrated collection of components that is more elegant than the monolithic applications it replaces.

Technical elegance, though, isn’t always better. To illustrate why, we need to turn to probability – not the complex stuff like Bayesian decision theory or even Poisson distributions. I’m talking about availability, which in IT means the percentage of time a system is able to perform its function. A system that is 99.9% (‘three-nines’) available can perform its function all but 0.1% of the time. Pretty good, right? Of course, the answer is: it depends. Some applications are fine with three-nines. For others, four-nines, five-nines, and even higher are more appropriate. How would you feel, for instance, about boarding an airplane if there were a 1-in-1,000 chance of “downtime”?
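As a quick sanity check, a few lines of Python translate each ‘nines’ level into its monthly downtime budget (a sketch; the figures assume an average month of about 30.44 days):

```python
# Monthly downtime implied by each "nines" availability level.
MINUTES_PER_MONTH = 43_829  # 365.25 days / 12, in minutes

for nines in (2, 3, 4, 5):
    availability = 1 - 10 ** -nines          # e.g. 3 nines -> 99.9%
    downtime_min = (1 - availability) * MINUTES_PER_MONTH
    print(f"{nines}-nines: {availability:.{nines}%} available, "
          f"~{downtime_min:.1f} min/month potential downtime")
```

Three-nines works out to roughly 44 minutes of potential downtime per month; each additional nine cuts that budget by a factor of ten.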
From Component to Systems Availability
The aggregate availability of a system is the product of the availabilities of each component. For example, the availability of a system with three interdependent components, each having 99.9% availability, is 99.9% x 99.9% x 99.9% = 99.7%. The following figure illustrates the availability of a system with multiple components, each with identical availability. The number of system components is shown on the x-axis, and the aggregate system availability is shown on the y-axis (plotted on a logarithmic scale). The maximum potential monthly downtime for a given availability level is included on the second y-axis.
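The product rule is easy to verify in a few lines of Python (a sketch, using the same 99.9% component availability as the example):

```python
# Exact aggregate availability of serially dependent components:
# the product of each component's availability.
def system_availability(component_availabilities):
    result = 1.0
    for a in component_availabilities:
        result *= a
    return result

MINUTES_PER_MONTH = 43_829  # 365.25 days / 12, in minutes

for n in (1, 3, 10):
    a_sys = system_availability([0.999] * n)
    downtime = (1 - a_sys) * MINUTES_PER_MONTH
    print(f"{n:2d} components: {a_sys:.4%} available, "
          f"~{downtime:.0f} min/month potential downtime")
```

Ten three-nines components in series land at roughly 99.0% aggregate availability, which is the two-nines result discussed below.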

For component availabilities greater than ~99%, an even simpler linear approximation works. Define:

Ac = component availability
Uc = component unavailability
Us = system unavailability
As = system availability
Nc = number of components

Uc = 1 – Ac           [example: Uc = 1 – 99.9% = 0.1%]
Us ≈ Nc · Uc          [example: Us ≈ 10 · 0.1% = 1.0%]
As = 1 – Us           [example: As ≈ 1 – 1.0% = 99.0%]
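The linear approximation can be checked against the exact product in code (a sketch, assuming independent components as above):

```python
# Linear approximation for high component availabilities (Ac > ~99%):
# system unavailability is roughly the sum of component unavailabilities.
def approx_system_availability(a_component, n_components):
    u_component = 1 - a_component           # Uc = 1 - Ac
    u_system = n_components * u_component   # Us ~= Nc * Uc
    return 1 - u_system                     # As = 1 - Us

exact = 0.999 ** 10
approx = approx_system_availability(0.999, 10)
print(f"exact: {exact:.4%}, approximation: {approx:.4%}")
```

The approximation is slightly pessimistic (99.00% vs. an exact 99.0045% for ten three-nines components), which makes it a safe back-of-the-envelope tool.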

The availability of a system with ten components, each having three-nines availability, is reduced to two-nines, increasing potential system downtime from 44 minutes per month to 7 hours 18 minutes per month.

Implications for product managers, solutions architects, and CTOs
The point of this article is not to condemn composable applications; it’s to encourage thoughtful, intentional design. Just because everyone seems to be doing something doesn’t mean you must. In the case of application and service architecture, this means viewing the system holistically. In support, I offer the following:

Occam’s razor: entities should not be multiplied unnecessarily

Albert Einstein: Everything should be made as simple as possible, but no simpler

KISS principle: Keep It Simple …
Key Takeaways

Design for component failure and to minimize its impact.
Make components decoupled and asynchronous whenever possible so that loss of one component does not cascade to others.
Make critical components redundant and automatically scalable.
Avoid stateful components whenever possible.
Understand service interdependencies and failure modes.
Eliminate unnecessary complexity (KISS).
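On the redundancy point: assuming independent failures, a redundant set is down only when every replica is down at once, so availability compounds in your favor rather than against you. A minimal sketch:

```python
# Availability of N redundant (parallel) replicas, assuming independent
# failures: the system is unavailable only when all replicas are down.
def redundant_availability(a_component, n_replicas):
    return 1 - (1 - a_component) ** n_replicas

# A single 99.9% component vs. two and three independent replicas:
for n in (1, 2, 3):
    print(f"{n} replica(s): {redundant_availability(0.999, n):.6%}")
```

Two independent 99.9% replicas yield roughly 99.9999% availability: the same multiplication that erodes availability in a serial chain restores it when components are placed in parallel.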

About the author: CIO Minute