We often repeat the words resilience and adaptability in the context of entire organizations, business models and whole economies.
We addressed this in our predictions for 2020 and in the article highlighting trends and technology for 2021. There’s a consensus that resilience is a top priority for business, whether we are talking about the financial sector, pharma and healthcare, or any other industry. And given how deeply digital organizations have become, the resilience of IT systems and architectures has a very direct impact on the resilience of the business itself.
Clients and business partners want to access their digital products and services whenever they choose; they expect their needs to be fulfilled quickly, easily, and without incidents. If a service breaks too often, they may lose patience and switch to a competitor, and switching is easier now than ever.
As cloud transformation accelerates, hybrid clouds and evolutionary architectures are becoming the new norm for enterprises. They bring great flexibility and scalability, but at the price of additional and often hidden complexity.
“Everything fails, all the time.” This famous quote from Werner Vogels, the CTO of Amazon Web Services, tells the truth about the reliability of anything in IT. We can hope for the best, but we should never ignore the nature of complex systems and IT infrastructures. They do fail, and nothing is going to stop that.
We cannot fully rely on any single device, any single microservice or cloud function, nor any single data store. Redundancy and automatic rerouting of API calls in the case of failure are now the de facto standard (not always implemented, but the tools and techniques are already there).
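As a minimal sketch of what automatic rerouting can look like at the application level: try a request against one replica, and on failure retry and then fall over to the next one. The endpoint names and the `request_fn` callback are illustrative assumptions, not a real API.

```python
# Hypothetical list of redundant service endpoints; the names are made up.
REPLICAS = [
    "https://api-eu-1.example.com",
    "https://api-eu-2.example.com",
    "https://api-us-1.example.com",
]

class AllReplicasFailed(Exception):
    """Raised when every replica has been tried without success."""

def call_with_failover(request_fn, replicas, retries_per_replica=2):
    """Try each replica in turn; on a transient failure, retry,
    then reroute the call to the next replica in the list."""
    for endpoint in replicas:
        for _ in range(retries_per_replica):
            try:
                return request_fn(endpoint)
            except ConnectionError:
                continue  # transient failure: retry, then move on
    raise AllReplicasFailed("no replica answered")
```

In production this logic usually lives in a service mesh or a load balancer rather than in application code; the sketch only shows the principle.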
The key question is: what are the weak spots of the entire architecture and infrastructure, and how do we find them quickly? Everything seems to be taken care of, performance is great, latency is low, and scalability seems more than sufficient, so what else can we do?
No, there’s probably no movie about this problem available on Netflix, but Netflix’s engineers are considered the first to popularize the term and the practice behind it.
They came up with the idea of chaos engineering around 2011. The business goal was to provide a smooth and reliable user experience by avoiding major failures and outages of the service. If one percent of users cannot access the service for one hour, that’s bad indeed, but it’s nothing compared to a four-day outage for all users. By the way, such an outage did happen, so better resilience immediately became a must-have.
The idea is to test a system with thousands of small failures in order to detect weaknesses and single points of failure, then adapt the system and thus make it more resilient. Of course, this is a process, not a one-time activity. With each iteration, digital products become more and more resilient.
Embracing the randomness of failures is the key to better resilience. Each time a small failure brings down an entire digital service, another fix is designed and implemented to lower the impact of that particular type of failure. Again, failures cannot be avoided entirely, but chaos engineering helps minimize the consequences of a single failure for the entire system.
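The iteration described above can be sketched as a single experiment step: check that the system is healthy, inject a small failure, see whether the system stays healthy, and always roll the failure back. The callback names below are placeholders for whatever checks and injections a real setup would use.

```python
def run_chaos_experiment(steady_state, inject_failure, rollback):
    """One iteration of a chaos experiment:
    1. verify the system is healthy before doing anything,
    2. inject one small failure,
    3. check whether the steady state survived it,
    4. always roll the failure back afterwards."""
    if not steady_state():
        return "aborted: system unhealthy before the experiment"
    inject_failure()
    try:
        return "resilient" if steady_state() else "weakness found"
    finally:
        rollback()
```

Every "weakness found" result feeds the next fix, which is exactly the loop the article describes.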
The goal is to avoid a major catastrophe, such as client-facing services being unavailable for hours or days.
As an example, let’s briefly analyze the recent failure of Google Workspace, which was offline for several hours and was noticed all around the world. The cause was a failure of the authentication service, on which many other services strongly depend, as they all require authentication. It probably means this type of weakness wasn’t detected or addressed properly by the Google cloud teams.
But surely, thousands of other failures had already been detected, and the holes plugged, across various cloud services before they could turn into major outages visible to customers.
Let’s take a look at how chaos engineering is done. It cannot, however, be done without the right set of … chaos monkeys. Actually, it’s virtually impossible to write an article about chaos engineering without mentioning the famous monkeys.
A chaos monkey is an automated failure simulation that is expected to break the system. Most monkeys today are implemented in software, but there are hardware monkeys as well. Different kinds of monkeys exist for different failure simulations.
For instance, they may shut down an entire cloud availability zone, shut down any of the microservice instances, disconnect a database instance, or flood the service with millions of malformed requests in order to kill performance and responsiveness. They are aggressive, they are random, they are not afraid of consequences, and they will break things.
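A minimal kill monkey boils down to picking a random victim and terminating it. In the sketch below, the instance names, the protected list, and the `kubectl delete pod` command are all assumptions for illustration; a real monkey would target whatever orchestrator the system runs on.

```python
import random
import subprocess

def pick_victim(instances, protected=()):
    """Randomly choose one running instance to terminate,
    skipping anything on the protected list."""
    candidates = [i for i in instances if i not in protected]
    if not candidates:
        raise RuntimeError("nothing left to break")
    return random.choice(candidates)

def kill_monkey(instances, protected=()):
    victim = pick_victim(instances, protected)
    # Illustrative only: in a Kubernetes cluster this might be
    # `kubectl delete pod <victim>`; the command is an assumption.
    subprocess.run(["kubectl", "delete", "pod", victim], check=False)
    return victim
```

The point is the randomness: nobody, including the monkey’s authors, knows in advance which instance dies next.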
Another kind, the latency monkey, introduces delays into the communication between services. Latency is critical for the user experience, so it’s very important to make the system resilient to it. For example, something that normally took 20 milliseconds now takes 20 seconds. How will the other services behave? What will happen to the user experience?
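A latency monkey can be as simple as a wrapper that randomly delays a fraction of calls, simulating a slow downstream service. The delay bounds and probability below are arbitrary example values.

```python
import functools
import random
import time

def latency_monkey(min_delay=0.05, max_delay=0.2, probability=0.3):
    """Decorator that randomly delays some calls to the wrapped
    function, simulating a slow downstream dependency."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < probability:
                time.sleep(random.uniform(min_delay, max_delay))
            return fn(*args, **kwargs)
        return wrapper
    return decorate
```

Wrapping a service client with this decorator quickly reveals which callers have sane timeouts and which ones silently hang the user experience.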
In the case of very large systems, entire chaos monkey armies (many monkeys of different types) are unleashed to find out when and where the digital product will break.
There are also monkeys specific to browser or mobile front ends, which tap, click, and type in irregular and unexpected ways that weren’t anticipated by front-end developers. They help detect many problems, from a simple lack of length constraints on form fields to holes in validation routines. They may even crash the browser by exposing poor memory management in JavaScript code that would not be caught by traditional tests.
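The "typing" half of such a monkey is essentially an input fuzzer. As a rough sketch, the generator below mixes a few deliberately nasty inputs (empty, oversized, markup, control characters) with random strings; the specific payloads are illustrative assumptions, and a real front-end monkey would feed them through a browser-automation tool.

```python
import random
import string

def fuzz_inputs(n=100, max_len=5000):
    """Generate irregular form inputs a front-end monkey might type:
    empty values, very long strings, markup, control characters, emoji."""
    nasty = [
        "",                              # empty field
        " " * 100,                       # whitespace only
        "<script>alert(1)</script>",     # markup injection attempt
        "0" * max_len,                   # length-constraint check
        "\u202e\u0000\n\t",              # control and direction characters
        "🙈" * 50,                        # multi-byte characters
        "'; DROP TABLE users;--",        # classic injection string
    ]
    alphabet = string.printable + "éüß漢字"
    random_ones = [
        "".join(random.choice(alphabet)
                for _ in range(random.randint(1, max_len)))
        for _ in range(n - len(nasty))
    ]
    return nasty + random_ones
```

Feeding every generated string into a form field and asserting that the page neither crashes nor accepts invalid data is the whole test.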
For almost all of the popular technology stacks used to build business applications, there are dedicated chaos engineering tools which integrate better with their respective technology ecosystems.
In our modern times, with everything as a service, chaos engineering tools are also offered as a service. So you can borrow monkeys and pay as you (they) go.
One of the practical problems with chaos engineering is that the idea is very hot and popular, but still not very mature. For architects and engineers this means there are tons of different tools with overlapping functionalities, many of them with uncertain futures and support lifecycles. It makes it all the more difficult to select the right tools (way too many choices), and there seems to be, ironically, chaos in the chaos engineering tools landscape. These tools depend heavily on monitoring and tracing ecosystems, so they have to be selected carefully to make sure their dependencies are met.
Before we answer that question, let’s get rid of one of the key misconceptions.
Chaos engineering, despite its name, is very bad for unstable and untested systems, and there’s no workaround: chaos engineering is not a replacement for automated functional and integration testing. It is designed to help already tested and stable systems find previously undiscovered weaknesses, not typical functional bugs.
There are more prerequisites for chaos engineering such as microservice architectures, mature DevOps practices, advanced CI/CD pipelines, test automation, mature tracing, as well as monitoring and observability.
When any of these are not done properly, chaos engineering will be an annoyance and wasted effort.
However, when these conditions are met, chaos engineering can be a very good practice for your distributed systems, including microservice-based and serverless systems, especially when the number of services and processing nodes is high. And since system complexity keeps growing, even if it seems like overkill today, it may be a good idea in the near future.
This is definitely a journey which includes multiple steps and numerous things to be done. We at Avenga, with our broad set of services and extensive experience, are here to help.