Director of Avenga Labs
The word “resilience” has become extremely popular recently. Emotional resilience, economic resilience, cyber resilience, etc. are all over the Internet.
Resilience in digital transformation is the key property of entire IT departments, infrastructures and software products. Resilience of solution architecture is critical to building trust with the customers and enables business units to focus on delivering new ideas without worrying about the IT side of things.
From the perspective of IT management and architecture, there are many components and sub-goals involved in meeting and maintaining the resilience requirements. Let’s review the main ones.
Digital resilience is much broader than IT resilience: IT resilience is a critical component of it, but the two are not synonymous. There are many aspects, which I call faces, so let’s review some of them. Please bear in mind that each of the subcategories could be the topic of a thick book.
Networks, servers, data storage systems, end user PCs, IoT devices, and smartphones all communicate with each other, and any of them may fail at any time. So, it is the traditional but never-ending job of IT to keep the fundamentals on which entire digital ecosystems are based up and running.
We cannot just jump over infrastructure resilience and talk about the higher level abstractions.
The key strategy is usually redundancy, real time monitoring and alerting, plus automated corrective actions, to minimize downtimes and disruptions.
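To make the combination of redundancy, monitoring, alerting and corrective action more tangible, here is a minimal sketch. Everything in it is an assumption made for illustration: the `Node` and `Supervisor` names, the health probes (simple functions standing in for real checks) and the corrective action, which here is just taking a failed node out of rotation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """A hypothetical redundant node; `is_healthy` stands in for a real probe."""
    name: str
    is_healthy: Callable[[], bool]


@dataclass
class Supervisor:
    """Checks nodes, records alerts, and keeps only healthy nodes in rotation."""
    nodes: List[Node]
    alerts: List[str] = field(default_factory=list)

    def check(self) -> List[str]:
        """Run one monitoring pass and return the names of nodes kept in rotation."""
        in_rotation = []
        for node in self.nodes:
            if node.is_healthy():
                in_rotation.append(node.name)
            else:
                # Alerting: record the failure so that people are informed.
                # Corrective action: the node is simply left out of rotation.
                self.alerts.append(f"{node.name} failed its health check")
        if not in_rotation:
            self.alerts.append("no healthy nodes left")
        return in_rotation


supervisor = Supervisor([
    Node("primary", lambda: False),  # simulated failure
    Node("replica", lambda: True),   # redundant node that takes over
])
active = supervisor.check()
print(active)
print(supervisor.alerts)
```

In a real setup the probes would be network or application checks and the pass would run on a schedule, but the shape stays the same: detect, alert, and route around the failure.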
It also involves selecting the hardware and software that delivers higher resiliency and working with reliable IT service companies.
Vendors deliver solutions in different models. In the case of software as a service (SaaS), the time to market is the shortest; however, there’s a deep dependency on the external vendor. The usual what-if example is: what if the vendor goes out of business? Business history proves that no one is too big to fail.
But, between those two ends of the spectrum (everything works fine vs. vendor goes out of business) there are many shades of gray.
For instance, the availability of the system or APIs may be much lower than expected and agreed upon. The key tactic here might be to enforce penalties and put some pressure on the vendor to step up their game. But it may not be enough, especially when the vendor is in a dire economic situation and no amount of pressure can change anything.
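Checking availability against what was agreed upon is simple arithmetic. The sketch below assumes a 30-day reporting period and an agreed level of 99.9%; both numbers, as well as the function names, are made up for the example and do not come from any particular contract.

```python
def availability_percent(downtime_minutes: float, period_minutes: float) -> float:
    """Share of the period during which the service was up, in percent."""
    return 100.0 * (1.0 - downtime_minutes / period_minutes)


def allowed_downtime_minutes(agreed_percent: float, period_minutes: float) -> float:
    """Downtime budget implied by the agreed availability level."""
    return period_minutes * (1.0 - agreed_percent / 100.0)


PERIOD_MINUTES = 30 * 24 * 60  # assumed 30-day reporting period
AGREED_PERCENT = 99.9          # assumed agreed availability level

measured = availability_percent(downtime_minutes=216, period_minutes=PERIOD_MINUTES)
budget = allowed_downtime_minutes(AGREED_PERCENT, PERIOD_MINUTES)
sla_breached = measured < AGREED_PERCENT

print(f"measured availability: {measured:.2f}%")
print(f"downtime budget: {budget:.1f} minutes")
print(f"SLA breached: {sla_breached}")
```

With these assumed numbers, 216 minutes of downtime in the period means roughly 99.5% availability, well below the agreed level, while the budget was only about 43 minutes.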
So, it’s good to keep in mind that there’s almost always another API provider or providers with similar capabilities. For instance, e-commerce sites have multiple payment gateways to make sure the clients can choose what they want, but also for a greater resilience. In case one of the providers fails (temporarily or permanently), it won’t mean the end of the customer’s journey.
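The multiple-provider idea can be sketched as a simple fallback loop. The gateways below are hypothetical stand-ins (plain functions, not any real payment API), and `GatewayError` is a name invented for this example.

```python
from typing import Callable, List, Tuple


class GatewayError(Exception):
    """Raised by a (hypothetical) gateway when it cannot process a payment."""


# A gateway is a name plus a function that charges an amount and returns a receipt.
Gateway = Tuple[str, Callable[[int], str]]


def charge_with_fallback(gateways: List[Gateway], amount_cents: int) -> Tuple[str, str]:
    """Try each gateway in order; return (gateway name, receipt) of the first success."""
    errors = []
    for name, charge in gateways:
        try:
            return name, charge(amount_cents)
        except GatewayError as exc:
            # Remember the failure and move on to the next provider.
            errors.append(f"{name}: {exc}")
    raise GatewayError("all gateways failed: " + "; ".join(errors))


def failing_gateway(amount_cents: int) -> str:
    raise GatewayError("service unavailable")  # simulated outage


def working_gateway(amount_cents: int) -> str:
    return f"receipt-{amount_cents}"


used, receipt = charge_with_fallback(
    [("gateway_a", failing_gateway), ("gateway_b", working_gateway)],
    1999,
)
print(used, receipt)
```

A production version would also need to make sure that a payment which failed halfway at one provider is not charged a second time at the next one; that part is deliberately left out of this sketch.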
External vendors with their skills and experiences are of great help in speeding up the innovation processes.
However, the resilience aspect in this context is more focused on the risks associated with external IT services vendors.
One of the key issues is that of the ownership of code and the rights to continue working on the same code base with another vendor or within an internal software house.
Another is the resilience of external vendors as a company and the particular teams which are responsible for delivery of the solution.
It is often optimal to keep architecture teams and testing capabilities in-house, even when they are redundant, so that the vendors do not take over the entire project and the company does not lose sight of the quality of the product.
Redundant storage, clusters of database engines, etc. are the elements of the data ecosystem that support resilience.
But what about SaaS?
Resilience demands the ability to keep working with the data from the SaaS solution in case it fails. IT should treat regular data synchronization as a key requirement when purchasing a SaaS solution, and then design and set it up.
In the worst case scenario, the company can continue working using various workarounds because the business data is still available; work is slower, but it is still moving forward.
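A regular synchronization can be as modest as a scheduled export to storage the company controls. The sketch below is only an illustration: `fetch_records` stands in for whatever export mechanism the SaaS vendor actually offers, and the file layout is an assumption.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List

Record = Dict[str, str]


def sync_snapshot(fetch_records: Callable[[], List[Record]],
                  target_dir: Path, label: str) -> Path:
    """Fetch all records and store them as a local JSON snapshot."""
    records = fetch_records()
    target_dir.mkdir(parents=True, exist_ok=True)
    snapshot = target_dir / f"{label}.json"
    tmp = target_dir / f"{label}.json.tmp"
    # Write to a temporary file first, then swap it in, so that a failed
    # export never leaves a half-written snapshot in place of a good one.
    tmp.write_text(json.dumps(records, indent=2), encoding="utf-8")
    tmp.replace(snapshot)
    return snapshot


def fake_saas_export() -> List[Record]:
    """Hypothetical stand-in for the vendor's export call."""
    return [{"id": "1", "customer": "ACME"}, {"id": "2", "customer": "Globex"}]


backup_dir = Path(tempfile.mkdtemp())
snapshot_path = sync_snapshot(fake_saas_export, backup_dir, "customers")
print(snapshot_path.name)
```

Run on a schedule, something like this keeps a reasonably fresh copy of the business data available even when the SaaS solution itself is not.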
Languages come and go. Some, like Cobol, stay forever. In this case, there are two main strategies. One is to keep our fingers crossed that there will always be someone somewhere who is able to maintain and evolve the solution. The other is to plan and execute a change of the solution for one that does not depend on legacy technology. It is the second approach that increases resilience, as the first one is the classic case of hoping for the best without preparing for the worst.
Even the most promising technologies have been abandoned in the past or have lost virtually all of the developer community’s attention. In order to minimize that risk, it’s always better to bet on the most popular technologies in a given area, as it increases the chances of not being alone when the technology eventually fades away.
Resilience can also be impaired by the overall dependency of the teams on a few individuals. I remember an interesting discussion with a CIO who said that he preferred to be locked in to a particular cloud service vendor rather than locked in internally to a few super DevOps experts working in a silo.
There’s also the key aspect of knowledge management about particular code bases, configurations and deployments. Some IT experts understand that explaining and documenting for others is a natural part of their jobs, while other IT people cannot be forced by anything to do so. This is a part of the resilience effort that either happens or does not happen every day. It’s very important not to overlook all these tiny bits, which may accumulate over time and impair resilience significantly.
Everything is becoming code now, and code requires knowledge management. It’s one of the tactics for lowering the risk of heroes leaving the company. The labor market is so overheated now (summer 2021) that resilience related to human resources is more important than ever.
It is commonly believed that in the long run the cost will be higher for maintaining one’s own infrastructure, updating it, and replacing hardware and software elements, etc., but also that the focus of IT organizations should be on software and data, not on infrastructure.
If one looks only at the cost side, the cloud may be more expensive, but when resilience is added to the equation, the cloud may easily win out.
There are so many things that local IT departments simply don’t have to do and that are done by cloud providers using their skilled teams and tons of automation.
On the other hand, when the cloud is down it sometimes seems that there’s nothing local IT can do. One of the smart tactics is to use cloud native architectures (Kubernetes, Docker), which are a perfect fit for hybrid clouds, balancing the different clouds and local infrastructures. When some of the nodes or entire clusters fail, the traffic will be transferred to a different node or cluster in a different cloud or server room, keeping things running smoothly.
Resilience is one of the reasons that hybrid clouds will stay around for a long time.
Ultimately, enterprises will forgo the idea of building local infrastructures. Then, the solution for resilience is and will be to use the resilience features of the primary cloud provider (different zones, redundancy, backups, monitoring, etc.). If that is not enough, another cloud provider will act as redundancy for the primary.
Resilience “likes” standards, predictability and uniformity. It sounds like the opposite of innovation and disruption, but it’s not. The groundwork for resilience has to be done in a structured way. It also helps with the automation of most of the typical scenarios and it lets people focus on exploration and innovation, including in the resilience space.
Resilience is a combination of response times when something goes wrong and the proactive analysis of the IT solution.
Testing, as part of a CI/CD pipeline, is the norm nowadays, but there’s always the question of the accuracy and relevance of the current test set and data.
And not everything can be tested; the cost of testing everything before release would stop the entire digital revolution. So proper observability, which we addressed in another article, seems to be a must to enable proactive maintenance and the safe evolution of software.
Evolutionary architecture is one of the key modern strategies to ensure the resilience of solutions. Its concept of fitness functions can be used to measure the key quality attributes of the solution and to plan the implementation of resilience improvements.
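A fitness function can be as simple as an automated check of a measured quality attribute against a target. The sketch below is one possible shape, not a standard API: the metric names, the thresholds and the `FitnessFunction` class are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass
class FitnessFunction:
    """A named, automated check of one quality attribute."""
    name: str
    check: Callable[[Metrics], bool]


def evaluate(functions: List[FitnessFunction], metrics: Metrics) -> Dict[str, bool]:
    """Run every fitness function against the measured metrics."""
    return {f.name: f.check(metrics) for f in functions}


# Hypothetical resilience targets; real ones come from the agreed requirements.
FITNESS = [
    FitnessFunction("availability >= 99.9%", lambda m: m["availability_percent"] >= 99.9),
    FitnessFunction("p95 latency <= 300 ms", lambda m: m["p95_latency_ms"] <= 300),
    FitnessFunction("recovery time <= 15 min", lambda m: m["recovery_minutes"] <= 15),
]

# Made-up measurements, e.g. taken from monitoring over the last period.
metrics = {"availability_percent": 99.95, "p95_latency_ms": 420.0, "recovery_minutes": 12.0}

results = evaluate(FITNESS, metrics)
failed = [name for name, ok in results.items() if not ok]
print(failed)
```

The failed checks are a natural input for planning: they show which quality attribute to improve next, and re-running the same functions after each change shows whether the architecture is moving in the right direction.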
Yes, I’ve just scratched the surface of the resilience of digital solutions. Each of the points addressed briefly above could benefit from an entire series of articles or a book.
All the elements of business features come at a price, which is much higher than the direct cost of design and implementation. And, this is just the visible part.
Resilience limits some options, both from the functional and technological angles. It requires additional time and money to make sure the solution will be manageable and future-proof.
“Move fast and break things” used to be a motto for many startups and experimental projects. Not taking care of the resilience of the solution or the entire strategy will backfire. It’s just a question of when and how.
Your perfect UX and great product idea can be destroyed in days by application bugs, outages, and data leaks.
Customers, in general, are already tired of the “beta culture” where everything is in an unfinished and unreliable state; they no longer accept it.
The old truth of IT projects is that only a few people will know about the additional features that were removed in the last phase, but everyone will notice the bugs, outages and data leaks that appear when those features are prioritized too highly.
Still, very often the priority is just on the time and feature set, at the expense of the solution’s resilience.
The Minimum Viable Product (MVP), in its truest form, is a great idea that delivers digital products faster and makes it possible to test different business ideas while users already benefit from the features that are available.
However, like many great ideas before and after, it was spoiled and MVPs were turned into unstable, unworkable and functionally limited prototypes. Usually, the nice UI layer struggled to cover the critical bugs swarming under its beautiful silky skin.
Therefore, it’s much better to embrace the idea of a Minimum Likeable Product, while remembering that it’s hard to like something that does not work and generates frustration. No paint job, however great, can cover up a broken engine in the digital journey of the user.
Resilience is something that is often expected to be there without expressing any requirements in the area. It’s the “obvious” job of IT professionals to make sure it will all work, and that there will be just a few bugs and issues facing our employees and customers.
The problem is that resilience does not bring as visible a value as the user experience. The UX is more visible, can be felt, is understood, and it has colors and shapes.
Resilience is something that mostly takes place in the background; it is done without much noise and does not attract much attention and… budget.
Until … the solution fails.
It’s recommended to define and deliver the expected level of resilience, and all the participating stakeholders and their teams should be aware of which resilience mechanisms are planned and which have already been implemented.
Make resilience a visible and well defined set of requirements, as early as possible.
Resilience of digital solutions is in fact a new name for a group of non-functional requirements, many of which are decades old.
But, it’s much more than a new name and grouping. Solution resilience, being so high on the list of expectations, is a sign that it has been recognized as a key component of business resilience and business efficiency.
There are new strategies, techniques and tools which enable IT to deliver on the promise of resilience. The new flexible architectures, increasing demand for higher velocity of the teams, and demands for new solutions, all ensure that digitalization programs move forward faster.
Resilience requirements do not live in isolation and cannot be treated as the end goal, because they are part of a proactive IT risk management strategy. The key is to find the right balance between experimentation and innovation vs. resilience that builds trust.
Is there a cookbook for that? There are so many, but each project, product and organization has a different context.
Let’s take a look at it together with Avenga.