Director of Avenga Labs
Businesses demand that software solutions are resilient, which means at least stability and a low number of errors.
On the other hand, they want even more flexibility and development velocity. That means the ability to change different parts of the system and to use readily available APIs, both internally and externally.
These two families of requirements contradict each other, so solution architects have to keep balancing them in order to find the right solution and keep it at an adequate level as requirements and environments change.
Sometimes the resulting architectural complexity is a painful side effect of overdesign and of future-proofing against unknowns, which creates artificial and unnecessary complexity.
No matter if the complexity was real or self-inflicted, the systems nowadays are almost always distributed and they are harder to maintain and control using traditional mechanisms.
One of the solutions that is expected to help to restore and maintain control of the service-based microservices and serverless systems of today is observability.
It’s one of the most popular trends now and virtually everyone supports it; however, opinions differ about how to achieve it. And no one in IT questions the need for observability.
Let’s start with the goals.
The word observability itself was borrowed from automation (control) theory. The goals are similar, but the implementation is naturally different in the software development world.
The goal is to know in real time what is happening in the thousands of API components collaborating with each other as they deliver business functionality.
However, being able to track it and measure it is not the goal in itself, of course.
Detecting and fixing bugs faster is the most basic goal and sometimes the most important, especially when the system suffers from lower than expected quality or when we are faced with multiple external dependencies that we cannot control.
Bugs directly affect the user experience (and customer satisfaction as the immediate result), operational efficiency, and the image of the company. Nobody wants bugs. And even when we accept their inevitability, the time when users are faced with broken features should be as little as possible.
In the case of an old type of system with layered architecture (UI, logic and database), it was relatively easy to track which component failed and what was the cause of the bug. The log locations were well known and everything was under control, at least compared to what is happening now.
With hundreds or thousands of APIs distributed across many machines and even within different clouds from different vendors, it has become much more difficult to find the failing component, call context and user activity flow.
Observability is expected to help with that.
Observability is also helpful when planning infrastructure, as it delivers answers about which API services are used more extensively and require higher scalability, and those which are rarely used, as well as those that are not used at all.
We’ve touched on this subject already in our feature toggles article, so check it out.
Observability can help with feature management by providing valuable information about the usage patterns of different APIs, even different versions of those APIs. It can identify which APIs are more popular and which are candidates for deprecation.
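As a minimal sketch of this idea, the snippet below counts calls per endpoint and version and lists the deployed APIs that are barely used. The call records, the `deployed` catalogue and the `min_calls` cut-off are all hypothetical; in practice the counts would come from whatever observability backend is in place.

```python
from collections import Counter

# Hypothetical call records: one (endpoint, API version) pair per observed call.
calls = [
    ("/orders", "v2"), ("/orders", "v2"), ("/orders", "v1"),
    ("/profile", "v1"), ("/orders", "v2"),
]
# Hypothetical catalogue of every endpoint version that is currently deployed.
deployed = {
    ("/orders", "v1"), ("/orders", "v2"),
    ("/profile", "v1"), ("/reports", "v1"),
}

usage = Counter(calls)


def deprecation_candidates(usage, deployed, min_calls=2):
    """Return deployed (endpoint, version) pairs called fewer than min_calls times."""
    return sorted(key for key in deployed if usage[key] < min_calls)


candidates = deprecation_candidates(usage, deployed)
for endpoint, version in candidates:
    print(f"{endpoint} {version}: {usage[(endpoint, version)]} calls")
```

Low usage alone is only a hint; an endpoint that is called once a quarter by a critical batch job is not a deprecation candidate.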
What is visible on the surface (UI mostly) is the result of multiple cross component communications which took place in the lower layers of the system. Observability can help to detect failures and quality issues which are not visible. It enables IT teams to fix those invisible issues before they surface and become a significant problem for system availability and the user experience.
This kind of detection is often combined with chaos engineering, which generates unusual traffic patterns to detect weaknesses in system design and infrastructure.
There are discussions about what the key components of observability are, but the recurring theme is that it should contain at least three elements: metrics, logs and tracing. Observability is more than the sum of its components, but let’s take a closer look at the components first.
Metrics measure different quality attributes of the system.
Some measure duration, for instance average API call times: the time it takes to interact with the application web page, how long it takes to retrieve user profile information from the database, etc.
Others measure the frequency of events, for example the number of API errors per hour or the number of container crashes and restarts.
Metrics are used to verify that everything is within the defined thresholds, but also to react early to performance and stability degradation before it becomes critical and endangers the user’s experience and thus business performance itself.
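To make the threshold idea concrete, here is a small sketch that derives two metrics from raw samples and reports which ones breach their limits. The sample values, metric names and thresholds are invented for illustration only.

```python
from statistics import mean

# Hypothetical samples collected during the last hour.
latencies_ms = [120, 95, 110, 480, 105]
errors_last_hour = 7
calls_last_hour = 1000

# Hypothetical thresholds agreed with the business.
thresholds = {"avg_latency_ms": 200.0, "error_rate": 0.005}


def evaluate(latencies_ms, errors, calls, thresholds):
    """Compute the metrics and return them with the names of breached thresholds."""
    metrics = {
        "avg_latency_ms": mean(latencies_ms),
        "error_rate": errors / calls,
    }
    breaches = [name for name, value in metrics.items() if value > thresholds[name]]
    return metrics, breaches


metrics, breaches = evaluate(latencies_ms, errors_last_hour, calls_last_hour, thresholds)
print(metrics, breaches)
```

Note that the average hides the single 480 ms outlier; percentiles are usually a better early warning signal than averages.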
Logs are almost as old as business applications and operating systems. Selected events are written down to log files or other structures with their timestamp, source, and type (information, warning, error).
The verbosity and type of events should be carefully selected to be both helpful and not overwhelming.
In the case of observability, aggregated logging is what is needed, which means combining multiple individual logs into centralized and correlated logs.
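A rough sketch of what aggregation and correlation mean in practice: per-service logs are merged into one time-ordered stream, and a shared request identifier links the entries that belong to the same call. The log tuples and the `req-…` identifiers are made up, and a real setup would rely on a log shipping and indexing stack rather than in-memory lists.

```python
import heapq

# Hypothetical per-service logs, each already sorted by timestamp.
# Entry layout: (timestamp, service, request_id, level, message)
gateway_log = [
    (1.0, "gateway", "req-1", "INFO", "request received"),
    (4.0, "gateway", "req-1", "ERROR", "upstream returned 500"),
]
orders_log = [
    (2.0, "orders", "req-1", "INFO", "loading order"),
    (2.5, "orders", "req-2", "INFO", "loading order"),
    (3.0, "orders", "req-1", "ERROR", "database timeout"),
]


def aggregate(*logs):
    """Merge individually sorted logs into a single time-ordered log."""
    return list(heapq.merge(*logs))


def correlate(entries, request_id):
    """Keep only the entries that belong to one request."""
    return [entry for entry in entries if entry[2] == request_id]


central_log = aggregate(gateway_log, orders_log)
for entry in correlate(central_log, "req-1"):
    print(entry)
```

Read in order, the correlated entries show that the gateway error was caused by a database timeout in the orders service, which is not visible from either log alone.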
Tracing is about following the flow of API calls, which can be triggered by other APIs, external events and, of course, by user interaction.
We covered this in more detail when the OpenTracing standard was analyzed by Avenga Labs. To summarize, tracing requires building instrumentation into application components as a cross-cutting aspect, but in return it provides a powerful tool for tracing component interactions while preserving the context.
All these elements can be transformed into fitness functions which measure the quality and trends of entire architectures. This is the DNA of evolutionary architecture which enables architects to both react to architecture quality degradation and plan systematic improvements and optimizations.
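One possible shape of such a fitness function is sketched below: it passes only if the latest latency figure stays within an absolute budget and has not regressed too much compared to the previous release. The function name, the budget and the 10% regression limit are assumptions, not values from any standard.

```python
def latency_fitness(history_ms, budget_ms=300.0, max_regression=0.10):
    """Return True if the latest latency is within budget and not regressing.

    history_ms holds one latency figure per release, oldest first
    (at least two entries are expected).
    """
    previous, latest = history_ms[-2], history_ms[-1]
    within_budget = latest <= budget_ms
    regression = (latest - previous) / previous
    return within_budget and regression <= max_regression


print(latency_fitness([200.0, 210.0]))  # small change, within budget
print(latency_fitness([200.0, 250.0]))  # 25% slower than the previous release
```

Run as part of a delivery pipeline, a check like this turns an architectural quality attribute into something that can fail a build.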
When there’s a lot of data, the opportunity for valuable insights arises. The tools usually have some features built in like aggregations or simple correlations, but the vast amount of data almost begs to be processed using modern data science techniques.
The process around the tools often seems to be underappreciated.
The tools are there, both open source and commercially available, and both can be installed locally or as part of the cloud environments provided. We tend to hear the question why tool A and why not tool B. Selecting the tools, setting them up, and preparing pipelines and applications seems to steal all the focus.
But, then comes the (too often forgotten) moment – “So what?”
Who or what is going to react to all this observability data, and when and how? How do we check whether anything or anybody is even noticing all those alerts and trends generated by the tools?
With so much data gathered and dataset sizes increasing all the time, it is a challenge to understand what is going on, what information can be discarded as not important, and which requires more or less immediate attention.
There’s a growing hope that AI, and machine learning in particular, will help to deal with the ocean of data, automatically generate alerts and fill out bug reports for humans to fix the problems, and perform preventive actions to avoid incoming problems. For this, there are already solutions that analyze the data and use classification algorithms to determine the severity and importance, so as not to bother people too much on one hand and not to miss important information on the other.
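As a deliberately simple stand-in for the classification algorithms mentioned above, the sketch below grades a new measurement by how far it deviates from recent history (a z-score rule, not a trained model). The severity labels and cut-offs are assumptions.

```python
from statistics import mean, pstdev


def classify(history, value, warn_z=2.0, critical_z=4.0):
    """Grade a value as info, warning or critical by its distance from history."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all is suspicious.
        return "info" if value == mu else "critical"
    z = abs(value - mu) / sigma
    if z >= critical_z:
        return "critical"
    if z >= warn_z:
        return "warning"
    return "info"


# Hypothetical number of API errors per hour over the last eight hours.
errors_per_hour = [10, 12, 11, 9, 10, 12, 11, 9]
for value in (11, 13, 40):
    print(value, classify(errors_per_hour, value))
```

Only the "critical" results would page a human; the rest can be batched into a report, which is the balance between noise and missed signals described above.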
The process always has to be complete, because the tools and the gathered data are not an end goal in themselves.
Designing and implementing an observability framework that covers complex digital products is time-consuming and costly.
The question that should always be asked is about return on investment (ROI). It is something that agile fanatics wanted us to forget, but it doesn’t want to go away.
How much more efficient will the bug detection process become? Can we reduce direct maintenance costs? How much does it cost to maintain an observability framework?
All this instrumentation consumes network, CPU, memory and disk storage resources. Even with proper management of logs, additional resources have to be provisioned and administered to make room for observability runtimes and data.
Logs and traces may contain personally identifiable information, which is not acceptable in the majority of real-world cases. If not taken care of, it may result in a serious compliance and privacy violation.
There are tools that can be used to detect this type of information and they are a nice addition to the process, which is supposed to ensure this kind of information is not saved in the logs in the first place.
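Keeping such data out of the logs in the first place can be sketched with a filter from Python's standard `logging` module. The e-mail pattern below is a simplified assumption and covers only one kind of personal data; real deployments need a much broader set of rules.

```python
import logging
import re

# Simplified e-mail pattern, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


class RedactEmails(logging.Filter):
    """Replace e-mail addresses in a log record before any handler sees it."""

    def filter(self, record):
        # Format the message first so that values passed as arguments are covered too.
        record.msg = EMAIL.sub("[REDACTED]", record.getMessage())
        record.args = ()
        return True  # keep the record, only its content changes


logger = logging.getLogger("orders")
logger.addFilter(RedactEmails())
logger.warning("password reset requested by %s", "jane.doe@example.com")
```

Because the filter is attached to the logger, the address is removed before the record reaches any file, console or remote handler.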
The terminology around tracing, API gateways, service meshes, logging, measuring, etc. causes a lot of confusion, which also makes it harder to compare the tools. Reviews of observability suites always include a part about basic terminology, which explains how the particular vendor or open source organization understands the terms. Of course, there are attempts at standardization (OpenTelemetry, for instance), but the influx of different products and hype does not help with this.
The current set of technologies is getting better at answering the question of what is happening from a technology standpoint (both software and infrastructure).
As all the companies are quickly becoming software driven companies, the answers related to software are actually getting closer to the actual business processes themselves.
This requires end-to-end observability across the entire organization, including business partners. It is becoming more popular; however, it’s still a very ambitious goal.
Data lineage is a very interesting tool that tracks what is happening to the data across the organization.
However in this context, we mean applying observability to the data processing pipeline, including machine learning models and their execution, in order to measure and verify their accuracy and prevent degradation, plan adjustments and improvements. This is, of course, related to Continuous Delivery For Machine Learning (CD4ML) and machine learning operations (MLOps).
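A minimal sketch of observing a model in production: keep a rolling window of prediction outcomes and flag the model when accuracy drops below an agreed floor. The `AccuracyMonitor` class, the window size and the accuracy floor are hypothetical, and the sketch assumes that the true labels eventually become available.

```python
from collections import deque


class AccuracyMonitor:
    """Track model accuracy over the most recent predictions."""

    def __init__(self, window=100, min_accuracy=0.9):
        self.outcomes = deque(maxlen=window)  # True for every correct prediction
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual):
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        if not self.outcomes:
            return None  # nothing observed yet
        return sum(self.outcomes) / len(self.outcomes)

    def degraded(self):
        acc = self.accuracy()
        return acc is not None and acc < self.min_accuracy


monitor = AccuracyMonitor(window=4, min_accuracy=0.75)
for predicted, actual in [("spam", "spam"), ("ham", "ham"), ("spam", "ham"), ("ham", "spam")]:
    monitor.record(predicted, actual)
print(monitor.accuracy(), monitor.degraded())
```

A degraded flag like this one is the kind of signal that would trigger retraining or a rollback in an MLOps pipeline.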
Observability is not just about running software; the CI/CD pipelines should be observable as well. They are also specialized software with scripts, configuration and data.
Another integration scenario is trying different versions of software and, thanks to observability, comparing which version (for instance, which version of a CPU-intensive algorithm) behaves and performs the best.
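The comparison can be sketched with the standard `timeit` module. The two `sum_squares` implementations are invented stand-ins for two versions of the same algorithm; in a real system the timings would come from production metrics rather than a local micro-benchmark.

```python
import timeit


def sum_squares_v1(n):
    """Version 1: straightforward loop over 0..n-1."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def sum_squares_v2(n):
    """Version 2: closed-form formula for the same sum."""
    return (n - 1) * n * (2 * n - 1) // 6


def compare(candidates, n=10_000, repeat=5):
    """Return the best observed time in seconds for each named candidate."""
    timings = {}
    for name, fn in candidates.items():
        timings[name] = min(timeit.repeat(lambda: fn(n), number=10, repeat=repeat))
    return timings


timings = compare({"v1": sum_squares_v1, "v2": sum_squares_v2})
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds:.6f} s")
```

Speed is only half of the comparison: both versions must also return the same results, which is worth asserting before any timing is trusted.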
Observability seems to be all the rage right now in IT. And, it is fully justified by the growing demands of business for IT in order to remain in control despite the need for greater flexibility and the growing complexity of digital products.
Entire concepts and solutions, such as service meshes, were built with observability as one of their main goals. It’s best to add it as part of the broader evolutionary architecture transformation towards composable enterprises. Tools are important because they deliver both solutions and inspiration about the process and its capabilities, but they should be placed in the right context of the process. Actionable insights and the resulting improvements are the true value of observability.