Service mesh – upgrade your microservices architecture to avoid a service mess


Why service mesh?

Microservice architecture is one of the most popular architectural styles today.

It has been embraced by both startups and enterprises in all business sectors.

This architectural choice is supported by the right infrastructure. It started with the move from virtual machines to containers, and then from containers to advanced container fleet management. You guessed it: Kubernetes is the hottest technology for enterprises right now.


What is the next level for microservices infrastructure support?

Aren’t API gateways and Kubernetes enough?

Decoupling service communication from service implementation

Early adopters of microservices used proprietary frameworks bound to their programming environment, such as Java, Node.js, and .NET.

This is not acceptable anymore, especially when services are written in different languages and run on different runtime environments.

Functionalities such as load balancing, circuit breakers, service discovery, retry logic, etc. have become tightly coupled with the development environment. 

Coupling is almost always a bad idea. We are used to dealing with it at the functional level with the microservices paradigm, and at the API level with tools such as couper.io.


Infrastructure has become a concern of developers. The focus of development teams has shifted too far from the actual business logic towards managing the infrastructure of microservices. The infrastructure should be decoupled from the business logic and should not demand the developers’ attention.

Microservices decouple application functionalities while service mesh architecture decouples microservices from infrastructure concerns. It’s a classic example of the separation of concerns, which is one of the key mega patterns of software development.

To end blindness and enable visibility

Another problem is the visibility of service-to-service communication and the ability of the operations team to monitor and observe it in real time.

It is usually well addressed on the API edge level when communicating with external API consumers and providers, but much worse internally.

The dream of communication graphs has become common among DevOps teams.

What service mesh delivers

Service mesh does not have a single clear definition. It’s not a single technology or pattern. It’s a set of patterns and supporting technologies. Let’s start with what service mesh offers.

Observability

Knowing what is happening in a complex microservices communication graph is very important. All IT teams want to quickly find the causes of improper behavior in their microservices and their configurations. This is the key to detecting and fixing problems quickly, and so to keeping digital solutions performant and reliable.

Being able to see service-to-service communication in real time and analyze it helps to discover any dependencies between multiple services and optimize the infrastructure for better performance and reliability.

Traffic management

Service mesh provides features such as circuit breaking, load balancing, service discovery, and timeout management.

This decouples those features from the various client libraries and runtimes, which are not suited to organization-level service management.
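
To make one of these features concrete, here is a minimal sketch of a circuit breaker, the kind of logic a mesh takes out of each service's client library. The class name, the thresholds, and the injectable `clock` parameter are assumptions made for this illustration; this is not how any particular mesh implements it.

```python
import time


class CircuitBreaker:
    """Toy circuit breaker: opens after N consecutive failures, retries after a cool-down."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock  # injectable so the behavior can be tested without sleeping
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: call rejected without reaching the service")
            # Cool-down elapsed: let one trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

With a mesh, this state lives in the proxy next to the service rather than in the service code, so every language and runtime gets the same behavior.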

Logging

Logging is a baseline functionality, but in the case of service mesh, it’s the logging of all communication from a new point of view: not that of an individual service, and not only that of the service orchestrator, but a view of the entire system.

Metrics

One of the benefits of service mesh is to remove coupling between the code and the metrics. The other is to be able to optimize globally, because of the uniformity of the metrics.

Metrics such as latency, performance, and time to first byte enable better autoscaling beyond the scope of a single service orchestrator.

Tracing

The idea is, again, to decouple tracing from services. Let the tracing be done in a uniform and consistent way at the level of service mesh. This makes tracing almost effortless compared to manual instrumentation that has to be added to the code.

There are of course cons to this approach, but in my opinion the low entry barrier and uniformity are the winning factors.

Traffic control

Resiliency-related features such as circuit breakers, retry policies, latency-aware routing, timeouts, and deadlines have too often been implemented in multiple places and in different ways.

What service mesh offers here is, again, uniformity and the decoupling of traffic management policies from service implementation and configuration.

It also enables deadlines: feature-level timeouts that are closer to the actual business functionality. These are best handled by a higher-level management solution such as service mesh.
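
The difference between a per-call timeout and a deadline is easier to see in code. The sketch below is an illustration only: `call_with_deadline`, `DeadlineExceeded`, and the `clock` parameter are hypothetical names, and a real mesh would apply this in the proxy rather than in application code. One time budget covers the whole feature, retries included, and the remaining budget is handed to each attempt so it can be passed downstream.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when the overall time budget for a feature is spent."""


def call_with_deadline(func, deadline, max_attempts=3, clock=time.monotonic):
    """Retry func until it succeeds, attempts run out, or the deadline passes.

    func receives the remaining budget in seconds so it can pass it downstream.
    Assumes max_attempts >= 1.
    """
    last_error = None
    for _ in range(max_attempts):
        remaining = deadline - clock()
        if remaining <= 0:
            raise DeadlineExceeded("time budget spent") from last_error
        try:
            return func(remaining)
        except DeadlineExceeded:
            raise  # a downstream deadline failure is final, do not retry
        except Exception as exc:
            last_error = exc  # retry while budget and attempts remain
    raise last_error
```

A caller would compute the deadline once, for example `time.monotonic() + 2.0` for a two-second budget, and reuse it for every hop that belongs to the same feature.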

Rate limiting and throttling are usually based on policies which are managed independently from services and service orchestrators. 

For instance, limits are higher for authenticated users, even higher for paying subscribers and very limited for anonymous users.

In case of very high traffic, service mesh may even temporarily block traffic for anonymous users to preserve the capacity for more important customers.
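
As a rough sketch of such a policy, the example below keeps one token bucket per user tier and adds an overload switch that sheds anonymous traffic. The tier names, the rates, and the `overloaded` flag are assumptions chosen for illustration, and a single bucket shared by all users of a tier is a deliberate simplification.

```python
import time

# Hypothetical per-tier limits in requests per second; the numbers are illustrative only.
TIER_LIMITS = {"anonymous": 1.0, "authenticated": 5.0, "subscriber": 20.0}


class TieredRateLimiter:
    """One token bucket per tier, plus an overload switch that sheds anonymous traffic."""

    def __init__(self, limits=TIER_LIMITS, clock=time.monotonic):
        self.limits = dict(limits)
        self.clock = clock
        self.overloaded = False
        start = clock()
        # Each bucket starts full and holds at most one second's worth of requests.
        self.buckets = {tier: (rate, start) for tier, rate in self.limits.items()}

    def allow(self, tier):
        if self.overloaded and tier == "anonymous":
            return False  # preserve capacity for more important customers
        rate = self.limits[tier]
        tokens, last = self.buckets[tier]
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket size.
        tokens = min(rate, tokens + (now - last) * rate)
        if tokens < 1.0:
            self.buckets[tier] = (tokens, now)
            return False
        self.buckets[tier] = (tokens - 1.0, now)
        return True
```

Because the policy lives outside the services, changing a limit or flipping the overload switch does not require redeploying any of them.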

Security

Service mesh can enable protocol-based security between internal and external services. It can rely on certificate-based authentication of both the client and the server. Without service mesh, this is hard to achieve across the entire organization, as well as at the API edge with external partners.

Automated TLS and certificates make the internal service communication much more secure with minimal management cost.
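
To show what the mesh automates, here is a minimal sketch of the server-side setup a service would otherwise need in order to demand client certificates, using Python's standard `ssl` module. The function name and its file-path parameters are hypothetical; issuing, distributing, and rotating the certificates is the part a mesh takes over and is not shown.

```python
import ssl


def mutual_tls_server_context(cert_file=None, key_file=None, ca_file=None):
    """Build a server-side TLS context that rejects clients without a valid certificate."""
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.verify_mode = ssl.CERT_REQUIRED  # the client must present a certificate too
    if cert_file and key_file:
        # This service's own identity (paths are supplied by the caller).
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if ca_file:
        # The certificate authority that signs the peers' certificates.
        ctx.load_verify_locations(cafile=ca_file)
    return ctx
```

Repeating this in every service, in every language, and keeping the certificates fresh is the management cost that automated TLS removes.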

Many companies simply give up on strengthening the security of their internal services because it’s too complicated and expensive. The typical approach focuses mainly on perimeter security and much less on the inside, which some security analysts call the soft underbelly of security. With service mesh it still takes effort, but it often becomes an acceptable investment rather than a security burden.

Protocol translation

It’s not only HTTP and JSON, as gRPC is growing in importance because of its performance and reliability. Different services can consume and expose APIs over different protocols, especially internally. Service mesh can translate gRPC into JSON and back, helping to better integrate internal services with external API gateways.

Both internal and external

What is interesting and very important to remember about service mesh is that all of this applies both to external (north-south) traffic and to internal (east-west) traffic. This is especially important for large organizations with thousands of services and complex cross-domain, cross-department, and cross-country integrations.

Data plane and control plane

Service mesh implementation is usually divided into two planes:

  • Data plane – doing actual work, such as message routing, encryption, discovery, etc.
  • Control plane – administration, configuration and control of the data plane
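
The split between the two planes can be sketched in a few lines. The classes and method names below are made up for illustration and do not mirror any real mesh API: the proxy does the per-request work, while the control plane only holds the desired configuration and pushes it out.

```python
class Proxy:
    """Data plane: handles traffic for one service and routes its outbound calls."""

    def __init__(self, service_name):
        self.service_name = service_name
        self.routes = {}  # destination service -> list of endpoints

    def apply_config(self, routes):
        # Keep a private copy so later control plane changes arrive only via a push.
        self.routes = {dest: list(endpoints) for dest, endpoints in routes.items()}

    def route(self, destination, request_id):
        endpoints = self.routes[destination]
        # Simple deterministic load balancing across the known endpoints.
        return endpoints[request_id % len(endpoints)]


class ControlPlane:
    """Control plane: holds the desired configuration and pushes it to every proxy."""

    def __init__(self):
        self.proxies = []
        self.routes = {}

    def register(self, proxy):
        self.proxies.append(proxy)
        proxy.apply_config(self.routes)

    def set_endpoints(self, destination, endpoints):
        self.routes[destination] = list(endpoints)
        for proxy in self.proxies:
            proxy.apply_config(self.routes)
```

No request ever passes through `ControlPlane`; it changes how the proxies behave, not what they carry.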

Leading service mesh solutions

Service mesh has actually been with us for years, thanks to the internet giants that run billions of service instances around the clock across the globe. The lessons they learned are now available to large and mid-sized enterprises.

Istio, often referred to as the “second explosion after Kubernetes”, seems to be the most popular service mesh solution. Its combined score for ‘evaluating’ and ‘in production’ is the highest in the Cloud Native Computing Foundation (CNCF) survey; however, its actual production usage is lower than that of Conduit.

Linkerd is advertised as the lightest and the fastest service mesh for Kubernetes. According to the CNCF survey, it’s the most “evaluated” service mesh with visible albeit small usage in production. 

Conduit is another lightweight service mesh, one so successful that it was announced as the base for Linkerd 2.0. According to the CNCF surveys, it was the most popular service mesh in production as of fall 2020.

There are others as well. In this case it is developers, along with DevOps people, who will decide which ones become the most popular and evolve the fastest.

Will the combined power of Conduit and Linkerd win the hearts and minds of the DevOps community over Istio? Only time will tell.

Our team at Avenga is leaning towards Istio; however, we remain open to other options.

It’s good to have competing solutions in this area as it speeds up innovation and the availability of new features.

Adoption reality

Service mesh seems very promising for both the present and the future of microservices and functions as a service. Cloud native proponents already benefit from Istio, Linkerd, Conduit, and other solutions.

Service mesh vs. client libraries

Microservice frameworks such as Netflix Hystrix, Netflix Ribbon, and Twitter Finagle have become synonymous with microservices management.

They are proven in action and widely understood. However, they embody the very problem that service mesh addresses: coupling service implementations and configurations with cross-service concerns. Service mesh helps to decouple these concerns in order to enable more control and flexibility.

Most importantly, it helps developers focus again on the business value their code delivers instead of fighting with infrastructure issues.

Service mesh vs. API gateway

API gateways are usually focused on so-called north-south traffic. This means traffic from within your organization to the external world, and vice versa.

East-west traffic is usually treated as secure and as not requiring mechanics as advanced as those for north-south traffic.

Service mesh changes this: traffic in all directions is managed, monitored, observed, and secured by a central solution.

API gateways are not going anywhere, as they work together with service mesh.


Service mesh vs. container orchestrator

Isn’t a container orchestrator, such as the famous Kubernetes, enough?

Kubernetes and similar tools provide basic functionalities at a lower level of the infrastructure.

Frequently cited service needs that they do not cover include circuit breaking, granular traffic routing, chaos testing, canary deployment, per-request routing, backpressure, transport encryption, access control, quota management, and policies.
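
Two items from that list, canary deployment and per-request routing, can be illustrated with a short sketch. The function names, the `x-version` header, and the 90/10 split are assumptions invented for this example; real meshes express the same idea declaratively in their own configuration formats.

```python
import random


def pick_version(weights, rng=random):
    """Pick a service version at random, in proportion to its traffic weight."""
    versions = list(weights)
    return rng.choices(versions, weights=[weights[v] for v in versions], k=1)[0]


def route_request(headers, weights, rng=random):
    """Per-request routing: a (hypothetical) header pins the request to a version."""
    pinned = headers.get("x-version")
    if pinned in weights:
        return pinned
    # No valid pin: fall back to the weighted canary split.
    return pick_version(weights, rng)


# Illustrative canary split: roughly 10% of unpinned traffic reaches the new version.
CANARY_WEIGHTS = {"v1": 90, "v2": 10}
```

Shifting more traffic to the canary then means changing the weights in one place, without touching or redeploying the services themselves.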

Is there an overlap? Yes, definitely.

Is that a problem? Not necessarily. It means more choices, and the overlap is beneficial only with proper governance, that is, with a clear decision about which features to use from each solution.

When to adopt service mesh

The key factors are the following:

  • If you have hundreds or thousands of APIs to manage and monitor
  • If you need internal service telemetry and monitoring
  • If you need to enforce internal security in the spirit of the Zero-Trust Architecture paradigm
  • If you want to decouple and centralize all aspects of service traffic control, observability, fault monitoring, security, and distributed debugging
  • If your traffic management policies change often and should not be coupled with microservice logic
  • If your advanced traffic management policies also have to be applied to internal traffic

New skill set required

The DevOps skill set is currently in high demand. With service mesh, there’s another layer for DevOps to implement, manage, and improve over time, which means an additional skill set for the team. Kubernetes and the cloud native stack are already complex enough to make it hard to find the right people. Service mesh adds another layer of complexity for DevOps in order to make things simpler for developers.

It’s another infrastructure related investment.

New mind set required

To use service mesh effectively, both sides – operations and developers – have to understand the implications of using service mesh. For example, developers should not play with the infrastructure anymore. Still, they need to be aware of the way it works to be able to use it effectively.

But wait a moment, weren’t we supposed to decouple and not think about it anymore?

To some extent, yes. However, hiding behind a higher-level abstraction without understanding it is never a good thing, so education is needed for both developers and operations.

I want it all and I want it now

This isn’t a good idea for a service mesh introduction. All the experts recommend a step-by-step approach with multiple small steps. However, some benefits of service mesh can be achieved early, before the ‘full’ implementation of this paradigm.

The first is better observability and communication graphs. Increased visibility enables better optimization and troubleshooting of microservices. It’s definitely the most attractive selling point of service mesh.

Then you can decide what matters more to you, such as better quota and traffic management or security, and enable those features gradually, step by step. No revolution is required this time.

This is also the approach to service mesh that our DevOps experts propose to our existing and future business partners.

Future of service mesh

The future is bright for service meshes.

Bringing more visibility, security, and control to microservice-based enterprises has become an urgent need.

What we love about service meshes is that they can be introduced safely and smoothly. The solutions have matured, and many lessons have already been learned.

Get in control of your microservices with service mesh!
