Director of Avenga Labs
Most companies want to become successful data-driven companies sooner rather than later. There’s a big expectation that data will help optimize business efficiency and target customers with more personalized products and services. Companies often invest truckloads of money into data and machine learning (ML) projects, so the effectiveness of those projects is of paramount importance to them.
Companies usually build sophisticated data flows from the bottom up: data is extracted from their own well-known sources into data lakes, then transformed into centralized data storages and warehouses that are managed globally. Of course, there are newer ideas, such as data mesh, but they are still novelties at the moment.
The idea behind this is that no data should be hidden from the opportunity to extract valuable information from it, which in turn helps the business be more efficient and successful. Even if there’s no idea yet of what to do with most of this data (estimates run up to 95% of the data in particular cases, such as cars), data scientists, including citizen data scientists, will soon figure it out.
And, the benefits of this approach are great.
First of all, a significant benefit is having the highest possible level of control over the data, its structure, statistical parameters, and distributions. Data quality management rules can be applied to all the data sources and transformations. Machine learning projects are hard, iterative, and always experimental, but one great advantage of centralized machine learning is that it reduces the risks related to proper data management.
Centralized machine learning processing also enables better scalability in the training of models, along with better computing resource utilization, testing, and management. New technologies, such as MLflow, which enable MLOps, are also of great interest and help.
From a machine learning perspective, being able to train and immediately apply models on the largest possible data sets is a great asset for optimizing the models. In many cases, it will decide the go or no-go for particular models, as business-relevant accuracy targets are always hard to achieve.
The key points here are that centralized machine learning is not possible in every business scenario, and that many opportunities would be lost if there were no alternative to it.
Centralized machine learning assumes access and control of all the data of the organization and related partners.
Well, there are scenarios where this is not possible for regulatory reasons. For example, different medical service providers are not allowed to access each other’s data, copy it or store it in their separate data infrastructures. Yet, the sum of this data is a very valuable asset which may benefit not just a single entity, but all the partners involved. Another example can be related to industrial applications of IoT, as devices have their own data and processing capabilities in their edge nodes, which can be used to train and refine machine learning algorithms.
The alternative of sending all the data from every device to a central location does exist, but with power and connectivity limitations it is often not a viable option. Before federated machine learning, the data had to stay on the devices, a wasted data opportunity.
→ Have a look at Continuous Delivery for Machine Learning (CD4ML)
Models trained on larger and more complex datasets will be much better for any of the participants.
Federated machine learning is about taking advantage of separate data sources in order to build better models than each particular source would allow individually.
In the case of the very popular deep neural networks (DNNs), it means models are trained locally on multiple processing nodes with local data only, without accessing data from the other nodes. The models themselves and their tuning parameters are frequently exchanged between nodes to help optimize model performance, and the nodes react accordingly to the changing results. For DNNs, network weights and hyperparameters are exchanged between nodes via specialized coordinators and synchronization routines.
Does it sound a little like distributed learning, which we can apply in a classical centralized machine learning flow? It uses the evergreen idea of divide and conquer: splitting the larger problem into smaller units and then combining the results. This is partially true, as federated learning is always distributed, but let’s specify the key differences between distributed and federated learning.
First of all, in the case of distributed machine learning, the data is divided into very similar units, with similar sizes, similar data characteristics, and guaranteed data schema consistency. In the case of federated machine learning, the data sets can differ significantly in size (by orders of magnitude or more) and may not even be homogeneous, as that assumption is hard to achieve. Nodes in federated learning may contain totally different data distributions and may come from different business sub-areas, countries, or devices which all analyze images, but images of different objects in different contexts.
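As a toy illustration of this difference, assuming numpy and a made-up five-class labeled dataset (the node names and split choices here are purely hypothetical), compare an IID split for distributed learning with a federated-style non-IID split:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy dataset: 1000 samples with labels 0-4.
labels = rng.integers(0, 5, size=1000)

# IID split (distributed learning): shuffle, then cut into equal shards.
iid_shards = np.array_split(rng.permutation(labels), 4)

# Non-IID split (federated learning): each node sees only some labels,
# and the shard sizes differ by an order of magnitude.
non_iid_shards = [
    labels[labels == 0],              # node A: one class only
    labels[np.isin(labels, [1, 2])],  # node B: two classes
    labels[labels == 3][:20],         # node C: a tiny shard of one class
    labels[labels == 4],              # node D: yet another class
]

for i, shard in enumerate(non_iid_shards):
    print(f"node {i}: size={len(shard)}, classes={sorted(set(shard.tolist()))}")
```

The IID shards all look statistically alike; the non-IID shards differ in both size and label distribution, which is exactly what makes federated learning harder.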
Certainly, federated learning is much harder than the already complex distributed machine learning.
Being iterative is a fundamental property of federated learning; federated learning rounds are the key processing units. Each round starts with distributing the existing model to all the nodes, which in itself can be a challenge, as nodes may be far away from each other, run different edge processing technologies, and of course hold different local datasets. New configurations are also applied to the nodes, and the nodes respond with messages that may contain status information; but the key is the exchange of models and parameters between the nodes and back to the coordinator.
In the simpler case, the entire process is managed by a master server which orchestrates the workflow, analyses the results, and reacts accordingly. In the simplest case, this is synchronized communication with easily identifiable synchronization points and relatively high consistency.
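A minimal sketch of one such synchronous, centrally coordinated round, assuming numpy, a simple linear model, and made-up, deliberately unbalanced node shards (this illustrates the federated averaging idea, not any specific framework’s API):

```python
import numpy as np

rng = np.random.default_rng(0)

def local_train(w, X, y, lr=0.1, epochs=50):
    """One node's local training: gradient steps on its own data only."""
    w = w.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

# Each node holds a private shard; the coordinator never sees X or y.
true_w = np.array([2.0, -1.0])
nodes = []
for n_samples in (200, 50, 500):  # deliberately unbalanced shard sizes
    X = rng.normal(size=(n_samples, 2))
    y = X @ true_w + 0.1 * rng.normal(size=n_samples)
    nodes.append((X, y))

w_global = np.zeros(2)
for _ in range(10):  # federated learning rounds
    # 1. The coordinator distributes the current global model to every node.
    local_models = [local_train(w_global, X, y) for X, y in nodes]
    # 2. Nodes send back weights only; aggregate weighted by shard size.
    sizes = np.array([len(y) for _, y in nodes])
    w_global = np.average(local_models, axis=0, weights=sizes)

print(w_global)  # close to the true weights [2.0, -1.0]
```

Only model weights cross the network; each shard stays on its node, yet the aggregated model recovers the relationship present across all of them.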
However, the nodes also differ in their performance, so the latest research embraces more asynchronous communication in order to optimize the entire learning process and avoid waiting for the slowest and least efficient nodes all the time. Of course, this means additional complexity, but the benefit is faster learning for the entire model.
Whether synchronous or asynchronous, we are still talking about the simplest case: centralized coordination of federated learning. (It’s not centralized machine learning, but federated learning orchestrated by a central server.)
In decentralized federated learning scenarios, the nodes are responsible for their own coordination. The idea is to avoid both a single point of failure and the need for constant communication with HQ, replacing the central coordinator with self-electing leaders within the node sets and clusters. It’s the same idea that was successfully embraced by the largest cloud providers in their internal infrastructures: avoiding the bottleneck of central coordination at the expense of occasional node coordination failures.
Classical machine learning assumes the data is independent and identically distributed (IID). This assumption is almost never true in the case of federated learning; non-IID data is and always will be a challenge to address within federated learning solutions. Local nodes store data with very different statistical distributions (covariate shift), store labels with different statistical distributions than other nodes, and in addition, the same labels may correspond to different features and the same features may correspond to different labels. Add to this the imbalance of nodes in terms of data size, and it creates a lot of complexity to be addressed by data scientists, tools, and workflow management.
→ Avenga Data science perspective on Covid-19: a real life example
One of the most important elements is the communication between nodes. On the one hand, it must be as efficient as possible; on the other hand, the communication capabilities of nodes may be very limited in terms of latency, bandwidth, and power consumption. This is not a local data center with fast 10+ Gb connections, near-zero latency, and very high reliability and redundancy.
The solution comes from a new generation of flexible node-to-node communication algorithms which require less communication than standard protocols and orchestration engines.
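One common communication-saving trick is quantizing the exchanged parameters before transmission. A rough sketch, assuming numpy and simple uniform 8-bit quantization (real systems use more sophisticated compression, error feedback, and sparsification):

```python
import numpy as np

def quantize(w, bits=8):
    """Uniform quantization: send small integer codes plus two floats
    instead of full float64 weights."""
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / (2**bits - 1)
    codes = np.round((w - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Reconstruct approximate weights on the receiving side."""
    return codes.astype(np.float64) * scale + lo

weights = np.random.default_rng(1).normal(size=10_000)  # one layer's weights
codes, lo, scale = quantize(weights)

ratio = weights.nbytes / codes.nbytes
err = np.abs(dequantize(codes, lo, scale) - weights).max()
print(f"compression: {ratio:.0f}x, max error: {err:.4f}")
```

Eight times less traffic per exchange, at the cost of a small, bounded reconstruction error; that trade-off is often acceptable for weight updates.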
It should be noted that nodes are not guaranteed to perform: some may be down, and some will train much faster than others, but this is the nature of federation. Nodes are not all the same.
Another problem is the local bias per node. Bias is already a very hard problem for machine learning even in the simpler centralized model; with nodes holding different data, it becomes one level harder.
Each node may have different data corrections applied, with different data quality degradations, which cannot be assessed and controlled at the global level (as we may not access the nodes’ data directly).
Plus, each node may have different regulatory requirements for data, as they may come from different regulatory arenas.
By now, the impression of federated machine learning is probably that it is hard. Yes, it is, but this is the nature of the problem. Still, there are great advantages which make companies invest in these new technologies. There’s a time for a small POC and a time for something more complex, and I’m glad companies keep building better ML models despite the difficulties.
The first big advantage is the ability to avoid violating data protection regulations, both general and sector-specific. This makes federated learning an enabler of better models for everyone.
The privacy of nodes is protected by default: no personal data leaves or enters the nodes at any time, so federated learning has privacy by design. The fundamental concept here is data separation.
Machine learning parameters are also encrypted to prevent discovering too much about the underlying data; even so, the parameters themselves might leak enough data characteristics that they cannot be considered fully private. Another problem is the running statistics of batch normalization layers, which has been addressed by introducing static batch normalization (sBN).
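One way such protection can work is secure aggregation: each pair of nodes agrees on a random mask that cancels out in the coordinator’s sum, so the coordinator learns only the aggregate, never an individual node’s update. A deliberately simplified sketch, assuming numpy and ignoring key agreement and node dropouts:

```python
import numpy as np

rng = np.random.default_rng(7)

# Three nodes' model updates (the values we want to keep private).
updates = [rng.normal(size=4) for _ in range(3)]

# Every pair (i, j) shares a random mask; node i adds it, node j
# subtracts it, so all masks cancel in the coordinator's sum.
n = len(updates)
masks = {(i, j): rng.normal(size=4) for i in range(n) for j in range(i + 1, n)}

masked = []
for i in range(n):
    m = updates[i].copy()
    for j in range(n):
        if i < j:
            m += masks[(i, j)]
        elif j < i:
            m -= masks[(j, i)]
    masked.append(m)

# Each masked update looks like noise, but the sum is exact.
print(np.allclose(sum(masked), sum(updates)))  # True
```

The coordinator can still average the updates correctly, while any single masked update it receives reveals essentially nothing about that node’s data.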
Another advantage of federated models comes from the autonomy of the nodes. A node may combine the common, shared ML model with its local data characteristics, creating a hybrid model which delivers more personalized results than the general model. There’s an entire area of research and development around this personalization in federated models.
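A rough sketch of one simple personalization strategy, assuming numpy, a linear model, and made-up local data: start from the shared global weights and fine-tune them on the node’s own data only.

```python
import numpy as np

rng = np.random.default_rng(3)

def sgd_steps(w, X, y, lr=0.1, steps=100):
    """Local fine-tuning: gradient steps on this node's data only."""
    w = w.copy()
    for _ in range(steps):
        w -= lr * 2 * X.T @ (X @ w - y) / len(y)
    return w

# A shared global model (e.g., produced by federated averaging).
w_global = np.array([1.0, 1.0])

# This node's local data follows a slightly different relationship.
X = rng.normal(size=(300, 2))
y = X @ np.array([1.5, 0.5]) + 0.1 * rng.normal(size=300)

# Personalization: start from the global model, fine-tune locally.
w_personal = sgd_steps(w_global, X, y)

global_err = np.mean((X @ w_global - y) ** 2)
personal_err = np.mean((X @ w_personal - y) ** 2)
print(personal_err < global_err)  # True: the hybrid model fits local data better
```

The global model provides a strong starting point; the local fine-tuning adapts it to this node’s particular data distribution.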
Legal regulations for data protection can be met despite the nodes being in different regulatory zones (US, EU, etc.), where each of these laws would exclude combining all the data together in one unified data store. With federated learning, the models can be created and maintained without violating any of these laws, to the great benefit of all parties involved.
In conclusion, federated learning enables us to benefit from data which would otherwise be inaccessible. It is a very smart idea and implementation, especially in our divided world where regulation is only getting stronger and more fragmented. Before it, data scientists could only work with local data sets.
→ AI & Machine Learning in Finance: the Whys, the Hows and the Use Cases
The area of federated learning is new and developing very rapidly; there’s a lot of research, and more and more ready-to-use patterns and tools. The future is bright for federated learning.
→ Explore Avenga data and data science services