Federated Learning (Part I): Old Wine in a New Bottle?

Dr. Jorge Quiané
January 11, 2021

Machine Learning (ML) is a branch of Artificial Intelligence (AI). The main idea of ML is to enable systems to learn from historical data in order to predict output values for new inputs. The beauty is that ML does not require systems to be explicitly programmed to achieve this, and it needs little human intervention.

With the growing volumes of data in today's world, ML has gained unprecedented popularity. We can achieve today what was unimaginable yesterday: from predicting cancer risk from mammograms to polyglot AI translators. As a result, ML has become a key competitive differentiator for many companies, and ML-powered software has quickly become omnipresent in our lives. The key observation in ML is that the more data is available, the better the accuracy of the predictive models.


The Appearance of Distributed ML

While ML has become a quite powerful technology, its hunger for training data makes it hard to build ML models on a single machine. It is not unusual to see training data sizes on the order of hundreds of gigabytes to terabytes, such as in the Earth Observation domain. This has created the need to build ML models over data distributed across multiple storage nodes.

Distributed ML aims at learning ML models using multiple compute nodes, to cope with larger training datasets as well as to improve performance and model accuracy [1]. Thus, distributed ML helps organizations and individuals draw meaningful conclusions from vast amounts of training data. Healthcare and advertising are just two of the sectors that benefit most from distributed ML.

There exist two fundamental ways to perform distributed ML: data parallelism and model parallelism [2]. Figure 1 illustrates these two approaches.


Figure 1: Data parallelism vs. model parallelism.


In the data parallelism approach, the system horizontally partitions the input training data, usually creating as many partitions as there are compute nodes (workers), and distributes each data partition to a different worker. It then sends the same model features to every worker, each of which learns a local model using its data partition as input. The workers then send their local models to a central place, where the system merges them into a single global model.
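To make this concrete, here is a minimal Python sketch of data parallelism; the partition count, the linear model, and the unweighted-averaging merge rule are illustrative assumptions, not a prescribed implementation:

```python
import numpy as np

# Illustrative data parallelism: fit a linear model independently on
# each horizontal partition of the data, then merge the local models.

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                  # full training data
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)

def fit_local(X_part, y_part):
    # One worker: ordinary least squares on its own data partition.
    w, *_ = np.linalg.lstsq(X_part, y_part, rcond=None)
    return w

n_workers = 4
local_models = [
    fit_local(X_p, y_p)
    for X_p, y_p in zip(np.array_split(X, n_workers),
                        np.array_split(y, n_workers))
]

# Central merge step: here, a simple unweighted average of the models.
global_w = np.mean(local_models, axis=0)
print(global_w)   # close to true_w
```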

The model parallelism approach, in contrast, partitions the model features and sends each model partition to a different worker, which in turn builds a local (partial) model using the same input data. That is, the entire input training data is replicated across all workers. The system then brings these local models together in a centralized place and aggregates them into a single global model.
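As a counterpart, here is a hedged sketch of model parallelism for the same kind of linear model: each simulated worker sees the full training data but owns only one block of the model's features and updates only that block, while the coordinator sums the workers' partial predictions. The block split and the gradient-descent update rule are, again, illustrative assumptions:

```python
import numpy as np

# Illustrative model parallelism: the full data is replicated, but the
# model's feature weights are partitioned into per-worker blocks.

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))                  # replicated on all workers
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0, 1.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

n_workers = 3
blocks = np.array_split(np.arange(X.shape[1]), n_workers)
w_blocks = [np.zeros(len(b)) for b in blocks]   # each worker's block

lr = 0.1
for _ in range(300):
    # Each worker contributes a partial prediction from its own block;
    # the coordinator sums them into the global prediction.
    y_hat = sum(X[:, b] @ w for b, w in zip(blocks, w_blocks))
    residual = y_hat - y
    # Each worker updates only its own block of the model.
    for k, b in enumerate(blocks):
        w_blocks[k] -= lr * (X[:, b].T @ residual) / len(y)

print(np.concatenate(w_blocks))   # close to true_w
```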

"Yet, although powerful, distributed ML has a core assumption that limits its applicability: one needs to have control and access over the entire training data."

However, in a growing number of cases, such as in the healthcare domain, one cannot directly access the raw data, and distributed ML therefore cannot be applied.


The Emergence of Federated Learning

Federated learning (FL) was first introduced by Google in 2017 [3]. Yet, the concept of federated analytics/databases dates back to the 1980s [4]. Similar to federated databases, FL aims at bringing computation to where the data is.
FL is essentially a distributed ML approach but, in contrast to traditional distributed ML, the raw data residing at the different workers is never moved out of them. The workers own the data, and they are the only ones with control over and direct access to it. Generally speaking, FL allows models to gain experience from a more diverse set of datasets located at different independent/autonomous sites.

Ensuring data privacy is crucial in today's world, where it has risen to become one of society's main concerns. For example, many governments have enacted laws, such as the GDPR [5] and the CCPA [6], to control the way data is stored and processed. FL enables organizations and individuals to train ML models across multiple autonomous parties without compromising data privacy. During training, the parties share only their local models to learn from each other. Thus, organizations and individuals can leverage others' data to learn more robust ML models than they could with their own data alone.

"The beauty of FL is that it enables organizations and individuals to collaborate towards a common goal without sacrificing data privacy."

Multiple participants collaboratively train a model with their sensitive data and communicate among themselves only the learnt local models. Figure 2 illustrates the general architecture of FL.

FL leverages the same two fundamental execution modes to build models across multiple participants: horizontal learning (data parallelism) and vertical learning (model parallelism).


Figure 2: General architecture of enterprise federated learning.
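To illustrate how such a round-based collaboration could look, here is a minimal Python sketch in the spirit of federated averaging; the Participant class, the number of rounds, and the size-weighted averaging rule are hypothetical choices for illustration. Note that only model weights cross the participant boundary; the raw data never does:

```python
import numpy as np

# Illustrative federated learning rounds (FedAvg-style): participants
# train locally on private data and share only their model weights.

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])

class Participant:
    def __init__(self, n_samples):
        # Private training data: never leaves this participant.
        self.X = rng.normal(size=(n_samples, 3))
        self.y = self.X @ true_w + 0.1 * rng.normal(size=n_samples)

    def local_update(self, w, lr=0.1, epochs=5):
        # Refine the current global model on local data only.
        w = w.copy()
        for _ in range(epochs):
            grad = self.X.T @ (self.X @ w - self.y) / len(self.y)
            w -= lr * grad
        return w, len(self.y)    # only weights + sample count are shared

participants = [Participant(n) for n in (200, 500, 300)]
global_w = np.zeros(3)

for _ in range(20):              # federated training rounds
    updates = [p.local_update(global_w) for p in participants]
    # Coordinator aggregates local models, weighted by dataset size.
    total = sum(n for _, n in updates)
    global_w = sum(w * n for w, n in updates) / total

print(global_w)                  # close to true_w
```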


Conclusion

FL is a powerful technology that allows organizations and individuals to collaborate towards the same goal without sacrificing data privacy.

"Old wine in a new bottle? We can at least conclude that FL is a mixture of federated databases with distributed ML."

Yet, there are a few aspects that make FL unique:

  • In contrast to distributed ML, FL prevents organizations or individuals from accessing the data of other organizations/individuals.
  • FL is geo-distributed in essence, while distributed ML is an on-premise technology.
  • One of FL's main goals is safeguarding data privacy, whereas this is only a nice-to-have feature in federated databases. Distributed ML does not address it at all, because it assumes full control of the data.
  • While federated databases assume a relational data model, FL makes no assumption about the underlying data model.

About Databloom

Databloom is a software company that has developed Blossom Sky, a powerful AI-powered data platform integration-as-a-service. The platform enables users to unlock the full potential of their data by connecting data sources, enabling generative AI, and gaining performance by running data processing and AI directly at the independent data sources. Blossom Sky enables data collaboration, increased efficiency, and new insights by breaking data silos through a single, unified system view. The platform is designed to support and adapt to a wide variety of ML and AI algorithms and models.

References

[1] Alon Y. Halevy, Peter Norvig, Fernando Pereira: The Unreasonable Effectiveness of Data. IEEE Intell. Syst. 24(2): 8-12 (2009).
[2] Diego Peteiro-Barral, Bertha Guijarro-Berdiñas: A survey of methods for distributed machine learning. Prog. Artif. Intell. 2(1): 1-11 (2013).
[3] Brendan McMahan, Daniel Ramage: Federated Learning: Collaborative Machine Learning without Centralized Training Data. Google AI Blog. April 6, 2017.
[4] Dennis Heimbigner, Dennis McLeod: A Federated Architecture for Information Management. ACM Trans. Inf. Syst. 3(3): 253-278 (1985).
[5] General Data Protection Regulation (GDPR): https://gdpr-info.eu/
[6] California Consumer Privacy Act (CCPA): https://oag.ca.gov/privacy/ccpa


Get Started

Want to get started on your own? Apache Wayang is open source and ready for you to start building your federated data processing engine.
Get Apache Wayang