This is the second post of our Federated Learning (FL) series. In our previous post, we introduced FL as a distributed machine learning (ML) approach in which raw data never leaves the workers that hold it. We now take a deep dive into Databloom Blossom, a federated data lakehouse analytics framework that provides a solution for federated learning.
The research and industry communities have already produced multiple systems in the federated learning arena. TensorFlow Federated, Flower, and OpenFL are just a few examples. All these systems allow organizations and individuals (users) to deploy their ML tasks in a simple and federated way through a single system interface.
What Is the Problem?
Yet, several open problems remain untackled by these solutions, such as preserving data privacy, model debugging, reducing wall-clock training times, and reducing the trained model size. All of them are of equal importance. Among them, one is of crucial importance: supporting end-to-end pipelines. Currently, users must have good knowledge of several big data systems to create their end-to-end pipelines: they must master everything from data preparation techniques to ML algorithms. Furthermore, users must also have good coding skills to put all the pieces (systems) together in a single end-to-end pipeline. The FL setting only exacerbates the problem.
Databloom Blossom Overview
Databloom offers Blossom, a Federated Data Lakehouse Analytics platform that helps users build their end-to-end federated pipelines. Blossom covers the entire spectrum of analytics in end-to-end pipelines and executes them in a federated way. In particular, Blossom allows users to focus solely on the logic of their applications, instead of worrying about system, execution, and deployment details.
Figure 1 illustrates the general architecture of Blossom. Overall, Blossom comes with two simple interfaces for users to develop their pipelines: Python (FedPy) for data scientists and a graphical dashboard (FedUX) for users in general.
"Blossom enables users to easily develop their federated data analytics in a simple way for fast execution."
In more detail, users specify their pipelines using either of these two interfaces, and Blossom, in turn, runs them in a federated fashion using any cloud provider and data processing platform.
Listing 1 above shows a simple WordCount application in Blossom. The first three lines register the platforms to use in Blossom (a Java program and Spark in our example). The remaining lines of code are the actual WordCount program. The beauty of Blossom is that the user does not have to decide which data processing platform (Java or Spark) runs the program. Blossom decides the actual execution based on the characteristics of the input dataset and the processing platforms (such as the size of the input dataset and the Spark cluster size). It can do so via an AI-powered cross-platform optimizer and executor.
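To make the idea concrete: the WordCount logic itself is platform-agnostic. The self-contained Python sketch below mirrors the operator style of common dataflow APIs; it is purely illustrative and is not Blossom's actual FedPy interface.

```python
import re
from collections import Counter

def word_count(lines):
    """Platform-agnostic WordCount logic: written once, this same logical
    plan could be executed by a single-node backend or a Spark cluster."""
    tokens = (word.lower() for line in lines for word in re.findall(r"\w+", line))
    return Counter(tokens)

counts = word_count(["Blossom runs WordCount", "WordCount runs everywhere"])
# counts["wordcount"] == 2 and counts["runs"] == 2
```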
At its core, we find Apache Wayang, the first cross-platform data processing system. Blossom leverages and equips Apache Wayang with AI to unify and optimize heterogeneous (federated) data pipelines, as well as to select the right cloud provider and data processing platform to run the resulting federated data pipelines. As a result, users can seamlessly run general data analytics and AI together on any data processing platform. Blossom's optimizer mainly provides an intermediate representation between applications and processing platforms, which allows it to flexibly compose users' pipelines using multiple processing platforms. Besides translating users' pipelines to the underlying processing platforms, the optimizer decides the best way to execute a pipeline so as to improve runtime, as well as how to move data from one processing platform (or cloud provider) to another.
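To give a flavor of what a cost-based platform decision looks like, here is a toy Python cost model that picks between a single-node Java executor and a Spark cluster based on input size. The formulas and thresholds are purely illustrative assumptions; Blossom's actual optimizer is AI-powered and far more sophisticated.

```python
def choose_platform(input_size_mb, spark_cluster_nodes):
    """Toy cost model (illustrative only): small inputs run cheaper on a
    single-node Java executor; large inputs amortize Spark's fixed startup
    overhead across the cluster."""
    java_cost = input_size_mb                               # linear single-node cost
    spark_cost = 50 + input_size_mb / spark_cluster_nodes   # fixed overhead + parallel scan
    return "java" if java_cost <= spark_cost else "spark"

choose_platform(10, 8)    # small input  -> "java"
choose_platform(5000, 8)  # large input  -> "spark"
```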
Blossom also comes with a cloud-native executor that allows users to deploy their federated data analytics on any cloud provider and data processing platform. Users can choose their preferred cloud provider and data processing platform, or let Blossom select the best ones based on their time and monetary budgets. In both cases, Blossom deploys and executes users' federated pipelines on their behalf. More importantly, the executor takes care of any data transfer that must occur among cloud providers and data processing platforms: while the optimizer decides which data must be moved, the executor ensures that data moves efficiently among the different providers and platforms.
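A budget-constrained provider choice can be sketched in a few lines of Python. The offer structure, provider names, and numbers below are hypothetical; this is only a sketch of the kind of decision described above, not Blossom's selection logic.

```python
def pick_provider(offers, time_budget_s, money_budget_usd):
    """Toy selector: among hypothetical provider offers, keep those within
    both the time and money budgets and return the cheapest; None if no
    offer fits."""
    feasible = [o for o in offers
                if o["runtime_s"] <= time_budget_s and o["cost_usd"] <= money_budget_usd]
    return min(feasible, key=lambda o: o["cost_usd"], default=None)

offers = [
    {"provider": "cloud-a", "runtime_s": 120, "cost_usd": 4.0},
    {"provider": "cloud-b", "runtime_s": 300, "cost_usd": 1.5},
]
best = pick_provider(offers, time_budget_s=200, money_budget_usd=5.0)
# only "cloud-a" meets the 200 s deadline
```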
Blossom, a Federated Data Lakehouse Analytics Framework
Thanks to its design, optimizer, and executor, Blossom can provide a real federated data lakehouse analytics framework:
- Heterogeneous Data Sources. It can process data from (or over) multiple data sources in a seamless manner;
- Multi-Platform and Hybrid Cloud Execution. It automatically deploys each sub-part of a pipeline to the most relevant cloud provider and processing platform in a seamless manner to reduce costs and improve performance;
- Federated Machine Learning and AI. It comes with its own framework (including a parameter server) to run pipelines in a federated manner;
- Ease of Use. It allows users to focus on the logic of their applications by taking care of optimizing, deploying, and executing their pipelines.
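The parameter-server style of federated execution mentioned above can be illustrated with a minimal federated averaging (FedAvg) round in plain Python. This is a generic textbook sketch, not Blossom's implementation: each worker trains on its own local data, and only model weights travel to the server.

```python
def local_update(weights, data, lr=0.1):
    """Worker-side step: one gradient-descent step for a 1-D linear model
    y ~ w * x, using only this worker's local data."""
    w = weights
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad

def fedavg_round(global_w, workers):
    """Server-side step: average the workers' updated weights, weighted by
    their local dataset sizes. Raw data never leaves a worker."""
    updates = [(local_update(global_w, d), len(d)) for d in workers]
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total

# Two workers whose private datasets are both consistent with y = 2x.
workers = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]
w = 0.0
for _ in range(50):
    w = fedavg_round(w, workers)
# w converges toward 2.0
```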
TensorFlow Federated: https://www.tensorflow.org/federated
Flower: https://flower.dev/
OpenFL: https://www.openfl.org/
Apache Wayang: https://wayang.apache.org/