Deep tech dive into Scalytics and Wayang

April 12, 2022
Dr. Jorge Quiané

Machine Learning (ML) is a branch of Artificial Intelligence (AI). The main idea of ML is to enable systems to learn from historical data in order to predict output values for new inputs, without being explicitly programmed for the task. With the growing volumes of data in today’s world, ML has gained unprecedented popularity; we achieve today what was unimaginable yesterday: from predicting cancer risk out of mammogram images and patient risk data to polyglot AI translators, for example. As a result, ML has become a key competitive differentiator for many companies, and ML-powered software has quickly become omnipresent in our lives. At the core of ML lies its dependence on data: the more data is available, the more accurate the predictive models that ML builds.

The Appearance of Distributed ML

While ML has become quite a powerful technology, its hunger for training data makes it hard to build ML models on a single machine. It is not unusual to see training data sizes in the order of hundreds of gigabytes to terabytes, such as in the Earth Observation domain. This has created the need to build ML models over distributed data, stored on multiple nodes across the globe.

Distributed ML trains models across multiple compute nodes to cope with larger training datasets as well as to improve performance and model accuracy [1, 2]. Thus, distributed ML enables organizations and individuals to draw meaningful conclusions from vast amounts of distributed training data. Healthcare, enterprise data analytics, and advertising are among the sectors that benefit most from distributed ML.

There are two fundamental ways to perform distributed ML: data parallelism and model parallelism.

Data parallelism and AI

In the data parallelism approach, the system horizontally partitions the input training data; usually, it creates as many partitions as there are available compute nodes (workers) and distributes each partition to a different worker. It then sends the same model (i.e., all model features) to each worker, which, in turn, learns a local model using its data partition as input. The workers send their local models to a central location, where the system merges them into a single global model.
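To make the idea concrete, here is a minimal, self-contained sketch in plain Java (no framework; the class names and the toy "update rule" are purely illustrative): the rows are horizontally partitioned, each worker learns a local weight vector on its partition in parallel, and the local models are merged by simple averaging at a central location.

```java
import java.util.List;
import java.util.stream.IntStream;

// Illustrative sketch of data parallelism: partition the rows, train one
// local model per partition in parallel, then average the local weights.
public class DataParallelTraining {

    // Stand-in for a real local learner (e.g., SGD over the partition):
    // here each "model" is just the mean of the partition's feature rows.
    static double[] trainLocal(List<double[]> partition, int numFeatures) {
        double[] w = new double[numFeatures];
        for (double[] row : partition)
            for (int j = 0; j < numFeatures; j++)
                w[j] += row[j] / partition.size();
        return w;
    }

    public static void main(String[] args) {
        int numWorkers = 4, numFeatures = 3;
        // Toy training data: 8 rows of 3 features each.
        List<double[]> data = List.of(
                new double[]{1, 2, 3}, new double[]{2, 3, 4},
                new double[]{3, 4, 5}, new double[]{4, 5, 6},
                new double[]{5, 6, 7}, new double[]{6, 7, 8},
                new double[]{7, 8, 9}, new double[]{8, 9, 10});
        int chunk = (data.size() + numWorkers - 1) / numWorkers;

        // Each worker trains on its own horizontal partition, in parallel.
        List<double[]> localModels = IntStream.range(0, numWorkers).parallel()
                .mapToObj(w -> data.subList(w * chunk,
                        Math.min((w + 1) * chunk, data.size())))
                .map(p -> trainLocal(p, numFeatures))
                .toList();

        // Central merge: average the local weight vectors into a global model.
        double[] global = new double[numFeatures];
        for (double[] local : localModels)
            for (int j = 0; j < numFeatures; j++)
                global[j] += local[j] / localModels.size();

        System.out.println(java.util.Arrays.toString(global));
    }
}
```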

The model parallelism approach, in contrast to data parallelism, partitions the model features and sends each model partition to a different worker, which in turn trains its part of the model using the same input data. That is, the entire input training data is replicated across all workers. The system then brings these partial models to a central location and combines them into a single global model.
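And a matching sketch for model parallelism, again with purely illustrative names and update rule: the full dataset is visible to every worker, but each worker trains only its own slice of the feature weights, and the partial models are concatenated rather than averaged.

```java
import java.util.List;
import java.util.stream.IntStream;

// Illustrative sketch of model parallelism: every worker sees the FULL
// data, but each one owns only a slice of the model's features. The
// partial models are concatenated at the central location.
public class ModelParallelTraining {

    public static void main(String[] args) {
        int numWorkers = 2, numFeatures = 4;
        // The full training data is replicated across all workers.
        List<double[]> data = List.of(
                new double[]{1, 2, 3, 4},
                new double[]{5, 6, 7, 8});
        int slice = numFeatures / numWorkers; // features per worker

        double[] global = new double[numFeatures];
        IntStream.range(0, numWorkers).parallel().forEach(w -> {
            // Worker w trains only the weights for its feature slice
            // [w*slice, (w+1)*slice), using the full replicated dataset.
            for (int j = w * slice; j < (w + 1) * slice; j++)
                for (double[] row : data)
                    global[j] += row[j] / data.size(); // stand-in update rule
        });

        System.out.println(java.util.Arrays.toString(global));
    }
}
```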

Yet, although powerful, distributed ML rests on a core assumption that limits its applicability: that one can access and control the entire training data.

However, in an increasingly large number of cases, one cannot access the raw data directly; in such cases, for example in the healthcare domain, distributed ML simply cannot be applied.

The Emergence of Federated Learning

The concept of federated learning (FL) was first introduced by Google in 2017 [3]. Yet, the concept of federated analytics and databases dates back to the 1980s [4]. Similar to federated databases, FL aims to bring the computation to where the data is.

Federated learning is, at its core, a distributed ML approach. Unlike traditional distributed ML, however, the raw data residing at the different workers never leaves them. The workers own the data, and they are the only ones with control over and direct access to it. Generally speaking, FL allows models to gain experience from a more diverse set of datasets held at independent, autonomous locations.

Ensuring data privacy is crucial in today’s world; societal awareness of data privacy is rising as one of the main concerns of our increasingly data-driven lives. Many governments have enacted laws, such as the GDPR [5] and the CCPA [6], to control how data is stored and processed. FL enables organizations and individuals to train ML models across multiple autonomous parties without compromising data privacy, since the sole owner of the data remains the node on which it is stored. During training, participants collaborate on a model over their sensitive data but communicate only their learned local models with one another. Thus, they can leverage others’ data to learn more robust ML models than they could from their own data alone.
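A minimal sketch of one such training round, in the style of FedAvg [3] (plain Java; the class names and numbers are illustrative): each participant reports only its local weights and sample count, and the server aggregates them into a global model, weighting each party by how much data it holds.

```java
import java.util.List;

// Sketch of one federated learning round in the style of FedAvg [3]:
// each participant trains on its private data and shares ONLY its local
// weights plus its sample count; the server never sees raw data.
public class FederatedRound {

    record LocalUpdate(double[] weights, long numSamples) {}

    // Server-side aggregation: weighted average of the local models.
    static double[] aggregate(List<LocalUpdate> updates, int numFeatures) {
        long total = updates.stream().mapToLong(LocalUpdate::numSamples).sum();
        double[] global = new double[numFeatures];
        for (LocalUpdate u : updates)
            for (int j = 0; j < numFeatures; j++)
                global[j] += u.weights()[j] * u.numSamples() / (double) total;
        return global;
    }

    public static void main(String[] args) {
        // Two hospitals, say, each reporting only (weights, sample count).
        List<LocalUpdate> updates = List.of(
                new LocalUpdate(new double[]{0.2, 0.8}, 1_000),
                new LocalUpdate(new double[]{0.6, 0.4}, 3_000));
        // Expected: 0.25*{0.2,0.8} + 0.75*{0.6,0.4} = {0.5, 0.5}
        System.out.println(java.util.Arrays.toString(aggregate(updates, 2)));
    }
}
```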

The beauty of FL is that it enables organizations and individuals to collaborate towards a common goal without sacrificing data privacy.

FL also leverages the two fundamental execution modes to build models across multiple participants: horizontal FL (the counterpart of data parallelism) and vertical FL (the counterpart of model parallelism).

What are the problems today?

The research and industry communities have already started to provide multiple systems in the arena of federated learning. TensorFlow Federated [7], Flower [8], and OpenFL [9] are just a few examples of such systems. All these systems allow organizations and individuals (users) to deploy their ML tasks in a simple and federated way using a single system interface.

Yet, there are still several open problems that these solutions do not tackle, such as preserving data privacy, model debugging, reducing wall-clock training times, and reducing the size of the trained model. All are equally important. Among them, however, one is of crucial importance: supporting end-to-end pipelines.

Currently, users must know several big data systems well to create their end-to-end pipelines: they need decent technical knowledge of everything from data preparation techniques to ML algorithms. Furthermore, they need solid coding skills to glue all the pieces (systems) together into an end-to-end pipeline. Federated learning only exacerbates this problem.

How does Blossom Sky solve these problems?

We have built Blossom Sky, a federated AI and data analytics platform, to help users build their end-to-end federated pipelines. Blossom covers the entire spectrum of analytics in end-to-end pipelines and executes them in a federated way. Blossom, in particular, allows users to concentrate solely on the logic of their applications rather than worrying about system, execution, and deployment details.

Overall, Blossom comes with two simple interfaces for users to develop their pipelines: Python (FedPy) for data scientists and a graphical dashboard (FedUX) for users in general.

The internal architecture of Blossom Sky

Blossom gives users the means to develop their federated data analytics in a simple way and to execute them fast.

In more detail, users specify their pipelines using either of these two interfaces. Blossom, in turn, runs the created pipelines in a federated fashion on any data processing platform available to the user, be it in any cloud or hybrid constellation.

Code a pipeline with Blossom Sky

The example shows the simple WordCount application in Wayang. The intelligence of Blossom Sky, with Wayang at its core, lets the user focus on the data task rather than on the technical decision of which data processing platform each single step should run on (Java Streams or Spark, in our case). Blossom decides the actual execution based on the characteristics of the input datasets and processing platforms (such as the size of the input dataset and the Spark cluster size). It does so via an AI-powered cross-platform optimizer and executor.
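For reference, a WordCount of this shape looks roughly as follows in Apache Wayang’s Java API (adapted from the Wayang documentation; package names and method signatures may differ slightly across Wayang versions). Registering both the Java Streams and Spark plugins is exactly what gives the optimizer the freedom to place each step:

```java
import java.util.Arrays;
import java.util.Collection;

import org.apache.wayang.api.JavaPlanBuilder;
import org.apache.wayang.basic.data.Tuple2;
import org.apache.wayang.core.api.Configuration;
import org.apache.wayang.core.api.WayangContext;
import org.apache.wayang.java.Java;
import org.apache.wayang.spark.Spark;

public class WordCount {
    public static void main(String[] args) {
        // Register both platforms; the optimizer decides where each step runs.
        WayangContext wayangContext = new WayangContext(new Configuration())
                .withPlugin(Java.basicPlugin())
                .withPlugin(Spark.basicPlugin());

        Collection<Tuple2<String, Integer>> wordCounts = new JavaPlanBuilder(wayangContext)
                .withJobName("WordCount")
                .withUdfJarOf(WordCount.class)
                // Read the input text file.
                .readTextFile("file:/tmp/input.txt")
                // Split each line into words.
                .flatMap(line -> Arrays.asList(line.split("\\W+")))
                // Drop empty tokens.
                .filter(token -> !token.isEmpty())
                // Attach a counter to each word.
                .map(word -> new Tuple2<>(word.toLowerCase(), 1))
                // Sum up the counters per word.
                .reduceByKey(Tuple2::getField0,
                        (t1, t2) -> new Tuple2<>(t1.getField0(), t1.getField1() + t2.getField1()))
                // Execute the plan and collect the results.
                .collect();

        System.out.println(wordCounts);
    }
}
```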

Blossom Sky is an AI-powered data processing and query optimizer

At Blossom Sky’s core, we use Apache Wayang [10], the first cross-platform data processing system. Blossom leverages Apache Wayang and equips it with AI to unify and optimize heterogeneous (federated) data pipelines. Additionally, the system is able to select the right cloud provider and data processing platform to run the resulting federated data pipelines. As a result, users can seamlessly run general data analytics and AI together on any data processing platform.

Blossom Sky’s optimizer essentially provides an intermediate representation between applications and processing platforms, which allows it to flexibly compose the users’ pipelines out of multiple processing platforms. Besides translating the users’ pipelines to the underlying processing platforms, the optimizer decides how best to execute a pipeline so as to improve runtime, as well as how to move data from one processing platform (or cloud provider) to another.
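To give a flavor of what such a decision looks like, here is a purely illustrative sketch (not Wayang’s actual internals; all names and cost numbers are made up): each step in the intermediate representation carries per-platform cost estimates, and a simple greedy pass picks the cheapest platform per step while charging a penalty whenever data must cross platform boundaries.

```java
import java.util.List;
import java.util.Map;

// Purely illustrative sketch (NOT Wayang's actual internals) of the idea
// behind a cross-platform optimizer: each logical step carries cost
// estimates per platform; a greedy pass picks the cheapest platform per
// step, penalizing switches that require moving data between platforms.
public class CrossPlatformChoice {

    record Step(String name, Map<String, Double> costPerPlatform) {}

    static final double TRANSFER_PENALTY = 5.0; // hypothetical constant

    public static void main(String[] args) {
        List<Step> pipeline = List.of(
                new Step("parse",     Map.of("JavaStreams", 1.0, "Spark", 4.0)),
                new Step("join",      Map.of("JavaStreams", 9.0, "Spark", 2.0)),
                new Step("aggregate", Map.of("JavaStreams", 8.0, "Spark", 1.5)));

        String previous = null;
        double totalCost = 0;
        for (Step step : pipeline) {
            String best = null;
            double bestCost = Double.MAX_VALUE;
            for (var entry : step.costPerPlatform().entrySet()) {
                // Charge a data-movement penalty when switching platforms.
                double cost = entry.getValue()
                        + (previous != null && !entry.getKey().equals(previous)
                           ? TRANSFER_PENALTY : 0);
                if (cost < bestCost) { bestCost = cost; best = entry.getKey(); }
            }
            totalCost += bestCost;
            System.out.printf("%s -> %s (cost %.1f)%n", step.name(), best, bestCost);
            previous = best;
        }
        System.out.printf("total cost: %.1f%n", totalCost);
    }
}
```

A production optimizer would of course search the whole plan space (and factor in dataset sizes, cluster sizes, and monetary budgets) rather than choosing greedily step by step.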

Blossom Sky’s Cross-Platform Executor

Blossom Sky also comes with a cloud-native cross-platform executor, which allows users to deploy their federated data analytics on any cloud provider and data processing platform available. They can choose their preferred cloud provider and data processing platform or let Blossom select the best cloud provider and data processing platform based on their time and monetary budget. In both cases, Blossom deploys and executes users’ federated pipelines on their behalf.

More importantly, the executor takes care of any data transfer that must occur among cloud providers and data processing platforms. While the optimizer decides which data must be moved, the executor ensures the efficient movement of the data among different providers and data processing platforms.

Conclusion

Thanks to its design, optimizer, and executor, Blossom provides a true federated data lakehouse analytics framework from the ground up:

  • Heterogeneous data sources: Blossom processes data from (or over) multiple data sources in a seamless manner.
  • Multi-platform and hybrid cloud execution: Blossom automatically deploys each sub-part of a pipeline to the most suitable cloud provider and processing platform to reduce costs and improve performance.
  • Federated machine learning and AI: Blossom comes with its own framework (including a parameter server) to run pipelines in a federated manner.
  • Ease of use: Blossom lets users focus on the logic of their applications by taking care of how to optimize, deploy, and execute their pipelines.

References

[1] Alon Y. Halevy, Peter Norvig, Fernando Pereira: The Unreasonable Effectiveness of Data. IEEE Intell. Syst. 24(2): 8-12 (2009).
[2] Diego Peteiro-Barral, Bertha Guijarro-Berdiñas: A survey of methods for distributed machine learning. Prog. Artif. Intell. 2(1): 1-11 (2013).
[3] Brendan McMahan, Daniel Ramage: Federated Learning: Collaborative Machine Learning without Centralized Training Data. Google AI Blog. April 6, 2017.
[4] Dennis Heimbigner, Dennis McLeod: A Federated Architecture for Information Management. ACM Trans. Inf. Syst. 3(3): 253-278 (1985).
[5] General Data Protection Regulation (GDPR): https://gdpr-info.eu/
[6] California Consumer Privacy Act (CCPA): https://oag.ca.gov/privacy/ccpa
[7] TensorFlow Federated: https://www.tensorflow.org/federated
[8] Flower: https://flower.dev/
[9] OpenFL: https://www.openfl.org/
[10] Apache Wayang: https://wayang.apache.org/

About Scalytics

Most current ETL solutions hinder AI innovation due to their increasing complexity, slow speed, lack of intelligence, poor platform integration, and scalability limitations. Scalytics Connect, the next-generation ELT (Extract, Load, and Transform) platform, unleashes your potential by enabling efficient data platform integration, intelligent data pipelines, unmatched data processing speed, and real-time data transformation.

We enable you to make data-driven decisions in minutes, not days
Scalytics Connect delivers unmatched flexibility, seamless integration with all your AI and data tools, and an easy-to-use platform that frees you to focus on building high-performance data architectures to fuel your AI innovation.
Scalytics is powered by Apache Wayang, and we're proud to support the project. You can check out the project's public GitHub repo. If you're enjoying our software, show your love and support - a star ⭐ would mean a lot!

If you need professional support from our team of industry leading experts, you can always reach out to us via Slack or Email.