Machine Learning for X and X for Machine Learning

Dr. Zoi Kaoudi
June 15, 2022

After the recent advances in Artificial Intelligence (AI), and especially in Machine Learning (ML) and Deep Learning (DL)*, various other computer science fields have entered a race to "blend" their existing methods with ML/DL. Such a blend can go in two directions: either using ML to advance the field, or using the methods developed in the field to improve ML. A commonly used slogan when combining ML with a computer science field is "ML for X or X for ML", where X can be, for instance, any of {databases, systems, reasoning, semantic web}.

In this blog post, we focus on cases where X = big data management. Works on ML for data management and on data management for ML have been appearing for several years now. Both directions have had a great impact in academia, with dedicated new conferences popping up, as well as in industry, with several companies working on either improving their technology with ML or providing scalable and efficient solutions for ML.

Databloom.ai is one of the first companies to embrace both directions within a single product. Within Databloom's product, Blossom, we utilize ML to improve the optimizer and thus provide better performance to users, and we also utilize well-known big data management techniques to speed up federated learning.

Machine Learning for Big Data Management and Big Data Management for Machine Learning in databloom.ai

ML for Data Management

In a previous blog post we discussed how we plan to incorporate ML into Blossom's optimizer. Having an ML model to predict the runtime of an execution plan can have several benefits. First, system performance can improve sharply because the optimizer is able to find very efficient plans. For example, for k-means, an ML-based optimizer [1] outputs very efficient plans by finding the right platform combination and achieves 7x better runtime performance than a highly-tuned cost-based optimizer! Second, the hard work of manually tuning the optimizer's cost model disappears: given collected training data (plans and their runtimes), you can train an ML model instead.
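To make the idea of a learned runtime model concrete, here is a minimal sketch in Python. It is not the model used in Blossom or in [1]: the feature encoding, the training data, and the nearest-neighbour predictor are all assumptions chosen only to show the shape of the approach, namely learning from (plan, runtime) pairs and picking the candidate plan with the lowest predicted runtime.

```python
from math import dist

# Hypothetical training data: (plan features, observed runtime in seconds).
# Assumed feature encoding, for illustration only:
#   (operators on platform A, operators on platform B, input size in GB)
TRAINING = [
    ((4, 0, 1.0), 12.0),
    ((0, 4, 1.0), 30.0),
    ((4, 0, 10.0), 95.0),
    ((0, 4, 10.0), 48.0),
    ((2, 2, 10.0), 40.0),
]


def predict_runtime(features, training=TRAINING, k=1):
    """Predict a runtime as the mean runtime of the k nearest training plans."""
    nearest = sorted(training, key=lambda row: dist(row[0], features))[:k]
    return sum(runtime for _, runtime in nearest) / len(nearest)


def choose_plan(candidates, training=TRAINING):
    """Return the candidate plan with the lowest predicted runtime."""
    return min(candidates, key=lambda f: predict_runtime(f, training))


if __name__ == "__main__":
    # Three hypothetical plans for the same 9 GB query.
    candidates = [(4, 0, 9.0), (0, 4, 9.0), (2, 2, 9.0)]
    print(choose_plan(candidates))
```

No cost formula is tuned by hand here: adding more observed (plan, runtime) pairs is the only way the predictor changes. A real optimizer would use a richer plan encoding and a stronger model.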


Data Management for ML

In an earlier blog post we discussed how Blossom, a federated data lakehouse analytics framework, can provide an efficient solution for federated learning following principles from data management. To enable federated learning, we are working towards a parameter server architecture. However, simplistic parameter servers are very inefficient due to excessive network communication (i.e., a large number of messages sent through the network), so we utilize data management techniques such as increasing locality, enabling latency hiding, and exploiting bulk messaging. The database community has already exploited such optimizations and shown that they can lead to training times that are more than an order of magnitude faster.
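Of the three techniques, bulk messaging is the easiest to illustrate. The sketch below is a toy, in-process simulation and not Blossom's parameter server: `ParameterServer`, `BulkingWorker`, and `batch_size` are hypothetical names, and each call to `push` stands in for one network message. The worker combines updates locally and ships them together, so the server ends up in the same state after far fewer messages.

```python
from collections import defaultdict


class ParameterServer:
    """Toy in-process stand-in for a remote parameter server (hypothetical)."""

    def __init__(self):
        self.params = defaultdict(float)
        self.messages_received = 0

    def push(self, updates):
        """Apply a dict of {key: delta}; each call models one network message."""
        self.messages_received += 1
        for key, delta in updates.items():
            self.params[key] += delta


class BulkingWorker:
    """Buffers parameter updates locally and sends them to the server in bulk."""

    def __init__(self, server, batch_size):
        self.server = server
        self.batch_size = batch_size
        self.buffer = defaultdict(float)
        self.pending = 0

    def update(self, key, delta):
        self.buffer[key] += delta  # combine updates to the same key locally
        self.pending += 1
        if self.pending >= self.batch_size:
            self.flush()

    def flush(self):
        """Send all buffered updates as a single message, if there are any."""
        if self.buffer:
            self.server.push(dict(self.buffer))
            self.buffer.clear()
            self.pending = 0


if __name__ == "__main__":
    naive, bulked = ParameterServer(), ParameterServer()
    worker = BulkingWorker(bulked, batch_size=25)
    for step in range(100):
        naive.push({step % 10: 1.0})   # one message per update
        worker.update(step % 10, 1.0)  # buffered, sent every 25 updates
    worker.flush()
    print(naive.messages_received, bulked.messages_received)
```

The trade-off is freshness: buffered updates reach the server later, so the batch size bounds how stale the server's view of a parameter can become.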

About Databloom

Databloom is a software company that has developed Blossom Sky, an AI-powered Data Platform Integration as a Service platform. The platform enables users to unlock the full potential of their data by connecting data sources, enabling generative AI, and gaining performance by running data processing and AI directly at independent data sources. Blossom Sky allows for data collaboration, increased efficiency, and new insights by breaking data silos in a unified manner through a single system view. It supports a wide range of ML and AI algorithms and models.

References

[1] Zoi Kaoudi, Jorge-Arnulfo Quiané-Ruiz, Bertty Contreras-Rojas, Rodrigo Pardo-Meza, Anis Troudi, Sanjay Chawla: ML-based Cross-Platform Query Optimization. ICDE 2020: 1489-1500.
[2] Alexander Renz-Wieland, Rainer Gemulla, Zoi Kaoudi, Volker Markl: NuPS: A Parameter Server for Machine Learning with Non-Uniform Parameter Access. SIGMOD Conference 2022: 481-495.

* We use the term ML to refer to both Machine Learning and Deep Learning.
** Icons appearing in the figure were created by Creative Stall Premium - Flaticon

