Decentralized data processing in new dimensions.

Perform data operations, machine learning, and AI training directly at the source, on the edge, and in your favorite framework, all at the same time. Blossom gives you the toolset to become a leading privacy-preserving, data-driven enterprise.

Blossom AI Platform
Databloom AI is on a mission to help data teams solve their problems and to empower businesses with truly decentralized data infrastructures and homogenized data, processed intelligently where it resides. Databloom is one of the data mesh pioneers and brings this technology to users, customers, and partners, helping them unlock untapped opportunities in data-driven economies.
With Blossom Sky, Databloom provides a secure, compliance-ready data mesh framework. The founders of Databloom are experts in Big Data and distributed data processing, and serve on the project committee of the Apache Wayang project, the API for cross-platform data processing.
Blossom Sky is a next-gen data mesh controller
Blossom works with all major data processing and data streaming frameworks and AI systems. We support JDBC as well as PostgreSQL and are working on a native high-performance SQL layer called SQL everywhere. Get started without hassle:
> docker pull

Research behind our technology

Research conducted to develop our data mesh related technology

ML-based Cross-Platform Query Optimization

ICDE 2020

Cost-based optimization is widely known to suffer from a major weakness: administrators spend a significant amount of time tuning the associated cost models. This problem is only exacerbated in cross-platform settings, as there are many more parameters that need to be tuned.

RHEEMix in the Data Jungle

VLDB Journal 2020

Data analytics are moving beyond the limits of a single platform. In this paper, we present the cost-based optimizer of Rheem, an open-source cross-platform system that copes with these new requirements.

Optimizing Cross-Platform Data Movements

ICDE 2019

Data analytics are moving beyond the limits of a single data processing platform. A cross-platform query optimizer is necessary to enable applications to run their tasks over multiple platforms efficiently and in a platform-agnostic manner.

Simplified Big Data Debugging for Dataflow Jobs

SoCC 2019

Although big data processing has become dramatically easier over the last decade, there has not been matching progress over big data debugging. It is estimated that users spend more than 50% of their time debugging their big data applications.

Enabling Cross-Platform Data Processing

VLDB 2018

Solving business problems increasingly requires going beyond the limits of a single data processing platform (platform for short), such as Hadoop or a DBMS. As a result, organizations typically perform tedious and costly tasks to juggle their code and data across different platforms.

Cross-Platform Data Analytics Made Easy

ICDE 2018

Many of today’s applications need several data processing platforms for complex analytics. Thus, recent systems have taken steps towards supporting cross-platform data analytics. Yet, current cross-platform systems lack ease of use, which is crucial for their adoption.

Cross-Platform Data Processing: Use Cases and Challenges

ICDE 2018

There is a zoo of data processing platforms which help users and organizations to extract value out of their data. Although each of these platforms excels in specific aspects, users typically end up running their data analytics on suboptimal platforms.

Building your Cross-Platform Application with RHEEM

CoRR 2018

Today, organizations typically perform tedious and costly tasks to juggle their code and data across different data processing platforms. Addressing this pain and achieving automatic cross-platform data processing is challenging because it requires considerable expertise in all the available data processing platforms.

Fast and scalable inequality joins

VLDB Journal 2017

Inequality joins, which join relations on inequality conditions, are used in various applications. Optimizing joins has been the subject of intensive research, ranging from efficient join algorithms such as sort-merge join to the use of efficient indices such as B+-tree, R*-tree and Bitmap.
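
To make the term concrete, here is a minimal sketch of what an inequality join computes. It uses a naive nested loop with quadratic cost, which is not the algorithm from the paper; the `(id, start, end)` row layout and the function name `inequality_join` are assumptions made only for this illustration.

```python
from typing import List, Tuple

# Hypothetical row layout for this illustration: (id, start, end)
Row = Tuple[str, int, int]


def inequality_join(left: List[Row], right: List[Row]) -> List[Tuple[str, str]]:
    """Naive nested-loop join on: left.start < right.start AND left.end > right.end."""
    matches = []
    for l_id, l_start, l_end in left:
        for r_id, r_start, r_end in right:
            # Both join predicates are inequalities, so a hash join cannot be used.
            if l_start < r_start and l_end > r_end:
                matches.append((l_id, r_id))
    return matches


events = [("a", 1, 10), ("b", 3, 5), ("c", 4, 12)]

# Self-join: find pairs where the first interval strictly contains the second.
pairs = inequality_join(events, events)
print(pairs)
```

Every pair of rows is compared, which is the cost that specialized inequality-join algorithms aim to avoid on large inputs.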

A Cost-based Optimizer for Gradient Descent Optimization


As the use of machine learning (ML) permeates into diverse application domains, there is an urgent need to support a declarative framework for ML. Ideally, a user will specify an ML task in a high-level and easy-to-use language and the framework will invoke the appropriate algorithms and system configurations to execute it.

Road to Freedom in Big Data Analytics

EDBT 2016

The world is fast moving towards a data-driven society where data is the most valuable asset. Organizations need to perform very diverse analytic tasks using various data processing platforms. In doing so, they face many challenges; chiefly, platform dependence, poor interoperability, and poor performance when using multiple platforms.

Enabling Multi-Platform Task Execution


Many emerging applications, from domains such as healthcare and oil & gas, require several data processing systems for complex analytics. This demo paper showcases Rheem, a framework that provides multi-platform task execution for such applications.

BigDansing: A System for Big Data Cleansing


Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big datasets. This presents a serious impediment since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions.

Lightning Fast and Space Efficient Inequality Joins

VLDB 2015

Inequality joins, which join relational tables on inequality conditions, are used in various applications. There has been a wide range of optimization methods for joins in database systems, from algorithms such as sort-merge join and band join to various indices such as B+-tree, R*-tree and Bitmap.

Talks and videos

Flower Summit '22

We talked about democratizing AI and ML with Blossom and the Flower framework (pip install flwr). Watch this talk to learn how we built a next-gen data mesh for users who do not have the resources of the FAANG tech giants.

Apache Con@Home '21

We are living in a data deluge era, where data is being generated by a large number of sources. Nowadays, a large number of different devices are generating data at an unprecedented scale: smartphones, smartwatches, embedded sensors in cars, smart homes, wearable technology, just to mention a few.

BOSS Conference '21

Tutorial: Apache Wayang: A Big Data Cross-Platform System

Apache Wayang is a cross-platform processing engine, the world’s leading federated data processing platform. In this workshop, the Apache Wayang team presents the science and technology behind it.

RedisConf '21

To perform time-series forecasting, Scalytics Lumina first preprocesses the data to bring it into the right form and then runs an AI algorithm. Apache Wayang (incubating), the smart open-source technology behind Scalytics Lumina, is responsible for this step.

Spark Summit

We are witnessing a proliferation of big data, which has led to a zoo of data processing systems, each providing a different set of features. For example, Spark provides scalability for analytic tasks, but Java 8 Streams provide low latency.

Get Started

Want to get started on your own? Apache Wayang is open source and ready for you to start building your federated data processing engine.
Get Apache Wayang
Apache Wayang