Machine Learning (ML) has not only become omnipresent in our everyday lives (with self-driving cars, digital personal assistants, chatbots, etc.) but has also started spreading to our core technological systems, such as databases and operating systems. In the area of databases, there is a large body of work aiming at optimizing data management components, from index building and knob tuning to query optimization. In query optimization alone, ML has been used to replace many optimizer components, such as cardinality estimation, the cost model, and join enumeration. In this blog post, we focus on using an ML model in place of the cost model and go from traditional cost-based query optimization to the newly proposed ML-based query optimization.
ML-based query optimization
Generally speaking, given a user query, a query optimizer finds the best way to execute the query (i.e., the execution plan) so that its runtime is minimized. For example, Databloom Blossom's query optimizer is responsible for figuring out which platform combination executes a given query with the lowest runtime. To do so, traditional cost-based optimizers use a cost model (a set of mathematical formulas) that captures the cost of executing a plan, and an enumeration algorithm that searches through different plans until it finds the one with the smallest cost. Going from a cost-based query optimizer to an ML-based one, the cost model is replaced with an ML model (usually a regression model) that outputs an estimate of the runtime of a plan [1]. The enumeration algorithm then simply invokes the ML model while searching different plans to determine the one with the lowest estimated runtime.
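To make this concrete, here is a minimal sketch of an enumeration loop that consults a learned runtime model instead of a cost formula. The plan representation, features, and training data are made up for illustration; this is not Blossom's actual optimizer.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy plan representation: a list of operator names.
OPERATORS = ["scan", "filter", "join", "groupby", "sort"]

def featurize(plan):
    # Encode a plan as operator counts plus its length; a real system would
    # use much richer features (cardinalities, data sizes, target platform).
    return [plan.count(op) for op in OPERATORS] + [len(plan)]

# Pretend we already collected labeled training data: (plan, runtime) pairs.
train_plans = [["scan", "filter"], ["scan", "join", "groupby"],
               ["scan", "scan", "join", "sort"], ["scan", "groupby", "sort"]]
train_runtimes = [1.2, 5.4, 9.8, 4.1]  # seconds

runtime_model = RandomForestRegressor(random_state=0)
runtime_model.fit([featurize(p) for p in train_plans], train_runtimes)

def pick_best_plan(candidate_plans):
    # The enumeration stays the same; only the cost model is replaced.
    estimates = runtime_model.predict(np.array([featurize(p) for p in candidate_plans]))
    return candidate_plans[int(np.argmin(estimates))]

print(pick_best_plan([["scan", "filter", "join"], ["scan", "join", "filter"]]))
```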
The hurdle of training data collection
It is well known by now that ML models can only be as good as their training data. The same holds for ML-based query optimization:
" The effectiveness of an ML-based query optimizer highly depends on the quantity and quality of training data as well as the availability of valuable ground-truth labels. "
In this case, a training dataset consists of a large set of query plans together with their runtimes, which serve as labels. Training data collection is a bottleneck as it requires (i) gathering thousands of heterogeneous query plans and (ii) executing all of them to obtain their runtimes. The latter is very time-consuming, as it entails executing not only a large number of plans but also many sub-optimal ones. To get a better idea of the amount of time required to collect these labels, look at the two figures below. Collecting labels for only 500 OLAP plans with input data of about 1TB on our cluster of four quad-core nodes takes almost 10 days, while executing 10,000 plans on just 1GB of data requires a bit more than 4 days. If we extrapolate linearly, 10,000 plans on 1TB of data would require around 200 days, i.e., more than 6 months!

Executing 10,000 plans on 1TB of data to collect their labels would require more than 6 months!
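As a back-of-envelope check, assuming labeling time scales linearly with the number of plans:

```python
# Numbers from the figures above; assumes labeling time scales linearly
# with the number of plans.
days_for_500_plans_on_1tb = 10
days_for_10000_plans_on_1tb = days_for_500_plans_on_1tb * (10_000 / 500)
print(days_for_10000_plans_on_1tb)        # 200.0 days
print(days_for_10000_plans_on_1tb / 30)   # ≈ 6.7 months
```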
Even if logs from previously executed queries are available, the plans in these logs are the ones the optimizer chose to execute and, thus, most of them are (near-)optimal. Training a model on only (near-)optimal plans would lead to a biased model.
DataFarm: A training data generator for ML-based query optimizers
DataFarm [2] is a framework for efficiently generating training data (query plans together with their execution runtimes).
" DataFarm enables users to reduce the cost of getting labeled query workloads by 54× compared to standard manual approaches. "
It is based on a data-driven, white-box approach. A user inputs a typically very small set of query plans (e.g., 10), and DataFarm augments it to arrive at thousands of plans, attaching a label (runtime) with an uncertainty value to each generated plan. The figure below shows an overview of DataFarm.

The Abstract Plan Generator learns patterns from the input query workload as Markov chains and generates new heterogeneous abstract plans (plans without specific UDF values) exploiting the real operators’ distributions. These plans follow the patterns found in the input workload. For example, if in the input workload a group-by operator precedes a sort operator in most cases, the same will be true in the generated plans. The abstract plans generated at this phase cannot be executed yet, because their operators specify neither user-defined values, such as the join key or the selection predicate, nor the platform on which they have to run.
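The following is a minimal sketch of this idea: learn first-order operator-transition probabilities from the input plans and sample new abstract plans from them. The toy plans and plain-Python structure are illustrative, not DataFarm's actual implementation.

```python
import random

# Learn first-order operator-transition frequencies from the input plans.
input_plans = [
    ["scan", "filter", "groupby", "sort"],
    ["scan", "join", "groupby", "sort"],
    ["scan", "filter", "join", "sort"],
]

transitions = {}
for plan in input_plans:
    ops = ["START"] + plan + ["END"]
    for a, b in zip(ops, ops[1:]):
        transitions.setdefault(a, []).append(b)

def sample_abstract_plan(max_len=10):
    # Sampling uniformly from the observed successors reproduces the
    # empirical transition distribution of the input workload.
    plan, current = [], "START"
    while len(plan) < max_len:
        current = random.choice(transitions[current])
        if current == "END":
            break
        plan.append(current)
    return plan  # operators only: no join keys, predicates, or platform yet

print(sample_abstract_plan())
```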
The Plan Instantiator then receives the generated abstract plans and creates an augmented set of executable plans by instantiating different variants of each abstract plan, e.g., by setting different selection predicates. To achieve this, it exploits metadata about the user’s input data. This is crucial for the generated plans to be meaningful and actually executable, without exceptions or empty results.
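As an illustration of what instantiation could look like, here is a hypothetical sketch that uses column value ranges from the input-data metadata to derive satisfiable selection predicates; the metadata format and operator encoding are made up for this example.

```python
# Hypothetical metadata extracted from the user's input data.
metadata = {"price": {"min": 1.0, "max": 500.0}}

def instantiate(abstract_plan, num_variants=3):
    variants = []
    for i in range(1, num_variants + 1):
        concrete = []
        for op in abstract_plan:
            if op == "filter":
                # Pick a threshold inside the observed value range so the
                # predicate is satisfiable and the plan returns results.
                lo, hi = metadata["price"]["min"], metadata["price"]["max"]
                threshold = lo + (hi - lo) * i / (num_variants + 1)
                concrete.append(f"filter(price < {threshold:.2f})")
            else:
                concrete.append(op)
        variants.append(concrete)
    return variants

for variant in instantiate(["scan", "filter", "groupby"]):
    print(variant)
```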
Once we have the generated plans, the Label Forecaster uses an active learning approach to label the generated query workload efficiently. It executes only a few of the generated plans and forecasts the labels of the rest with an interpretable ML model based on Quantile Regression Forests. It iteratively exploits the uncertainty of the model to select a small number of jobs to execute and outputs the forecasted labels for the non-executed ones along with their uncertainty values. Downstream operations can then leverage these uncertainty values to improve their output, e.g., by using them as a noise indicator.
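The following is a simplified sketch of such an uncertainty-driven loop. We approximate a Quantile Regression Forest with the spread of per-tree predictions of a standard random forest (a common stand-in), and `execute_plan` is a hypothetical callback that runs a plan and returns its measured runtime. This illustrates the idea, not DataFarm's exact implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def uncertainty(forest, X):
    # Spread of the individual trees' predictions as an uncertainty proxy
    # (approximating the quantile spread of a Quantile Regression Forest).
    per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
    return per_tree.std(axis=0)

def label_workload(features, execute_plan, budget=50, batch=10):
    # features: NumPy array with one feature row per generated plan.
    # execute_plan(i): runs plan i and returns its measured runtime
    # (the expensive operation whose use we want to minimize).
    n = len(features)
    labels = {}
    # Seed the model with a small random batch of executed plans.
    for i in np.random.choice(n, batch, replace=False):
        labels[int(i)] = execute_plan(int(i))
    while len(labels) < budget:
        executed = sorted(labels)
        forest = RandomForestRegressor(n_estimators=100, random_state=0)
        forest.fit(features[executed], [labels[i] for i in executed])
        rest = np.array([i for i in range(n) if i not in labels])
        # Execute the plans the current model is most uncertain about.
        most_uncertain = rest[np.argsort(uncertainty(forest, features[rest]))[-batch:]]
        for i in most_uncertain:
            labels[int(i)] = execute_plan(int(i))
    # Forecast labels (and uncertainties) for all non-executed plans.
    executed = sorted(labels)
    forest = RandomForestRegressor(n_estimators=100, random_state=0)
    forest.fit(features[executed], [labels[i] for i in executed])
    rest = np.array([i for i in range(n) if i not in labels])
    return rest, forest.predict(features[rest]), uncertainty(forest, features[rest])
```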
Human-guided training data generation
As users often know their desired query workload, we have introduced the human into the active learning step of the Label Forecaster. This increases the quality of the training data and, thus, of the downstream ML model. The human gives the Label Forecaster insights into which jobs are better to execute in order to get better-estimated labels. As a human alone cannot manually select among thousands of plans, the Label Forecaster iteratively suggests a small set of candidate plans for execution. The user can then inspect these candidates and remove plans or add new ones via an intuitive graphical user interface (GUI). The GUI of DataFarm provides useful insights, such as feature importance and model explanation analysis, so that users can make informed decisions [3].
Results
To evaluate the quality of DataFarm's training data, we generated 2,000 plans from only 6 initial queries. We constructed four training sets that differ in how the labels were obtained: the first training set contains ground-truth labels that we obtained by executing all jobs, while the other three contain labels acquired by sampling 166 jobs to execute and using the ML model of the Label Forecaster to predict the labels of the rest. We used three different sampling mechanisms: (i) random, (ii) agglomerative clustering (DataFarm without-the-human), and (iii) manually modifying the set suggested by the agglomerative clustering (DataFarm with-the-human). Based on these four training sets, we built four ML models, respectively, and predicted the runtimes of our input query workload using each model. The results are shown below.

We observe that the quality of training data generated by DataFarm is as good as the ground truth (green and blue bars, respectively). Most importantly, we observe that
DataFarm with-the-human outperforms both DataFarm without-the-human and the model trained on ground-truth labels.
This result is possible because the user can determine the essential features for the ML model and make modifications (adding and removing jobs) based on her observations.
About Databloom
Databloom is a software company that has developed Blossom Sky, an AI-powered data platform integration as a service. Blossom Sky enables users to unlock the full potential of their data by connecting data sources, enabling generative AI, and gaining performance by running data processing and AI directly at independent data sources. It allows for data collaboration, increased efficiency, and new insights by breaking data silos through a single, unified system view. The platform is designed to support a wide variety of ML and AI algorithms and models.
References
[1] Zoi Kaoudi, Jorge-Arnulfo Quiané-Ruiz, Bertty Contreras-Rojas, Rodrigo Pardo-Meza, Anis Troudi, Sanjay Chawla: ML-based Cross-Platform Query Optimization. ICDE 2020: 1489-1500.
[2] Francesco Ventura, Zoi Kaoudi, Jorge-Arnulfo Quiané-Ruiz, Volker Markl: Expand your Training Limits! Generating Training Data for ML-based Data Management. SIGMOD Conference 2021: 1865-1878.
[3] Robin P. van de Water, Francesco Ventura, Zoi Kaoudi, Jorge-Arnulfo Quiané-Ruiz, Volker Markl: Farm Your ML-based Query Optimizer's Food! Human-Guided Training Data Generation. CIDR 2022 (abstract).
[4] Robin P. van de Water, Francesco Ventura, Zoi Kaoudi, Jorge-Arnulfo Quiané-Ruiz, Volker Markl: Farming Your ML-based Query Optimizer's Food. ICDE 2022 (demo), to appear.