Cost-based optimization is widely known to suffer from a major weakness: administrators spend a significant amount of time to tune the associated cost models. This problem only gets exacerbated in cross-platform settings as there are many more parameters that need to be tuned.
Data analytics are moving beyond the limits of a single platform. In this paper, we present the cost-based optimizer of Rheem, an open-source crossplatform system that copes with these new requirements.
Data analytics are moving beyond the limits of a single data processing platform. A cross-platform query optimizer is necessary to enable applications to run their tasks over multiple platforms efficiently and in a platform-agnostic manner.
Although big data processing has become dramatically easier over the last decade, there has not been matching progress over big data debugging. It is estimated that users spend more than 50% of their time debugging their big data applications.
Solving business problems increasingly requires going beyond the limits of a single data processing platform (platform for short), such as Hadoop or a DBMS. As a result,organizations typically perform tedious and costly tasks to juggle their code and data across different platforms.
Many of today’s applications need several data processing platforms for complex analytics. Thus, recent systems have taken steps towards supporting cross-platform data analytics. Yet, current cross-platform systems lack of ease-of-use, which is crucial for their adoption.
There is a zoo of data processing platforms which help users and organizations to extract value out of their data.Although each of these platforms excels in specific aspects, users typically end up running their data analytics on suboptimal platforms.
Today, organizations typically perform tedious and costly tasks to juggle their code and data across different data processing platforms. Addressing this pain and achieving automatic cross-platform data processing is quite challenging because it requires quite good expertise for all the available data processing platforms.
Inequality joins, which is to join relations with inequality conditions, are used in various applications. Optimizing joins has been the subject of intensive research ranging from efficient join algorithms such assort-merge join, to the use of efficient indices such as B+-tree, R∗-tree and Bitmap.
As the use of machine learning (ML) permeates into diverse application domains, there is an urgent need to support a declarative framework for ML. Ideally, a user will specify anML task in a high-level and easy-to-use language and the framework will invoke the appropriate algorithms and system configurations to execute it.
The world is fast moving towards a data-driven society where data is the most valuable asset. Organizations need to perform very diverse analytic tasks using various data processing platforms. In Doing so, they face many challenges; chiefly, platform dependence,poor interoperability, and poor performance when using multiple platforms.
Many emerging applications, from domains such as healthcare and oil & gas, require several data processing system for complex analytics. This demo paper showcases Rheem, a framework that provides multi-platform task execution for such applications
Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big datasets. This presents a serious impediment since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions.
Inequality joins, which join relational tables on inequality conditions, are used in various applications. While there have been a wide range of optimization methods for joins in database systems, from algorithms such as sort-merge join and band join, to various indices such as B+-tree, R∗-tree and Bitmap.