Federated data processing has been a standard model for virtual integration of disparate data sources, where each source retains a degree of autonomy. While early federated technologies resulted from mergers, acquisitions, and specialized corporate applications, recent demand for decentralized data storage and computation in information marketplaces and for geo-distributed data analytics has made federated data services an indispensable component in the data systems market.
At the same time, growing concerns about data privacy, propelled by regulations across the world, have brought federated data processing under the purview of regulatory bodies.
This series of blog posts discusses the challenges in building regulation-compliant federated data processing systems and our initiatives at Databloom that strive towards making compliance a first-class citizen in our Blossom data platform.
Federated data processing
Running analytics in a federated environment requires distributed data processing capabilities that (1) provide a unified query interface to analyze distributed and decentralized data, (2) transparently translate a user-specified query into a so-called query execution plan, and (3) can execute plan operators across compute sites. Here, a critical component in the processing pipeline is the query optimizer. Typically, a query optimizer considers distributed execution strategies (which distribute query operators such as joins or aggregations across compute nodes) and the communication cost between compute nodes, and it introduces a global property that describes where, i.e., at which site, each plan operator is processed. For example, a query that joins data from three sources, located in Asia, Europe, and North America, may be executed by first joining the data in North America and Europe and then joining the result with the data in Asia.
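To make the "site" property concrete, here is a minimal sketch of an execution plan whose operators are annotated with the site where they run. The class names, table names, site labels, and row counts are all hypothetical and chosen only for illustration; this is not the plan representation of any particular system.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Scan:
    source: str  # hypothetical table name
    site: str    # site where the data lives
    rows: int    # estimated number of rows


@dataclass(frozen=True)
class Join:
    left: "Plan"
    right: "Plan"
    site: str    # site where the join is executed
    rows: int    # estimated output cardinality


Plan = Union[Scan, Join]


def transfers(plan: Plan) -> List[Tuple[str, str, int]]:
    """Return (from_site, to_site, rows) for every cross-site shipment."""
    if isinstance(plan, Scan):
        return []
    shipped: List[Tuple[str, str, int]] = []
    for child in (plan.left, plan.right):
        shipped += transfers(child)
        # An input produced at another site must be shipped to the join site.
        if child.site != plan.site:
            shipped.append((child.site, plan.site, child.rows))
    return shipped


# Join North America with Europe first, then join the result with Asia.
orders = Scan("orders", site="NA", rows=1000)
customers = Scan("customers", site="EU", rows=200)
suppliers = Scan("suppliers", site="ASIA", rows=50)

na_eu = Join(orders, customers, site="EU", rows=300)
plan = Join(na_eu, suppliers, site="ASIA", rows=120)

print(transfers(plan))
# [('NA', 'EU', 1000), ('EU', 'ASIA', 300)]
```

Even in this toy plan, the choice of join site determines which data crosses which border, which is exactly the information an optimizer needs once transfers are regulated.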
Growing data regulations, a new challenge
Note that federated queries implicitly ship data (i.e., intermediate query results) between compute sites. While several performance aspects, such as bandwidth, latency, communication cost, and compute capabilities, have received great attention, the federated nature of data processing has recently been challenged by data transfer regulations (or policies) that restrict the movement of data across geographical (or institutional) borders, and by any other data protection rules that may apply to the data being transferred between certain sites.
European directives, for example, permit transferring only certain information fields (or combinations thereof), such as non-personal information or information not relatable to a person. Likewise, regulations in Asia may also impose restrictions on data transfer. Non-compliance with such regulatory obligations has attracted fines to the tune of billions of dollars. It is, therefore, crucial to consider compliance with legal requirements when analyzing federated data.
Data transfer regulations through the GDPR lens
Most countries around the world now have data protection laws that impose restrictions on how data is stored, processed, and transferred. The EU General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) are the most prominent among them.
Let's take GDPR as an example. GDPR Articles 44-50 explicitly deal with the transfer of data across national borders. Two of these articles, together with one recital, set out legal requirements for transferring data that fundamentally affect federated data processing.
Article 45: Transfers on the basis of an adequacy decision.
The article dictates that transfer of data may take place without any specific authorization, e.g., when there is adequate data protection at the site to which data is being transferred or when the data is not subject to regulation (i.e., when the data does not fall under the definition of personal data in Article 4(1)).
Article 46: Transfers subject to appropriate safeguards.
This article prescribes that (when Article 45 does not apply) data transfer can take place under "appropriate safeguards". According to the European Data Protection Board (EDPB) recommendations on measures that supplement transfer tools, pseudonymisation of data (as defined in Article 4(5)) is considered an effective supplementary measure.
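As an illustration of what pseudonymisation can look like in practice, the sketch below replaces selected fields with a keyed hash (HMAC-SHA256) using only the Python standard library. This is one common technique, not a method prescribed by GDPR or the EDPB; the function name, field names, and key are hypothetical, and whether a given scheme is legally sufficient depends on the concrete transfer and must be assessed separately.

```python
import hashlib
import hmac


def pseudonymise(record: dict, fields: set, key: bytes) -> dict:
    """Return a copy of `record` with the given fields replaced by keyed hashes.

    The secret key plays the role of the "additional information" that must be
    kept separately: without it, tokens cannot be linked back to the inputs.
    """
    out = {}
    for name, value in record.items():
        if name in fields:
            token = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256)
            out[name] = token.hexdigest()
        else:
            out[name] = value
    return out


# Hypothetical record and key, for illustration only.
secret_key = b"keep-this-key-at-the-exporting-site"
record = {"email": "alice@example.com", "country": "DE", "amount": 42}

safe = pseudonymise(record, {"email"}, secret_key)
print(safe["country"], safe["amount"], len(safe["email"]))
# DE 42 64
```

Because the same input and key always yield the same token, pseudonymised records from the same site can still be joined or grouped on the tokenised field.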
Recital 108: Transfers under measures that compensate for the lack of data protection.
Data that has been adequately anonymized (i.e., the resulting data does not fall under Article 4(1), as described in Recital 26) is outside the ambit of GDPR and can therefore be transferred.
Depending on the data and on where that data is being transferred, the above regulations can be classified into three categories:
- No restrictions on transfer: Some data may be transferred unconditionally, and some only to certain locations.
- Conditional restrictions on transfer: Some data can be transferred to (certain) locations only as derived information (such as aggregates) or only after anonymization.
- Complete ban on transfer: Some data must not be transferred outside under any circumstances.
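The three categories above can be sketched as a small policy lookup. Everything in this example is an assumption made for illustration: the dataset names, the site labels, the `"*"` wildcard, and the choice to treat any unlisted transfer as banned.

```python
from enum import Enum


class Transfer(Enum):
    ALLOWED = "allowed"          # no restriction for this destination
    CONDITIONAL = "conditional"  # only derived or anonymized data may leave
    BANNED = "banned"            # data must never leave


# Hypothetical policy table: (dataset, destination) -> rule.
# "*" stands for any destination; unlisted pairs default to BANNED.
POLICIES = {
    ("weather", "*"): Transfer.ALLOWED,
    ("sales", "EU"): Transfer.ALLOWED,
    ("customers", "NA"): Transfer.CONDITIONAL,
}


def rule_for(dataset: str, destination: str) -> Transfer:
    """Look up the most specific rule, falling back to the wildcard, then BANNED."""
    for key in ((dataset, destination), (dataset, "*")):
        if key in POLICIES:
            return POLICIES[key]
    return Transfer.BANNED


def may_ship(dataset: str, destination: str, derived: bool = False) -> bool:
    """`derived` marks data that was aggregated or anonymized before shipping."""
    rule = rule_for(dataset, destination)
    if rule is Transfer.ALLOWED:
        return True
    if rule is Transfer.CONDITIONAL:
        return derived
    return False


print(may_ship("weather", "ASIA"))                # True
print(may_ship("sales", "ASIA"))                  # False
print(may_ship("customers", "NA"))                # False
print(may_ship("customers", "NA", derived=True))  # True
```

Defaulting to a ban is a conservative design choice for this sketch: a transfer is allowed only if some rule explicitly permits it.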
Compliance-by-Design: The Challenges
Rather than ad-hoc solutions to make data processing regulation-compliant, what is required is a more holistic approach that provides appropriate safeguards to data controllers (entities that determine what data is processed and how) and data processors (entities that process data on behalf of a controller) within a federated data processing system.
In the context of federated data processing, three aspects must be revisited:
- First and foremost, data processing systems must offer declarative policy specification languages that make it easy for controllers to specify data regulations. Policy specification languages should take into account the type of data, how it is processed, and where it is processed. Regulations may affect the processing of an entire dataset, a subset of it, or even information derived from it. Policy specifications must also accommodate the heterogeneity of data formats (e.g., graph, relational, or textual data).
- The second aspect of placing compliance at the core of federated data processing is integrating legal requirements into query rewriting and optimization. A system must be able to transparently translate user queries into compliant execution plans.
- Lastly, federated systems must offer capabilities to decentralize query execution, which a compliant plan may itself require. We need query executors that can efficiently orchestrate queries over different platforms across multiple clouds or data silos.
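The second aspect, compliance-aware optimization, can be sketched as filtering candidate plans by a transfer policy before choosing the cheapest one. This is a deliberately simplified illustration and not how any particular optimizer works: the candidate plans, the cost model (rows shipped), the dataset names, and the policy are all hypothetical.

```python
from typing import Callable, List, NamedTuple, Optional


class Shipment(NamedTuple):
    data: str  # what is shipped (hypothetical dataset name)
    src: str   # originating site
    dst: str   # receiving site
    rows: int  # estimated rows shipped, used here as the cost


class Candidate(NamedTuple):
    name: str
    shipments: List[Shipment]


def cheapest_compliant(
    candidates: List[Candidate],
    allowed: Callable[[Shipment], bool],
) -> Optional[Candidate]:
    """Drop plans with any disallowed shipment, then pick the cheapest one."""
    compliant = [c for c in candidates if all(allowed(s) for s in c.shipments)]
    return min(
        compliant,
        key=lambda c: sum(s.rows for s in c.shipments),
        default=None,
    )


def allowed(s: Shipment) -> bool:
    # Hypothetical policy: raw customer data must not leave the EU.
    return not (s.data == "customers" and s.src == "EU" and s.dst != "EU")


candidates = [
    # Cheaper, but ships raw EU customer data to North America.
    Candidate("join in NA first", [
        Shipment("customers", "EU", "NA", 200),
        Shipment("order_totals", "NA", "ASIA", 300),
    ]),
    # More expensive, but only an aggregate (assumed to contain no
    # personal fields) leaves the EU.
    Candidate("join in EU first", [
        Shipment("orders", "NA", "EU", 1000),
        Shipment("order_totals", "EU", "ASIA", 300),
    ]),
]

best = cheapest_compliant(candidates, allowed)
print(best.name)
# join in EU first
```

The point of the sketch is that the cheapest plan is not necessarily the one chosen: the optimizer trades 800 extra shipped rows for a plan that respects the policy.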
Today, regulation-compliant data processing is a major challenge, driven by regulatory bodies across the world. In this blog post, we analyzed data transfer regulations from the perspective of GDPR and discussed key research challenges for including compliance aspects in federated data processing. Compliance is at the core of our Blossom data platform. In the next blog post, we will discuss how Databloom's Blossom data platform addresses some of the aforementioned challenges and ensures regulation-compliant data processing across multiple clouds, geo-locations, and data platforms.
 Kaustubh Beedkar, Jorge-Arnulfo Quiané-Ruiz, Volker Markl: Compliant Geo-distributed Query Processing. SIGMOD Conference 2021: 181-193
 Kaustubh Beedkar, David Brekardin, Jorge-Arnulfo Quiané-Ruiz, Volker Markl: Compliant Geo-distributed Data Processing in Action. Proc. VLDB Endow. 14(12): 2843-2846 (2021)
 Kaustubh Beedkar, Jorge-Arnulfo Quiané-Ruiz, Volker Markl: Navigating Compliance with Data Transfers in Federated Data Processing. IEEE Data Eng. Bull. 45(1): 50-61 (2022)