MultiContext in Scalytics Connect: Your Key to Precise Data Insights

February 13, 2024
Dr. Zoi Kaoudi

Welcome to our latest blog post, where we're excited to introduce a powerful new addition to Scalytics Connect: MultiContext. In this post, we'll demonstrate how MultiContext revolutionizes ETL and data processing for organizations, enabling seamless data pipeline deployment across multiple sites while maintaining data privacy and integrity.

Picture this: an organization with disparate departments, each housing its own data processing engines, IT department, data protection officers, and data engineers. Due to data regulation and privacy concerns, raw data cannot be centralized for analysis; only aggregated data may be extracted and processed further. Departments A and B each run their own Spark cluster, storing data in HDFS, CSV files, and a database reachable via JDBC, while Department C employs a Flink cluster and another database for its data processing needs.

Scalytics Connect MultiContext Explained

This problem was not easily solvable, until now. Scalytics Connect introduces MultiContext ETL data pipelines. Wait, what? Let's dive in and see what that means.


Current ETL systems face the challenge of efficiently integrating data residing in diverse sources: multiple databases, data warehouses, data lakes, or other data stores like HDFS or S3. MultiContext data pipeline processing is a next-gen technology solution that enables direct querying and processing of data within its original environment from a single procedure call (i.e., within one JVM), eliminating the need for constant data movement and centralization. This innovative approach significantly streamlines data pipelines, paving the way for real-time insights and improved decision-making.

Additionally, MultiContext processing enables enhanced data management by promoting agility, flexibility, and robust security practices. This advancement solidifies its position as a cornerstone technology for next-generation ETL platforms.

Examples of MultiContext Processing

  • Retail Industry: Combine real-time sales data from point-of-sale systems with customer sentiment analysis from social media platforms, all within their respective contexts, to gain immediate insights into customer behavior and preferences.
  • Financial Services: Analyze customer financial data stored in a secure banking system alongside market data from external sources, in real-time, to make informed investment decisions.


By eliminating the need for lengthy data movement and transformation processes, MultiContext processing enables the creation of continuously running data pipelines that deliver valuable data insights in minutes, not days. This significantly improves efficiency and allows businesses to make data-driven decisions faster than ever before.
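The privacy constraint from the scenario above, where only aggregates may leave a site, can be illustrated with a small self-contained Scala sketch. The site names and values below are invented for illustration; the sketch uses plain collections, not the Scalytics Connect API:

```scala
// Each site computes a partial aggregate locally; only small (sum, count)
// pairs cross site boundaries, never the raw records.
object FederatedAverage {
  val siteA = Seq(12.0, 8.0, 10.0)  // raw data stays in department A
  val siteB = Seq(20.0, 30.0)       // raw data stays in department B

  // In-situ partial aggregation: runs where the data lives
  def partial(xs: Seq[Double]): (Double, Int) = (xs.sum, xs.size)

  // Merge only the partials centrally to obtain the global result
  val (sum, count) = Seq(partial(siteA), partial(siteB))
    .reduce((a, b) => (a._1 + b._1, a._2 + b._2))

  val globalAverage: Double = sum / count  // 80.0 / 5 = 16.0

  def main(args: Array[String]): Unit =
    println(s"global average = $globalAverage")
}
```

The same pattern generalizes to any decomposable aggregate (sums, counts, min/max), which is exactly what makes "only aggregated data leaves the site" workable in practice.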

Use MultiContext in Scalytics Connect

The MultiContext() function in Scalytics Connect empowers developers to easily create separate configurations tailored to the specific needs of each department. These configurations encompass details like Spark and Flink clusters, JDBC installations (if applicable), and the designated path for storing processed data:


val context1 = new ScalyticsContext(configuration1)
  .withPlugin(Java.basicPlugin())
  .withPlugin(Spark.basicPlugin())
  .withPlugin(JDBC.basicPlugin())
  .withTextFileSink("file:///path/to/output/out1")

val context2 = new ScalyticsContext(configuration2)
  .withPlugin(Java.basicPlugin())
  .withPlugin(Spark.basicPlugin())
  .withPlugin(JDBC.basicPlugin())
  .withTextFileSink("hdfs:///path/to/output/out2")

val context3 = new ScalyticsContext(configuration3)
  .withPlugin(Java.basicPlugin())
  .withPlugin(Flink.basicPlugin())
  .withPlugin(JDBC.basicPlugin())
  .withTextFileSink("hdfs:///path/to/output/out3")
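The `configuration1` through `configuration3` objects above carry the per-site settings. As a rough, hypothetical sketch of what such a configuration might contain (the property names below are illustrative assumptions, not the actual Scalytics Connect configuration API):

```scala
// Hypothetical sketch: property names are illustrative assumptions,
// not the actual Scalytics Connect configuration API.
val configuration1 = new Configuration()
// Department A's own Spark cluster
configuration1.setProperty("spark.master", "spark://dept-a-cluster:7077")
// Department A's JDBC-accessible database
configuration1.setProperty("jdbc.url", "jdbc:postgresql://dept-a-db:5432/sales")
```

Each context thus bundles everything a site needs: which execution engines are available there, how to reach its data stores, and where its results should land.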

Next, utilizing the MultiContextPlanBuilder, developers can outline the desired data processing tasks across these disparate contexts.


val multiContextPlanBuilder = new MultiContextPlanBuilder(List(context1, context2, context3))

val customers = multiContextPlanBuilder
  .readTable(context1, customersDB1)
  .readTable(context2, customersDB2)
  .readTable(context3, customersDB3)
  .forEach(_
    /* Filter for year 2022 */
    .filter(record => record.getField(5).asInstanceOf[String].contains("2022"))
  )

val sales = multiContextPlanBuilder
  .readTextFile(context1, salesFile1)
  .readTextFile(context2, salesFile2)
  .readTextFile(context3, salesFile3)
  .forEach(_
    .map(line => {
      val vals = line.split(",")
      new Record(vals(0), vals(1), vals(2).toDouble, 1)
    })
    /* Filter for year 2022 */
    .filter(record => record.getField(0).asInstanceOf[String].contains("2022"))
  )

val results = sales
  /* Join sales with customers data */
  .combineEach(customers, (dq1: DataQuanta[Customer], dq2: DataQuanta[Result]) =>
    dq1.join(id, dq2, _._2))
  /* Aggregate data */
  .reduceByKey(record => record.getField(1), (r1, r2) =>
    new Record(r1.getField(0), r1.getField(1),
      r1.getDouble(2) + r2.getDouble(2), r1.getInt(3) + r2.getInt(3)))
  /* Average */
  .map(r => new Record(r.getField(1), r.getDouble(2) / r.getInt(3)))
  .execute()

With just a few lines of code, developers can now execute in-situ data processing across different sites. What's more, installations in each site can be heterogeneous, accommodating various clusters like Spark or Flink seamlessly. Importantly, one can issue the same job to multiple Spark clusters simultaneously, something that is not possible with Spark alone, since Spark does not allow more than one SparkContext in a single JVM.
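To make the aggregation steps in the pipeline above concrete, here is a small self-contained Scala sketch that mirrors the parse, filter, reduceByKey, and average logic on plain collections. The sample lines are invented for illustration, and no cluster or Scalytics API is needed:

```scala
// Mirrors the pipeline's parse -> filter -> reduce-by-key -> average steps
// using plain Scala collections; sample CSV lines are invented.
object PipelineSketch {
  val lines = Seq(
    "2022-01-05,east,10.0",
    "2022-03-02,east,20.0",
    "2022-07-11,west,30.0",
    "2021-12-24,east,99.0"  // filtered out: not from 2022
  )

  val records = lines
    .map(_.split(","))
    .map(v => (v(0), v(1), v(2).toDouble, 1))  // (date, region, amount, count)
    .filter(_._1.contains("2022"))             // keep only year 2022

  // reduce by the region key: sum amounts and counts per key
  val reduced = records
    .groupBy(_._2)
    .map { case (region, rs) => region -> (rs.map(_._3).sum, rs.map(_._4).sum) }

  // average per region
  val averages: Map[String, Double] =
    reduced.map { case (region, (total, n)) => region -> total / n }
}
```

Carrying a count field alongside the running sum, as the pipeline's `Record` does, is what allows the final `map` step to turn the per-key totals into averages.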

In conclusion, MultiContext heralds a new era of federated and in-situ data processing, empowering organizations to leverage their distributed infrastructure while ensuring data privacy and efficiency. Stay tuned for more updates and insights from the Scalytics team as we continue to innovate in the realm of unifying data processing solutions.

About Scalytics

Legacy data infrastructure can't keep pace with the speed and complexity of modern AI initiatives. Data silos stifle innovation, slow down insights, and create scalability bottlenecks. Scalytics Connect, the next-generation data platform, solves these challenges. Experience seamless integration across diverse data sources, enabling true AI scalability and removing the roadblocks that hinder your AI ambitions. Break free from the limitations of the past and accelerate innovation with Scalytics Connect.

We enable you to make data-driven decisions in minutes, not days
Scalytics is powered by Apache Wayang, and we're proud to support the project. You can check out its public GitHub repository. If you're enjoying our software, show your love and support - a star ⭐ would mean a lot!

If you need professional support from our team of industry leading experts, you can always reach out to us via Slack or Email.