Debugging is an essential part of software engineering: the process of finding and fixing defects in computer programs. There are many ways to debug, and most of them are at times long and at times painful, as we all know. Debugging is work we as software developers have to do, but hardly ever love.
In a previous blog post, we discussed the difficulty of debugging big data applications, the problems that arise during the debugging process, and the multi-layered processes created by running federated learning across multiple, independent data stores. Today we're introducing "TagSniff," a new debugging technique that adds simple instrumentation primitives for online and post-hoc debugging.
TagSniff is part of Blossom Core and feeds into our upcoming AI advisor, a generative AI service that helps debug AI pipelines across federated data stores. The Blossom advisory service will detect bugs or invalid data modifiers, trace them, and report its findings back to the data engineer's console. Over time, the advisor learns and can find and fix bugs automatically. Data engineers can choose to apply suggestions manually or integrate the advisor into a CI/CD pipeline, where it will be able to open pull requests and suggest code changes for review. We plan to release a first code preview in Q3 2023, but we can't and won't commit to this timeline at this early stage of development.
In this blog post, we introduce TagSniff in more detail: what it is, what its components are, and how to use it to debug your code.
The idea behind "TagSniff"
TagSniff is a dataflow instrumentation approach built on two primitives, tag and sniff, that operate on debug tuples. A debug tuple is a data structure consisting of multiple parts that flows between the data operators whenever debugging is enabled. The tag primitive attaches tags to a tuple, while the sniff primitive identifies tuples that require debugging or further analysis based on their metadata or values. What makes these primitives unique is that users can easily add custom debugging functionality via user-defined functions (UDFs). Any system that implements this abstract debugging model is a TagSniff system.
The debug tuple
The debug tuple is the tuple on which the TagSniff primitives operate. A debug tuple is composed of the original tuple prefixed with annotations and/or metadata, <|tag1|tag2|..., <tuple>>. Annotations describe how users expect the system to react, while metadata adds extra information to the tuples, such as an identifier. The table below illustrates an example set of annotations.
Tags are inserted by either users or the debugging system and mainly stem from dataflow instrumentation. The users can manipulate these tags to support sophisticated debugging scenarios, such as lineage. To enable this tag manipulation, TagSniff provides the following methods on the debug tuple:
- add_tag (tag: String)
- get_tag (tag: String)
- has_tag (tag: String)
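To make the debug tuple and its tag-manipulation methods concrete, here is a minimal Python sketch. The class name, field names, and the `get_tag` return convention are assumptions for illustration, not the actual TagSniff API:

```python
from dataclasses import dataclass, field

@dataclass
class DebugTuple:
    """Hypothetical sketch of a TagSniff debug tuple: the original
    data tuple prefixed with tags and metadata."""
    value: tuple                              # the original data tuple
    tags: set = field(default_factory=set)    # annotations, e.g. "pause"
    metadata: dict = field(default_factory=dict)  # extra info, e.g. an id

    def add_tag(self, tag: str) -> None:
        # Insert a tag (by the user or the debugging system)
        self.tags.add(tag)

    def get_tag(self, tag: str):
        # Return the tag if present, else None (return type is assumed)
        return tag if tag in self.tags else None

    def has_tag(self, tag: str) -> bool:
        # Check whether the tuple carries a given tag
        return tag in self.tags
```

With this representation, sophisticated scenarios such as lineage reduce to reading and writing tags as a tuple flows between operators.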
The "tag and sniff" primitives
The TagSniff model provides two primitives, tag and sniff, to instrument the debug tuple. The tag primitive is used for adding tags to a tuple. The input is a UDF that receives a tuple and outputs a new tuple with any new tags users would like to append. The sniff primitive is used for identifying tuples requiring debugging or further analysis based on either their metadata or values. The input is a UDF that receives a tuple and outputs true or false depending on whether the user wants to analyze this tuple or not. Let's take a look at two specific debugging tasks that can be implemented using "TagSniff" without requiring a lot of boilerplate code.
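The two primitives can be pictured as a map and a filter over a stream of debug tuples. The following plain-Python sketch (representing debug tuples as dicts, with all names assumed for illustration) shows the shape of the model:

```python
# Hypothetical sketch: debug tuples as dicts with "value" and "tags".
def tag(tuples, udf):
    """The tag primitive: applies a UDF that may attach new tags."""
    return [udf(t) for t in tuples]

def sniff(tuples, udf):
    """The sniff primitive: keeps tuples for which the UDF returns True."""
    return [t for t in tuples if udf(t)]

# Example UDF: flag tuples that contain a null (None) field.
def mark_nulls(t):
    if None in t["value"]:
        t["tags"].add("suspect")
    return t

data = [{"value": ("a", 1), "tags": set()},
        {"value": ("b", None), "tags": set()}]
flagged = sniff(tag(data, mark_nulls), lambda t: "suspect" in t["tags"])
# flagged holds only the tuple with the None field
```

Because both primitives take UDFs, the same two calls cover very different debugging tasks simply by swapping the functions passed in.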
Example 1: Data Breakpoint
Suppose you want to add a data breakpoint in a Spark program that retrieves the top-100 most frequent words. You want to pause the execution of the program whenever it encounters a tuple with a null value, so that you can further inspect it. Here's how you can achieve this using the tag and sniff primitives:
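The sketch below illustrates the idea in plain Python rather than actual Spark code; the dict-based debug tuples and the breakpoint handling are stand-ins for what a real TagSniff system would provide:

```python
def tag_udf(t):
    # tag primitive: mark tuples containing a null value with "pause"
    if None in t["value"]:
        t["tags"].add("pause")
    return t

def sniff_udf(t):
    # sniff primitive: True means execution should pause on this tuple
    return "pause" in t["tags"]

# Toy stand-in for the word-count tuples of the top-100 words job
word_counts = [{"value": ("the", 412), "tags": set()},
               {"value": (None, 7), "tags": set()}]

for t in (tag_udf(t) for t in word_counts):
    if sniff_udf(t):
        # A real TagSniff system would suspend the dataflow here
        print("breakpoint hit:", t["value"])
```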
In the above code, the tag primitive adds a "pause" tag to any tuple that contains a null value, while the sniff primitive checks if a tuple has the "pause" tag and returns true if it does, indicating that the execution of the program should be paused at that point.
Example 2: Log
Suppose you want to log any tuple that contains a null value so that you can use it for tracing later on. You need to generate a unique identifier for each tuple and add it to the tuple's metadata. Here's how you can achieve this using the tag and sniff primitives:
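Again as a plain-Python sketch (the dict representation and logging hook are assumptions, not the real API), the identifier can be generated with a standard UUID and stored in the tuple's metadata:

```python
import uuid

def tag_udf(t):
    # tag primitive: tag null-containing tuples with "log" and a unique id
    if None in t["value"]:
        t["tags"].add("log")
        t["metadata"]["id"] = str(uuid.uuid4())  # identifier for tracing
    return t

def sniff_udf(t):
    # sniff primitive: True means the tuple should be logged
    return "log" in t["tags"]

records = [{"value": ("x", None), "tags": set(), "metadata": {}},
           {"value": ("y", 2), "tags": set(), "metadata": {}}]

for t in (tag_udf(t) for t in records):
    if sniff_udf(t):
        # A real system would write this to a log sink instead
        print("logging tuple", t["metadata"]["id"], t["value"])
```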
In the above code, the tag primitive generates a unique identifier for any tuple that contains a null value and adds it to the tuple's metadata along with a "log" tag. The sniff primitive checks if a tuple has the "log" tag and returns true if it does, indicating that the tuple should be logged.
TagSniff in a nutshell
It's worth noting that "TagSniff" was designed to be as simple as possible and is defined at tuple granularity only. You might wonder, then, how to apply "TagSniff" to a set of tuples. One approach is to use the tag primitive to mark each tuple with a tag indicating its membership in the set, and then use the sniff primitive to check for that tag on a per-tuple basis. By providing only two primitives, tag and sniff, the model makes common debugging tasks easy to compose and custom debugging tasks possible via user-defined functions. Its main advantage is the flexibility to support most online and post-hoc debugging tasks easily and effectively; we will illustrate these two debugging modes in a follow-up blog post.
Overall, the TagSniff model provides a powerful abstraction for data debugging that can be used in a variety of contexts and can significantly reduce the amount of boilerplate code required for debugging tasks.
Databloom is a software company that develops Blossom Sky, an AI-powered data platform integration as a service. Blossom Sky enables users to unlock the full potential of their data by connecting data sources, enabling generative AI, and gaining performance by running data processing and AI directly at independent data sources. By breaking data silos in a unified manner through a single system view, it allows for data collaboration, increased efficiency, and new insights. The platform supports a wide range of ML and AI algorithms and models.