AI training over private sensitive distributed data sources
More and more organizations are turning to distributed data sources for their analytics needs, whether because of data volume constraints or privacy concerns. For example, hospitals and medical organizations increasingly use machine learning and deep learning to learn from medical data; these techniques have helped them better schedule surgical operations and predict patient re-hospitalizations after organ transplants. Similarly, Earth observation applications require analyzing large volumes of distributed data for purposes such as flood damage estimation or wildfire prediction. As a result, many companies and individuals need to run their analytics in situ, i.e., where the data is, without (or with only minimal) data movement across data owners and storage sites.
Blossom Sky embraces in situ data analytics at its core
In situ data processing is typically performed by scientists, researchers, and engineers who work in fields such as earth science, environmental science, and oceanography. These individuals use specialized equipment and software to collect and analyze data in the field, often in remote or difficult-to-access locations. The data is then used to improve our understanding of the environment and to inform decision-making related to resource management and conservation. In some cases, the data processing may be done by automated equipment or by a team of data scientists and engineers who work remotely.
Databloom Blossom provides a framework for running analytics in a federated fashion. When data cannot be moved to a single place for analysis, whether because of privacy concerns or the cost of large data transfers, Blossom brings the computation to the data instead of the data to the computation, using a federated learning approach. For instance, Databloom Blossom allows users to prepare and build predictive machine learning models over distributed, privacy-sensitive data sources and, more importantly, spares them from having to work out how to deploy those models. In healthcare applications, for example, this could give medical professionals easy access to information that could be used in diagnosing patients.
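To make the "computation to the data" idea concrete, here is a minimal sketch (the function names and record layout are illustrative, not Blossom's actual API): each site computes only a small local aggregate, and the coordinator combines those aggregates, so raw records never leave their silo.

```python
# Illustrative sketch: compute a global mean across data silos
# without moving raw records. Each site returns only (sum, count).

def local_aggregate(records):
    """Runs inside a site's trust boundary; raw records never leave."""
    values = [r["value"] for r in records]
    return sum(values), len(values)

def global_mean(site_aggregates):
    """Runs on the coordinator; sees only per-site aggregates."""
    total = sum(s for s, _ in site_aggregates)
    count = sum(c for _, c in site_aggregates)
    return total / count

# Three hypothetical silos (e.g., hospitals) with private records.
site_a = [{"value": 10.0}, {"value": 12.0}]
site_b = [{"value": 8.0}]
site_c = [{"value": 14.0}, {"value": 16.0}, {"value": 9.0}]

aggregates = [local_aggregate(s) for s in (site_a, site_b, site_c)]
print(global_mean(aggregates))  # same result as pooling all records
```

The same pattern generalizes from simple statistics to model training: replace the (sum, count) pair with model updates.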
Generative AI in health tech
In situ data processing can help with sensitive data, such as information protected by the Health Insurance Portability and Accountability Act (HIPAA), by keeping the data within the secure environment where it was originally collected.
For example, in situ data processing can allow healthcare providers to analyze patient data, such as medical imaging or electronic health records, directly on the patient's device or on a secure server within the healthcare facility. This eliminates the need to transfer the data to a remote location, which reduces the risk of data breaches and supports compliance with HIPAA regulations. Additionally, in situ data processing can include techniques such as data de-identification, encryption, and access control to further protect sensitive data. Let's discuss a concrete use case: long COVID-19 research.
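As a hedged illustration of the de-identification step (this is a sketch, not a complete HIPAA Safe Harbor implementation; the field names and salt handling are assumptions): direct identifiers are dropped, and the patient ID is replaced by a salted hash so records can still be linked across tables without exposing the original identifier.

```python
import hashlib

SALT = b"site-local-secret"  # stays inside the site, never shared

def pseudonymize(patient_id: str) -> str:
    """Salted hash: stable within the site, unlinkable outside it."""
    return hashlib.sha256(SALT + patient_id.encode()).hexdigest()[:16]

def deidentify(record: dict) -> dict:
    """Drop direct identifiers and pseudonymize the patient ID."""
    cleaned = {k: v for k, v in record.items()
               if k not in {"name", "address", "phone"}}
    cleaned["patient_id"] = pseudonymize(record["patient_id"])
    return cleaned

record = {"patient_id": "P-1001", "name": "Jane Doe",
          "address": "1 Main St", "vaccine": "mRNA", "age_band": "40-49"}
print(deidentify(record))
```

In practice this would be combined with encryption in transit and at rest, plus access control at each silo, as noted above.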
Consider a pharmaceutical company that would like to know how effective each COVID-19 vaccine type is for certain population segments in North America. Its data science team needs access to all vaccination centers and health centers to know what type of vaccine each citizen has received, as well as information about the health of vaccinated citizens and possible hospitalizations. This means the company needs access to HIPAA-protected data from health centers in the targeted areas. Additionally, to study segments of citizens who belong to different municipalities, the team has to analyze data from all municipalities in the area. This use case involves multiple data sets located in different data silos, and it is highly privacy-sensitive: any regulatory violation would result in fines, and the company's public reputation would be at stake. With Blossom Sky and federated learning, the data is processed in situ, avoiding both breaches at the data sources and regulatory issues. A typical data modeling pipeline for this use case follows the federated pattern: each silo trains locally, and only model updates are aggregated centrally.
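A minimal sketch of one such pipeline, using plain federated averaging (FedAvg) with synthetic data (the silo data, feature layout, and function names are illustrative, not the study's actual model or Blossom's API): each health center fits a local update, and only model weights, never HIPAA-protected records, are averaged, weighted by each silo's sample count.

```python
import math

def local_update(weights, data, lr=0.5, epochs=20):
    """Logistic-regression SGD inside a silo; raw records never leave."""
    w = list(weights)
    for _ in range(epochs):
        for features, label in data:
            z = sum(wi * xi for wi, xi in zip(w, features))
            pred = 1.0 / (1.0 + math.exp(-z))
            for i, xi in enumerate(features):
                w[i] -= lr * (pred - label) * xi
    return w, len(data)

def fedavg(updates):
    """Coordinator: average weights, weighted by silo sample counts."""
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [sum(w[i] * n for w, n in updates) / total for i in range(dim)]

# Synthetic silo data: features = (bias, vaccinated),
# label = 1 if no hospitalization occurred.
silo_1 = [((1.0, 1.0), 1), ((1.0, 0.0), 0), ((1.0, 1.0), 1)]
silo_2 = [((1.0, 0.0), 0), ((1.0, 1.0), 1)]

global_w = [0.0, 0.0]
for _ in range(5):  # federated rounds
    updates = [local_update(global_w, silo) for silo in (silo_1, silo_2)]
    global_w = fedavg(updates)

print(global_w)  # positive weight on the "vaccinated" feature
```

The coordinator never sees individual vaccination or hospitalization records, only the aggregated weights.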
Satellite and Space: Earth Observation
The Earth Observation Programme Directorate's mission is to keep observation techniques up to date technologically and to develop cutting-edge products. This directorate pursues scientific knowledge and transforms it into products that are beneficial to society as a whole. The well-known meteorological missions of the European Organisation for the Exploitation of Meteorological Satellites (EUMETSAT) fall under the purview of this directorate, as do Sentinel satellites developed by the European Space Agency in collaboration with the European Commission under the Copernicus program.
One example, based on European Sentinel-1 data accessed through the Copernicus DIAS (Data and Information Access Services) platforms, is the observation of flood-damaged environments at a larger geographical scale to assess damage caused by climate change. The DIAS platforms provide time series of synthetic aperture radar (SAR) images that match the desired spatial and temporal constraints of these events. To gain deeper insight, the model extracts water masks and maximum-extent layers from SAR images using a deep neural network. Training a neural network for semantic segmentation of water masks over all available data becomes possible because the model can access multiple, semantically different databases and train at image stores located in universities and facilities across Europe.
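As a hedged point of reference for what "extracting a water mask" means: open water backscatters little radar energy to the sensor, so a simple non-learned baseline thresholds SAR backscatter (in dB) below a cutoff. The deep segmentation model described above replaces exactly this kind of heuristic; the threshold value here is illustrative.

```python
WATER_THRESHOLD_DB = -18.0  # illustrative cutoff, scene-dependent in practice

def water_mask(backscatter_db):
    """Per-pixel binary mask: 1 for water (low backscatter), 0 for land."""
    return [[1 if px < WATER_THRESHOLD_DB else 0 for px in row]
            for row in backscatter_db]

# A tiny 2x2 SAR tile of backscatter values in dB.
tile = [[-22.5, -8.0],
        [-19.1, -5.3]]
print(water_mask(tile))  # [[1, 0], [1, 0]]
```

A learned segmentation network improves on this baseline by using spatial context instead of a single global threshold.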
The increasing need for in situ and federated learning
Many industries could benefit from in situ data processing and federated learning. Some examples include:
- Healthcare: In situ data processing can be used to analyze patient data directly on a secure device within a healthcare facility, ensuring compliance with regulations like HIPAA. Federated learning can be used to train models on patient data from multiple hospitals while keeping the data secure and private.
- Manufacturing: In situ data processing can be used to analyze sensor data from manufacturing equipment, allowing for real-time monitoring and maintenance. Federated learning can be used to train models on data from multiple manufacturing sites, allowing for the sharing of best practices and improved efficiency.
- Finance: In situ data processing can be used to analyze financial data directly on a secure server within a financial institution, ensuring compliance with regulations like the Gramm-Leach-Bliley Act. Federated learning can be used to train models on financial data from multiple institutions, allowing for the sharing of best practices and improved risk management.
- Internet of Things (IoT): In situ data processing can be used to analyze sensor data from IoT devices, allowing for real-time monitoring and control. Federated learning can be used to train models on data from multiple IoT devices, allowing for the sharing of best practices and improved decision-making.
- Transportation and logistics: In situ data processing and federated learning can be used to analyze sensor data from vehicles and logistics systems, allowing for real-time monitoring and control, improved decision-making, and route optimization.
- Energy: In situ data processing and federated learning can be used to analyze sensor data from energy systems, allowing for real-time monitoring and control, improved decision-making, and optimized energy distribution.
In conclusion, in situ data analytics is a powerful tool that allows data to be collected and analyzed where it resides, without transferring it to a remote location. This approach has many benefits, including reduced transfer costs, a smaller attack surface for data breaches, and compliance with regulations such as HIPAA. It also enables real-time monitoring and control, which is useful across industries such as healthcare, manufacturing, finance, IoT, transportation, logistics, and energy. Furthermore, when combined with federated learning, in situ data processing enables models to be trained and best practices to be shared across multiple sites or devices without centralizing the data, which can lead to increased efficiency and improved outcomes.
Databloom is a federated data access and analytics company that develops the federated analytics platform "Blossom Sky" to enable decentralized AI. It provides fast and interactive enterprise-ready distribution, consisting of additional tooling and configurations, enabling data scientists and analysts to run AI models and training against various decentralized data sources ranging in size from gigabytes to petabytes. Databloom is a leading contributor to Apache Wayang, the federated data processing engine.
Want to know more? Get in touch with us via databloom.ai/contact.