Our Scientific Data Takeaways From Attending Databricks’ Data+AI Summit

June 20, 2024

The TetraScience team joined thousands of our fellow data practitioners at last week’s Databricks Data+AI Summit in San Francisco. With attendance now topping 60,000 live and online, the summit underscores how successful Databricks has been at building support for its scalable and fast lakehouse architecture for data management. Thanks to open and extensible horizontal data platforms, the era of widespread data intelligence and AI is well underway. 

In case you missed it, TetraScience announced a strategic partnership with Databricks a few weeks ago. We’re joining forces to help life sciences organizations harness Scientific AI to bring more effective and safer therapies to market faster and less expensively. How? By assisting organizations in replatforming and engineering their raw data into AI-native scientific data that comes to life in a proliferating set of powerful AI and analytics use cases.

We had some great conversations at the summit with data and analytics leaders in the life sciences industry. What we learned from those discussions, and from the keynotes and breakouts, boils down to three essential takeaways worth sharing with the TetraScience community.

Every business wants to be an AI business. AI presents a once-in-a-century opportunity to improve drug discovery, development, and manufacturing, just as it does for organizations across the business landscape. AI and machine learning models can find meaning and predictive patterns in massive and complex data sets in a way that no human can, and few industries produce as much complex data as the life sciences. Each of the top 150 global life sciences organizations produces an estimated 50 petabytes per year, an unimaginably large amount of data equal to 25 times all the information in America’s research universities. These organizations also struggle with enormous data silo challenges. Some global biopharma firms have to wrestle with millions of silos due to incompatible and proprietary formats and bespoke data integrations. 

Part of the reason the industry is such a prolific data creator (and victim of so many data silos) is that each step, from discovery to launch, is itself often incredibly complex. Chromatography instruments alone produce a dizzying variety of data, such as compound retention time, peak areas and heights, mass spectra, purity, concentration, and response factors. Or consider a bioprocess such as vaccine production, the subject of a fascinating presentation at the summit by Dr. Sander Timmer, Senior Director of AI at GSK, which produces a wider variety of vaccines than any other drugmaker.

Making a vaccine entails combining cells and nutrients in large stainless-steel bioreactors to produce the necessary antigens. It’s never quite precise, and yield and product quality vary widely from batch to batch. GSK has always sought ways to improve its ability to predict and control the process—AI and machine learning to the rescue. Timmer’s team at GSK fed all of the data from IoT sensors on the tanks and lab instruments into an AI model that can predict what will happen and what actions to take. This so-called digital twin, a virtual replica of the physical process, can quickly and efficiently test different perturbations to the AI model to assess their impact on the product and improve yields.

Timmer’s team recently enhanced the twin with a generative AI model called the TwinOps Copilot that allows GSK’s process owners to query the digital model in plain English, asking questions such as, “Something isn’t working well in the process. Can you show the yield evolution?” Or, “I see missing values in the data; can you impute the missing values and correct the noise?” Timmer’s team has also embraced synthetic data to augment its models and improve their predictive performance. “We can use data and AI to improve every step along the way,” he said.
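To make the second question concrete, here is a minimal Python sketch of what "impute the missing values and correct the noise" could mean for a single sensor series. It is not GSK's implementation: it assumes simple linear interpolation for the gaps and a centered moving average for the noise, and the function names and sample readings are hypothetical.

```python
from typing import List, Optional


def impute_linear(values: List[Optional[float]]) -> List[float]:
    """Fill None gaps by linear interpolation; edge gaps copy the nearest known value."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("need at least one observed value")
    filled: List[float] = []
    for i, v in enumerate(values):
        if v is not None:
            filled.append(float(v))
            continue
        # Nearest observed neighbors on each side of the gap.
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled.append(float(values[right]))
        elif right is None:
            filled.append(float(values[left]))
        else:
            frac = (i - left) / (right - left)
            filled.append(values[left] + frac * (values[right] - values[left]))
    return filled


def moving_average(values: List[float], window: int = 3) -> List[float]:
    """Smooth with a centered moving average; the window shrinks at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical hourly yield readings with two missing samples.
readings = [1.0, None, 3.0, 4.0, None, 6.0]
filled = impute_linear(readings)
smoothed = moving_average(filled)
```

A production system would likely use model-based imputation and filtering tuned to the process; the point here is only that each plain-English request maps to a well-defined data operation.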

But in reality, nearly all organizations are still in the early stages of their AI journey. In his keynote, Databricks CEO Ali Ghodsi candidly revealed that 85% of the companies on his platform have yet to operationalize their AI model use cases. A late 2023 Gartner study confirmed this shortage of real-world use cases: Some 45% of executives said they had generative AI pilots underway, but only 10% had models in production. The relatively gradual progress of enterprise AI into production underscores the challenges companies face in this space.

What is the stumbling block? It’s the data. Progress toward being “AI-ready” is being severely held back by the state and quality of enterprise data. During his keynote, Nvidia CEO Jensen Huang, whose GPU chips are powering the AI revolution, said as much: “Wrangling data is still a huge part of the equation. AI and data intelligence are still about data quality, formats, and preparation.” 

Modeling from data is relatively easy when the data is clean, well-labeled, and structured correctly. However, data used for modeling often lacks critical information, uniformity, version history, and the contextual metadata necessary for pursuing advanced analytics or AI/ML use cases. “The data estate is still too fragmented,” said Databricks CEO Ghodsi.
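As a rough illustration of what checking for that missing context can look like, the Python sketch below flags records that lack required metadata before they reach a model. The field names in `REQUIRED_FIELDS` and the sample records are hypothetical; a real schema would come from the organization's own data standards.

```python
from typing import Dict, List

# Hypothetical context fields a record must carry to be considered model-ready.
REQUIRED_FIELDS = ("instrument_id", "sample_id", "units", "version", "recorded_at")


def missing_context(record: Dict[str, object]) -> List[str]:
    """Return the required fields that are absent or empty in a record."""
    return [name for name in REQUIRED_FIELDS if record.get(name) in (None, "")]


# Two illustrative records: one complete, one missing units and version history.
records = [
    {"instrument_id": "hplc-01", "sample_id": "S-100", "units": "mAU",
     "version": "3", "recorded_at": "2024-06-12T09:30:00Z", "peak_area": 1523.4},
    {"instrument_id": "hplc-02", "sample_id": "S-101", "units": "",
     "recorded_at": "2024-06-12T09:45:00Z", "peak_area": 1490.1},
]

# Map each sample to the context it is missing; an empty list means ready.
report = {str(r["sample_id"]): missing_context(r) for r in records}
```

Here `report` lists the gaps per sample, so incomplete records can be routed back for enrichment instead of silently degrading a model.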

The problem is especially acute in the life sciences, where any decent-sized pharmaceutical R&D operation will have hundreds of different instruments of different types with a passel of incompatible and proprietary formats. Many life sciences companies have also grown through acquisition, leading to struggles in merging bespoke data integrations and achieving uniformity across labs and systems. Cutting through this complexity is why TetraScience remains steadfastly open and vendor-agnostic in our goal of building the Scientific Data and AI Cloud. We think a global life sciences organization should be able to connect any data source to any data consumer via an industrialized, compliant, and automated platform. 

Pushpendra Arora, Director of Data Analytics and Solutions for Human Health at Merck, gave a 30-minute talk at the summit on his team’s investment in resolving the organization’s data challenges. His team is creating “data products”: collections of data, metadata, and other components curated to serve specific business needs. Data products often share the same frameworks and rules as part of a deployable package that’s accessible, insightful, and actionable. Designed well, data products should produce informed decisions and strategic insights more quickly. In Merck’s case, Arora said the team achieved a 40% faster speed of data-to-insight, a 60% reduction of redundant business rules, and a 30% increase in new use case implementations. (TetraScience also transforms raw data into AI-native data that can support various use cases in analytics, AI, and machine learning.)
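One way to picture a data product that shares rules rather than duplicating them is the minimal Python sketch below. It is an assumption-laden illustration, not Merck's design: the `DataProduct` class, the `SHARED_RULES` registry, and the sample rows are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, float]
Rule = Callable[[Row], bool]

# Hypothetical shared rule library: each rule is defined once and reused
# by every data product instead of being re-implemented per team.
SHARED_RULES: Dict[str, Rule] = {
    "non_negative_yield": lambda row: row.get("yield_g", 0.0) >= 0.0,
    "ph_in_range": lambda row: 0.0 <= row.get("ph", 7.0) <= 14.0,
}


@dataclass
class DataProduct:
    """Data plus the metadata and rules needed to use it for one business need."""
    name: str
    owner: str
    rows: List[Row]
    rule_names: List[str] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)

    def valid_rows(self) -> List[Row]:
        """Return only the rows that pass every rule this product subscribes to."""
        rules = [SHARED_RULES[name] for name in self.rule_names]
        return [row for row in self.rows if all(rule(row) for rule in rules)]


# Illustrative product; the second row fails the shared yield rule.
batch_quality = DataProduct(
    name="batch_quality",
    owner="process-analytics",
    rows=[{"yield_g": 12.5, "ph": 7.1}, {"yield_g": -1.0, "ph": 7.0}],
    rule_names=["non_negative_yield", "ph_in_range"],
    metadata={"source": "bioreactor sensors (hypothetical)"},
)
clean = batch_quality.valid_rows()
```

Because products reference rules by name, a rule fixed in one place is fixed everywhere, which is the kind of consolidation that reduces redundant business rules.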

Having harmonized data is everything. Poor data quality and a lack of standardization add cost and complexity and slow every business’s AI journey. Data issues also greatly affect the quality of the outcomes from AI models. As Nvidia’s Huang said, data wrangling is still a bottleneck to scientific insight, efficient manufacturing, and higher-quality process engineering.

One pharma executive summed up the problem and solution quite well: “There is a need to consolidate all data in a single, easily accessible location with complete traceability, provenance, and version history. It would be great to get all the data to one place where it can be easily findable without worrying about people manipulating it so we can compare results across experiments to identify our golden batch.” 

AI-native data is all about producing more “golden batches” and ROI across the life sciences value chain. Unlocking a return on data and accelerating data intelligence overall are the primary drivers behind our partnership with Databricks: Making it easier for customers to harness the value of their data by ensuring it’s standardized, context-rich, and re-engineered for advanced analytics and AI.