Scientific Data and the “data layer”
Enterprise biopharma companies frequently have two interrelated goals: They strive to use analytics and AI to accelerate and improve scientific outcomes, and they seek to capitalize on their scientific data to produce differentiating and innovative applications.
To achieve these goals, organizations must usher their scientific data through a journey of data maturity—a journey with an immutable order of operations represented as a layered pyramid below.
Your journey with scientific data - Unlock value at each level
Enterprise biopharma companies must first replatform and engineer all of their scientific data. To achieve step-function improvements in science by leveraging AI, you must progress one layer at a time.
The bottom two layers of this pyramid are the critical foundation to enable the analytics and AI layers. For simplicity, we will call them the data layers.
How do biopharmas advance upward through these data layers? In an industry with complex data and intricate workflows, it’s common for companies to struggle to find an approach that will meet their long-term needs. In this blog, we will elaborate on various patterns and trade-offs.
Within biopharmas, there are two sub-optimal strategies that organizations use to create “data layers.”
- Implement a classic scientific data management system (SDMS)
- Repurpose a laboratory informatics application (such as ELN/LIMS/MES/LES) as the data layer
For background, see our blog on the limitations imposed by a classic SDMS and why it has not been widely adopted. If you are unfamiliar with the constraints of an SDMS and how a scientific data cloud addresses them, please read that blog first.
For this article, we will focus on the pros and cons of the second option: repurposing a laboratory informatics application as the data layer.
Laboratory informatics application landscape
Lab informatics applications are crucial systems in the scientific data ecosystem. Scientists interact with these applications daily, if not hourly. These systems house company intellectual property (IP) records, ensure compliance with regulations, and store extremely valuable lab-level information about samples, assays, experiment procedures, crystal structures, computational model descriptions, and strategic observations.
Competition
There are numerous informatics application vendors, and they compete fiercely for market position.
- Apprentice.io
- Benchling
- BIOVIA (part of Dassault Systèmes)
- Dotmatics
- Genedata
- IDBS (acquired by Danaher)
- Labguru
- LabVantage
- LabWare
- Revvity (previously PerkinElmer Informatics)
- Sapio Sciences
- Scilligence
- SciY - Bruker’s amalgamation of Arxspan, Zontal, Optimal, and MestreLab
- Siemens (acquired Riffyn)
- Thermo Fisher Scientific
- Uncountable
- Veeva (now entering into the LIMS space)
With steady competition and the diverse needs and preferences of scientists, there has been tremendous growth in the world of laboratory informatics applications. New vendors are entering the space (see Veeva) and existing vendors are expanding their functionality to cover more of the daily requirements of scientists. (Think Office365 versus Google Workspace, except there are many more options.)
One common expansion that vendors are attempting to deliver is data-layer capabilities. For example, Dotmatics introduced its Luma R&D management platform. Sapio created its Jarvis data management solution. Benchling introduced the Benchling Connect data management solution. BIOVIA attempted to address this space through the combination of BIOVIA Pipeline Pilot and the 3DEXPERIENCE platform. LabWare and LabVantage, as more traditional LIMS vendors, provide data-layer functionality through services.
Simple data landscape and simple vendor ecosystem
Structural limitations of informatics applications providing data layer functionality
1. Data tourism
Informatics application providers have traditionally focused on the scientific workflows in the lab, primarily centered around lab instruments. They seek to deliver a seamless lab experience for scientists with applications rich in user interfaces and visualizations. Providing data capabilities is neither their top priority nor a core strength, so they cannot dedicate enough resources to support the data layer. The main reason most providers have entered this space is to mitigate the competitive pressures on their core products.
To provide an enterprise-grade data layer, vendors would need to deliver the following capabilities:
- A comprehensive, growing library of data replatforming, data integration, and data engineering components that is continuously maintained and improved.
- A flexible bidirectional data integration framework and orchestrated data flow framework.
- A diverse set of integrations with a plethora of instrument, lab application, and data analytics vendors.
- A data governance and data engineering infrastructure for enterprise support that is high throughput and supports a variety of data types, such as unstructured, binary, time-series, and log data.
- Data sharing and life cycle management to support the complexity of scientific use cases, collaboration, and compliance needs.
- The ability to support data from a variety of third-party vendors, including competitors.
Because they cannot invest in and focus on the data layer, informatics application providers are “data tourists.” The data layer is not at the core of their business. They dabble here because of competitive pressures from other lab informatics vendors, but they will inevitably retrench due to new competitive pressures from true data companies. Most of the data layers provided by informatics applications are limited to mere middleware capabilities, lacking the essentials for analytics/AI-readiness and enterprise scalability. In contrast, TetraScience’s focus on data, and data only, enables us to invest in the world’s largest, fastest-growing, purpose-built library of integrations and data models.
2. The “walled garden” and vendor lock-in
Informatics application vendors are inherently biased when it comes to lab software—and sometimes lab instruments—as their primary incentive is to sell as many instruments or software seats as possible. As a result, these vendors cannot maintain true openness, neutrality, and data liquidity.
Vendors design application-specific data layers, which are optimized to deliver data to one or a few applications (often within their own proprietary environment). This approach locks users into a closed environment, which further perpetuates data fragmentation for biopharma organizations that use multiple informatics applications. As a result, scientific data can be difficult, perhaps impossible, to move outside the vendor’s “walled garden,” restricting how you can use your data.
For TetraScience, our mission and company goal make this dynamic simpler:
- We have no agenda other than liberating biopharma companies’ scientific data.
- We are not in the application or hardware business. For example, we do not create or sell LIMS, ELN, or MES solutions, or lab instruments.
- We do not compete with informatics application vendors’ core application business.
TetraScience is committed to true data liquidity. Our customers can use whichever informatics application has the best user experience and/or adds the greatest value to their data. The Tetra Scientific Data Cloud facilitates the flow of data to those (best-of-breed) applications, regardless of developer.
3. Incomplete data and variation
An informatics application’s data layer is designed and implemented to send data into that vendor’s particular informatics application. The data layer does not need to replatform or engineer all the available scientific data, since only a subset will be used within the informatics application. That data subset is determined with only a handful of use cases in mind. But for AI/analytics purposes, it is crucial to include all data, not just the data that is relevant to a specific informatics application. Applications are designed to go deep and specialize in certain data sets. That’s where applications excel. AI and analytics applications require the opposite approach. They need access to large-scale, processed data to achieve meaningful, high-quality scientific insights.
Since informatics applications are used directly by scientists, many experiments are intentionally designed to capture and record a limited set of attributes. Take a chromatography data system (CDS), for example. Some applications’ data layers retrieve peak properties and structure that data by parsing a CSV file exported from the CDS. However, more meaningful scientific data may be excluded, such as method parameters, logs, user audit trails, full chromatograms, MS spectra, column information, and hardware information. All this data can be extremely useful to inform decisions or unlock insights. But those outcomes are unlikely if only a fraction of the data, defined for a particular scientific workflow at a particular time, feeds into analytics or AI tools. An application-driven data layer does not future-proof data, limiting the utility of advanced analytics and AI.
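To make the CDS example concrete, here is a minimal Python sketch of the gap between a peak-table CSV export and a fuller run record. Everything in it is hypothetical and for illustration only: the CSV columns, the `RUN_CONTEXT` fields, and the `EXPECTED` category names are assumptions, not the export format of any particular CDS or the schema of any vendor.

```python
import csv
import io

# Hypothetical CSV export from a CDS: only the peak table survives.
CSV_EXPORT = """peak_name,retention_time_min,area
caffeine,3.42,15230.5
theobromine,2.10,8421.0
"""

# Hypothetical run context that a peak-table export would leave behind.
RUN_CONTEXT = {
    "method_parameters": {"flow_rate_ml_min": 1.0, "column_temp_c": 30},
    "column": {"serial": "COL-001"},
    "audit_trail": ["method edited by user A"],
}

# Hypothetical list of data categories an analytics team might want.
EXPECTED = ["peaks", "method_parameters", "column", "audit_trail"]


def parse_peak_csv(text):
    """Parse the peak table, converting numeric columns to float."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "peak_name": row["peak_name"],
            "retention_time_min": float(row["retention_time_min"]),
            "area": float(row["area"]),
        })
    return rows


def build_full_record(peaks, context):
    """Combine peak results with the run context into one record."""
    return {"peaks": peaks, **context}


def missing_categories(record, expected):
    """List the expected categories that the record does not contain."""
    return sorted(set(expected) - set(record))


csv_only = {"peaks": parse_peak_csv(CSV_EXPORT)}
full = build_full_record(csv_only["peaks"], RUN_CONTEXT)

print(missing_categories(csv_only, EXPECTED))
print(missing_categories(full, EXPECTED))
```

The CSV-only record reports the method parameters, column information, and audit trail as missing, while the combined record reports nothing missing. In practice the omitted categories would also include items such as full chromatograms and MS spectra, which this sketch leaves out for brevity.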
Another downside of using informatics applications for data layer functionality is data variation. These applications are highly customizable, leading scientists to use inconsistent naming conventions and taxonomies. However, analytics and AI applications demand data consistency. Organizations cannot fully capitalize on analytics and AI if they attempt to use data sets with heterogeneous schemas.
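The naming problem described above can be illustrated with a small Python sketch that maps inconsistent field names onto one canonical set. The alias table and field names are hypothetical, chosen only for illustration; this is not how any specific product harmonizes data, and real taxonomies are far larger.

```python
# Hypothetical alias table: canonical field name -> spellings seen in practice.
FIELD_ALIASES = {
    "sample_id": {"sample_id", "sampleid", "sample id", "sample_no"},
    "concentration_mg_ml": {"concentration_mg_ml", "conc", "concentration"},
}


def canonical_name(key):
    """Return the canonical field name for a key, or None if unknown."""
    normalized = key.strip().lower().replace("-", "_")
    for canonical, aliases in FIELD_ALIASES.items():
        if normalized in aliases:
            return canonical
    return None


def harmonize(record):
    """Split a record into canonically named fields and unmapped leftovers."""
    mapped, unmapped = {}, {}
    for key, value in record.items():
        canonical = canonical_name(key)
        if canonical is None:
            unmapped[key] = value
        else:
            mapped[canonical] = value
    return mapped, unmapped


# Two records describing the same kind of measurement with different names.
record_a = {"SampleID": "S-1", "Conc": 1.5}
record_b = {"sample id": "S-2", "concentration": 2.0, "Notes": "rerun"}

mapped_a, unmapped_a = harmonize(record_a)
mapped_b, unmapped_b = harmonize(record_b)

print(mapped_a, unmapped_a)
print(mapped_b, unmapped_b)
```

After harmonization, both records expose the same keys (`sample_id` and `concentration_mg_ml`), and the unrecognized `Notes` field is set aside rather than silently dropped. Keeping unmapped fields visible is a deliberate choice in this sketch, since discarding them would recreate the incomplete-data problem described earlier.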
For the Tetra Scientific Data Cloud, we chose a “data first” design instead of an “application first” one: the entire organization and software architecture treat scientific data as the first-class, leading entity. We leave the exploration, visualization, and interpretation of such data to best-of-breed, innovative informatics application providers. For example, our “Intermediate Data Schema” (IDS) is designed to capture as much scientific data as possible. TetraScience maintains the documentation for these schemas while upgrading and evolving them across common scientific data sets. As a result, the engineered data provided by a TetraScience-powered data layer is comprehensive, stable, and consistent. Importantly, it can be leveraged for advanced analytics and AI.
Summary
To improve scientific outcomes and drive innovation using analytics and AI, biopharma organizations must shepherd their scientific data through the data maturity journey. Though informatics applications may appear adequate to address the requirements of the data layer, they pose several structural challenges that are prevalent across the industry.
Enterprise organizations committed to harnessing the power of analytics and AI recognize the necessity of large-scale, liquid, compliant, and purpose-engineered scientific data. They understand the criticality of separating the data layer from the application layer by leveraging a scientific data cloud.
If you want to learn more about how the Tetra Scientific Data Cloud helps leading biopharmas mature data and achieve AI outcomes, watch our latest video.