Over the last four years, TetraScience has been replatforming and engineering scientific data from hundreds of thousands of data silos. Our goal is to combat the inherent fragmentation of the scientific data ecosystem to unlock Scientific AI.
The approach we’ve taken—leveraging productization, using a data stack designed for scientific data, and adhering to a vendor-agnostic business model—has been very challenging. However, we firmly believe that our strategy is the only way to generate the large-scale, liquid, purpose-engineered datasets that Scientific AI requires.
In this blog, we’ll share where we stand today, what we’ve learned, and what we’re planning to do next.
It’s all about scientific data and it’s about all your data
TetraScience started its mission by replatforming and engineering one of the most notoriously challenging types of scientific data: instrument data. These datasets are often highly fragmented and largely trapped in vendor-proprietary or vendor-specific formats.
Over time, the Tetra library of integrations and data schemas has greatly expanded beyond instrument data and most notably includes:
- Experimental context and design via bi-directional ELN/LIMS integration
- Analysis results via apps accessed through the Tetra Data Workspace
- Data from contract research organizations (CROs) and contract development and manufacturing organizations (CDMOs)
For each endpoint system, TetraScience aims to extract as much of the scientifically meaningful data as possible. In a previous blog post, we shared why our strategy is fundamentally data-driven and AI-native in contrast to a middleware approach that is application-driven and, therefore, more limited by nature.
The largest library and highest coverage
Considering the vast array of scientific data sources within the biopharma industry, how does TetraScience’s library of productized integrations and data schemas measure up? Let's evaluate our library from three different perspectives.
For a typical biopharma
Since the beginning of 2024, three of the many biopharmaceutical companies we are partnering with have shared their instrument inventory lists with us. This allowed us to identify how to replatform their scientific data to the cloud and enable analytics and AI.
Upon analyzing their lists, we discovered that, on average, TetraScience already supported integrations for over 95 percent of their lab instruments. Thus, by and large, we could fulfill their immediate needs for lab data automation and scientific data management.
Our process of engineering scientific data involves harmonizing vendor-specific data into Tetra Data, an open, vendor-agnostic format. Initially, our data schemas supported roughly 60 percent of their instruments. The resulting Tetra Data allows these organizations to harness their scientific data for analytics and AI whenever they are ready, thereby future-proofing their data strategy.
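To make the harmonization step concrete, here is a minimal sketch of mapping a vendor-specific export row into a vendor-agnostic record. The export columns, field names, and record structure are illustrative assumptions only; they do not represent the actual Tetra Data or IDS schema.

```python
# Illustrative sketch only: the export columns, field names, and record structure
# below are hypothetical and do not represent the actual Tetra Data or IDS schema.
from datetime import datetime, timezone

def harmonize_plate_reader_row(vendor_row: dict) -> dict:
    """Map one vendor-specific result row into a vendor-agnostic record."""
    return {
        "instrument": {
            "vendor": vendor_row.get("Instrument Manufacturer"),
            "model": vendor_row.get("Instrument Model"),
            "serial_number": vendor_row.get("SN"),
        },
        "sample": {
            "id": vendor_row.get("SampleID"),
            "well": vendor_row.get("Well Position"),
        },
        "measurement": {
            "type": "absorbance",
            "wavelength_nm": float(vendor_row["Wavelength"]),
            "value": float(vendor_row["OD"]),
        },
        # Normalize the vendor's local-format timestamp to ISO 8601 UTC.
        "acquired_at": datetime.strptime(vendor_row["Read Time"], "%m/%d/%Y %H:%M:%S")
        .replace(tzinfo=timezone.utc)
        .isoformat(),
    }

# A hypothetical vendor export row, as it might appear in a CSV or report file.
row = {
    "Instrument Manufacturer": "ExampleVendor",
    "Instrument Model": "PR-9000",
    "SN": "12345",
    "SampleID": "S-001",
    "Well Position": "A1",
    "Wavelength": "450",
    "OD": "0.482",
    "Read Time": "03/15/2024 14:02:11",
}
print(harmonize_plate_reader_row(row))
```

The value of this kind of mapping is that downstream analytics and AI work against one consistent structure, regardless of which vendor produced the raw file.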
Digging into their instrument priorities, we found that our library covered 100 percent of the highest-priority instruments for data replatforming and data engineering. This is unsurprising: TetraScience's library has been developed based on customer requests and thus supports the most popular scientific use cases. As a result, it serves biopharma's expected business outcomes and provides significant value.
For a popular endpoint category
Another way to illustrate TetraScience’s coverage is to map the TetraScience library to common endpoint categories. Here are some examples:
For an end-to-end scientific workflow
Next, let's consider a scientific use case. Bioprocessing involves a series of carefully controlled and optimized steps to produce pharmaceutical products from living cells. A typical workflow is divided into three stages: upstream processing, downstream purification, and characterization and critical quality attribute (CQA) monitoring. Throughout this process, a large number and diversity of scientific data sources are used. Below, we list the endpoints from the Tetra library that cover each stage of the bioprocessing workflow end to end:
Upstream Bioprocessing (Cell Culture Fermentation)
Downstream Bioprocessing and Purification
Characterization and CQA Monitoring
The fastest-growing library
For any data sources not yet supported by TetraScience, you can read about our approach in this blog post: What customers need to know about Tetra Integrations and Tetra Data Schema. In short:
- TetraScience publishes our library and roadmap transparently so you know exactly what is currently in the library and how we plan to grow it: Data replatforming/integration library (layer 1) and Data engineering library (layer 2)
- TetraScience provides monthly newsletters announcing the latest releases for data integrations and data models: TetraConnect News
- Customers can request to add or accelerate items in the roadmap. TetraScience prioritizes requests based on criticality and impact. If there is a component to productize, TetraScience will create and maintain it for all customers.
In the last two years, TetraScience has delivered more than fifty new components or material improvements to our library every six months. With the introduction of our Pluggable Connector Framework, TetraScience will accelerate this tempo even further.
TetraScience also publishes guidelines to select customers on how to build Intermediate Data Schemas (IDSs), accelerating their ability to extend the library. For example, this video from our training team teaches users how to create their own pipelines: Tetra Product Short Cuts: Self-Service Tetra Data Pipelines.
In addition to industrializing components for our library, TetraScience has rolled out ready-to-use validation scripts as part of our GxP Package. Our verification and validation (V&V) document set is designed to help customers save as much as 80 percent of their validation effort, allowing them to focus on the “last mile validation.”
Learn more about some components of our data replatforming and engineering library:
- Benchling Notebook
- BioRad ddPCR
- Molecular Devices SoftMax Pro
- Perkin Elmer UV Winlab
- Thermo Fisher Chromeleon
- Waters Empower: Tetra Empower Agent v5.1 and v5.2
- Revvity Signals
- Benchling Notebook <> Thermo Fisher Chromeleon bidirectional integration
- Revvity Signals <> Metrohm Tiamo/Mettler Toledo LabX bi-directional integration
The only purpose-built library
Our journey will never be complete. However, we're eager to share the significant investment and work TetraScience has put into fulfilling our promise to the industry. This commitment involves combining our expertise in technology, data, and science to deliver material impact.
Understand and overcome the limitations of endpoint systems
Most of these systems are not designed to interface with data analytics and AI. Their primary function is to execute specific scientific workflows, rather than to preserve and surface information for analytics or AI. Here are some of the most challenging situations we have observed:
- Change detection is extremely difficult for common lab data systems, such as chromatography data systems (CDS). A typical CDS controls hundreds of HPLCs, holds data from thousands of projects, and supports hundreds to thousands of scientists. As a result, it can be virtually impossible to efficiently detect modifications or new runs (see the sketch after this list).
- Binary data files are a prevalent vendor choice, and they are readable only inside the vendor's own analysis software. Sometimes these vendors provide a software development kit (SDK). However, because the instrument control software must be installed alongside it for the SDK to function, it does not qualify as a true SDK. Vendors also often restrict third parties from using key libraries in the SDK.
- Data interfaces are often undocumented, incorrect, or not designed for analytics or data automation. For example, some lab data systems return incorrect or conflicting data when accessed through different interfaces, or fail to handle periodic polling on the order of minutes. Anticipating or reproducing these scenarios is often impossible without large-scale data or real lab facilities.
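To illustrate the first point, below is a minimal sketch of watermark-based change detection, the kind of incremental polling a data platform needs from an endpoint. The record shape, field names, and timestamps are assumptions for illustration; they do not describe any specific CDS or TetraScience interface. Note that this approach only works when the endpoint exposes reliable modification timestamps or change flags, which is exactly what many lab data systems lack.

```python
# Minimal sketch of watermark-based change detection. The record shape, field
# names, and timestamps are hypothetical; they do not describe any specific CDS
# or TetraScience interface.
from datetime import datetime, timezone

def detect_changes(records: list[dict], watermark: datetime) -> tuple[list[dict], datetime]:
    """Return records modified after `watermark`, plus the advanced watermark."""
    changed = [r for r in records if r["modified_at"] > watermark]
    new_watermark = max((r["modified_at"] for r in changed), default=watermark)
    return changed, new_watermark

# Simulated endpoint contents; in practice each polling cycle (on the order of
# minutes) would query the endpoint for runs modified since the last watermark.
runs = [
    {"id": "run-001", "modified_at": datetime(2024, 3, 15, 9, 0, tzinfo=timezone.utc)},
    {"id": "run-002", "modified_at": datetime(2024, 3, 15, 9, 7, tzinfo=timezone.utc)},
]

watermark = datetime(2024, 3, 15, 9, 5, tzinfo=timezone.utc)
changed, watermark = detect_changes(runs, watermark)
print([r["id"] for r in changed])  # -> ['run-002']
```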
Understand the science and the scientific purpose
A typical approach in the industry is to focus on the integration of instruments and scientific applications without considering the larger picture of the scientific use case. While having many industrialized and validated integrations and data schemas is undeniably essential, it is critical to also have the scientific workflow and purpose in mind.
- What is the scientific end-to-end data workflow of the scientist?
- Is this workflow part of a larger process?
- What does the scientist want to achieve, and what is the desired outcome?
- Which data and results are relevant to achieve it?
- What is the relevant scientific metadata?
- For which purpose does the scientist need this metadata today (e.g., search, data aggregation, analytics)?
- How might the scientist want to leverage this data later in different applications?
- What other purposes might the scientist have for the data in the future?
- What are the functional and nonfunctional requirements to fulfill the scientific use cases?
Being able to answer these questions helps create the best possible data workflows using suitable integrations and data schemas. To ensure our library is purpose-built for science, 48 percent of TetraScience's staff have a scientific background and 54 percent have advanced degrees (MS or Ph.D.).
Mimic a scientific workflow via live instrument testing
One of the most important lessons we have learned is that scientific data formats vary widely depending on the configurations, assays, hardware modules, and operating systems used. As a result, TetraScience has started to contract with various institutions to perform live testing while scientists conduct real scientific workflows. This ensures that our integrations and schemas perform as intended, delivering value to scientific workflows.
Next steps
TetraScience has invested, and will continue to invest, in differentiating capabilities for the replatforming and engineering of scientific data from hundreds of thousands of data silos. This endeavor combats the widespread fragmentation of the scientific data ecosystem. In 2024, we will:
- Continue to evolve our foundational schema component library
- Adapt our existing schemas to strengthen harmonization across specific scientific workflows and data sources
- Evolve platform architecture to accelerate expansion of the data engineering library
- Rapidly detect edge cases and remediate them through alerting and monitoring
- Deploy and manage components at scale from a centralized platform
- Perform exploratory testing focused on scientific use cases
We are on this journey together
Each of the endpoint systems holds your scientific data. In the modern data world, every organization like yours is demanding seamless access to their data and liquid data flow in and out of these systems. The "tax or tariff" that certain vendors put on your data is no longer acceptable—nor does it have to be. This endpoint-centric data ownership mindset fundamentally does not work in our era of data automation, data science, and AI. Industrializing the building blocks of data replatforming and engineering is inevitable for the industry to move forward.
TetraScience provides the vehicle for this paradigm shift. Where there is a tax or tariff on your data, we encourage every biopharma organization to take an active role in the success of its own data journey. For example, you can:
- Submit justified requests to your endpoint provider, insisting that your data be freely accessible along with the related documentation.
- Involve TetraScience in your planning process. TetraScience can help ensure that your agreements with endpoint vendors include sufficient requirements for openness and liquidity of your data generated or stored in these endpoint systems.
We are on this journey together. Contact one of our experts today.