
Engineering the World's Scientific Data: Schema, Taxonomy, Ontology, and the Tetra Way

October 17, 2024

Spin Wang
Founder, President, and CTO

Scientific data is one of the most complex and fragmented datasets in the world. Properly engineering all this data is a decades-long journey. It requires industry-wide collaboration and deep commitment from pure-play data companies like TetraScience, whose mission is to liberate scientific data for automation, analytics, and AI applications. 

We’re just getting started. 

At TetraScience, we take a bottom-up approach to scientific data engineering, focusing on the specific needs of biopharma companies. We start by addressing immediate priorities, then incrementally unlock more value from the customer's data, ultimately scaling this process to the entire enterprise.

This blog post shares our unique approach to data engineering for biopharmas at an enterprise scale.

Replatforming Before Engineering

Before scientific data can be engineered, it first needs to be un-siloed from a plethora of endpoints across the industry. This “data replatforming” step involves centralizing data in the cloud and tagging it with relevant metadata to establish connections with other datasets and provide sufficient lineage to understand the data’s provenance.

Scientific Data Journey

Schema

After replatforming, a unique schema needs to be applied to each kind of scientific data. Intermediate Data Schemas (IDSs), designed by TetraScience, capture and structure essential scientific information from the data source (e.g., instruments, analysis software, informatics applications, and contract organizations). The Tetra Scientific Data and AI Cloud transforms replatformed data into an open, vendor-agnostic JSON file format according to the IDS. This conversion allows data from heterogeneous sources, in proprietary or incompatible formats, to be used in a modern data stack.
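To make this concrete, here is a minimal, purely illustrative sketch of what an IDS-shaped JSON document might look like for a plate reader run. The field names below are hypothetical and simplified; the actual IDS definitions are documented at ids.tetrascience.com.

```python
import json

# Hypothetical, simplified sketch of an IDS-shaped document for a plate reader run.
# Field names are illustrative only; real IDS definitions live at ids.tetrascience.com.
ids_document = {
    "@idsType": "plate-reader",        # kind of scientific data this schema describes (assumed)
    "@idsVersion": "v1.0.0",           # schema version, so downstream code knows what to expect
    "system": {"vendor": "ExampleVendor", "model": "ExampleModel", "serial_number": "SN-1234"},
    "users": [{"name": "jdoe", "type": "operator"}],
    "samples": [{"id": "SAMPLE-001", "batch": "BATCH-42"}],
    "results": [
        {"sample_id": "SAMPLE-001", "measurement": "absorbance", "value": 0.52, "unit": "AU"}
    ],
}

# Vendor-agnostic JSON, ready to be queried or loaded into a modern data stack.
print(json.dumps(ids_document, indent=2))
```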

When TetraScience first began the journey of helping biopharma companies harness the value of their scientific data, customers came to us with a straightforward request: enable us to extract actionable data from our instruments and contract organizations (CxOs). However, replatforming scientific data from hundreds of thousands of silos and defining clear schemas is a formidable task with no shortcuts or magic solutions, and many challenges stand in the way.

Challenges with instrument data:

  • Data is often stored in binary files with no documentation available.
  • Sometimes the data resides in text sections of the files. While this data can be parsed, its structure frequently changes with each instrument update, model variation, or software adjustment.
  • When a Software Development Kit (SDK) is available, which is rare, it’s usually designed for instrument control rather than data access.

Challenges with CRO/CDMO/CMO data:

  • CRO/CDMO/CMOs typically deliver unstructured data as email attachments: PDF reports (often over 100 pages) or spreadsheets.
  • Biopharma companies must invest considerable effort to process and clean this data before it can be used for analysis.

Taxonomy

Taxonomy plays a pivotal role in harmonizing data schemas and accelerating their creation. We build schemas using shared components in an object-oriented approach, ensuring consistency across similar data sources while maintaining the flexibility to accurately reflect each endpoint’s data. This expanding set of components is documented at ids.tetrascience.com, where customers can leverage them to construct their own schemas. Each component element has a precise definition, collectively forming a de facto taxonomy.

A common taxonomy (e.g., for terms like users and samples) makes it easy to search and analyze data across datasets. It also sets the foundation for building "Gold layer" datasets or ontologies, which support advanced analytics and AI applications, as discussed in the next section.
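As a rough illustration of the component idea, the sketch below shows how two schemas could reuse the same "user" and "sample" definitions so those terms mean exactly the same thing across datasets. The names and structure are invented for illustration and are not the actual TetraScience component library.

```python
# Hypothetical sketch: composing schemas from shared components so that "user" and
# "sample" mean the same thing across data sources. Not the actual TetraScience library.
shared_components = {
    "user": {
        "type": "object",
        "properties": {"name": {"type": "string"}, "type": {"type": "string"}},
    },
    "sample": {
        "type": "object",
        "properties": {"id": {"type": "string"}, "batch": {"type": "string"}},
    },
}

def make_instrument_schema(result_fields: dict) -> dict:
    """Build an instrument-specific schema from shared components plus its own result fields."""
    return {
        "type": "object",
        "properties": {
            "users": {"type": "array", "items": shared_components["user"]},
            "samples": {"type": "array", "items": shared_components["sample"]},
            "results": {"type": "array", "items": {"type": "object", "properties": result_fields}},
        },
    }

# Two different instrument schemas, consistent where they overlap, flexible where they differ.
plate_reader_schema = make_instrument_schema({"absorbance": {"type": "number"}})
chromatography_schema = make_instrument_schema({"peak_area": {"type": "number"}})
```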

Our schemas and taxonomies are not static; they continuously evolve. As we build more schemas, patterns emerge, highlighting areas for improvement and refinement. You can learn more about this process in our blog post.

Our taxonomy also incorporates controlled vocabularies for key schema fields, enabling companies to map terms according to their own preferences, given that vocabulary differs across organizations.
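A simplified sketch of how such a mapping could work is shown below; the vocabulary and organization-specific synonyms are invented for illustration.

```python
# Hypothetical controlled vocabulary for one schema field, plus a per-organization mapping layer.
CONTROLLED_VOCABULARY = {"sample_role": {"standard", "control", "blank"}}

# One company's internal wording mapped onto the controlled terms (illustrative).
ORG_SYNONYMS = {"reference": "standard", "qc": "control", "empty_well": "blank"}

def normalize_sample_role(raw_value: str) -> str:
    """Map an organization-specific term to the controlled vocabulary, if possible."""
    value = ORG_SYNONYMS.get(raw_value.lower(), raw_value.lower())
    if value not in CONTROLLED_VOCABULARY["sample_role"]:
        raise ValueError(f"Unrecognized sample role: {raw_value!r}")
    return value

print(normalize_sample_role("QC"))  # -> "control"
```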

Ontology

TetraScience’s lakehouse architecture and Delta tables enable us—and our customers—to create Gold layer datasets, following the Medallion architecture, that support advanced analytics and AI. “Views” can be constructed across schemas to capture the relationship and hierarchy of various data fields within specific scientific workflows or use cases, optimizing the data for consumption. These views become the starting point for our ontology.
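As a rough sketch of what such a view could look like, the snippet below joins two hypothetical Silver (IDS-shaped) tables into a workflow-oriented Gold view. The table and column names are illustrative and are not part of any actual Tetra ontology.

```python
# Illustrative sketch: building a workflow-oriented Gold "view" across two Silver tables.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gold_view_sketch").getOrCreate()

# Stand-ins for Silver-layer tables produced by Raw-to-IDS pipelines (hypothetical data).
spark.createDataFrame(
    [("SAMPLE-001", "2024-10-01T10:00:00", 1532.4)],
    ["sample_id", "injection_time", "peak_area"],
).createOrReplaceTempView("silver_chromatography_results")

spark.createDataFrame(
    [("SAMPLE-001", "BATCH-42", "PROJECT-7")],
    ["sample_id", "batch", "project"],
).createOrReplaceTempView("silver_sample_registry")

# The Gold layer captures the relationships that matter for a specific workflow.
gold_view = spark.sql("""
    SELECT r.sample_id, r.injection_time, r.peak_area, s.batch, s.project
    FROM silver_chromatography_results AS r
    JOIN silver_sample_registry AS s
      ON r.sample_id = s.sample_id
""")
gold_view.show()
```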

For example, TetraScience is actively testing ontologies with early adopters that map to scientific workflows, such as:

  • High-throughput lead screening
  • Cell and gene therapy (CGT) manufacturing 
  • Bioprocess development and optimization 
  • Quality testing 
  • Pre-clinical ADME/Tox studies

We will introduce more advanced ontologies as we iterate with our customers. Each ontology will include a transformation mapping from common endpoint schemas and focus on practical applications for key analytical questions and data-related activities.

Medallion Architecture 

Let’s look at the process of data replatforming and engineering in the Tetra Scientific Data and AI Cloud, using the Medallion architecture as our framework (a brief code sketch of these stages follows the list):

  1. Bronze layer: Data is replatformed.
  2. Silver layer: Raw-to-IDS pipelines engineer the data, generating schematized datasets aligned with standardized taxonomies.
  3. Gold layer: TetraFlow pipelines generate materialized ontological views of datasets to support analytics and AI applications.
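To make the three stages concrete, here is a minimal, purely conceptual sketch. The function names and data shapes are invented for illustration and are not the actual Tetra pipeline APIs.

```python
# Conceptual sketch of the three Medallion stages as plain functions (names are illustrative).
def bronze_replatform(raw_file_bytes: bytes, metadata: dict) -> dict:
    """Bronze: land the raw file in the cloud with lineage metadata attached."""
    return {"raw": raw_file_bytes, "metadata": metadata}

def silver_raw_to_ids(bronze_record: dict) -> dict:
    """Silver: parse the raw file and emit an IDS-shaped, schematized JSON document."""
    # Parsing logic is endpoint-specific; here we just pretend it produced structured fields.
    return {"@idsType": "example-instrument", "results": [], "metadata": bronze_record["metadata"]}

def gold_tetraflow_view(ids_documents: list[dict]) -> list[dict]:
    """Gold: reshape IDS documents into a workflow-oriented, analytics-ready view."""
    return [{"ids_type": d["@idsType"], "n_results": len(d["results"])} for d in ids_documents]

bronze = bronze_replatform(b"\x00\x01", {"instrument": "ExampleModel", "lab": "Site A"})
silver = silver_raw_to_ids(bronze)
gold = gold_tetraflow_view([silver])
```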

One example of an analytics application that leverages this architecture is Chromatography Insights. Explore our blog post or video to learn how it works and the valuable scientific benefits it offers. 

Scientific Data Engineering in the Tetra Lakehouse

Tetra Community Approach

As leaders in scientific data engineering, we have long recognized the importance of community collaboration. To support this, we invested in a comprehensive, enterprise-scale data engineering toolchain. Built on open frameworks like JSON, Python, and Streamlit, this toolchain empowers customers to take a self-service approach—whether by creating new data engineering components or contributing to our growing library.

In 2024, we welcomed our first customer contributors into the TetraScience community. Next year, we plan to expand the group with at least 10 more customers who have completed training.

Why Open Source Falls Short

Our approach is not open source, and here’s why:

Replatforming and engineering scientific data at an enterprise scale for highly regulated industries requires a unique combination of specialized skills, including:

  • In-depth knowledge of endpoint systems
  • Advanced data engineering expertise
  • The ability to identify scientifically relevant metadata
  • Familiarity with complex scientific workflows
  • Robust support for regulatory compliance

Relying on common sense or tools like ChatGPT barely scratches the surface of these complexities. Engineering scientific data demands significant capital investment and specialized knowledge—resources that are challenging to sustain within an open-source model. An open-source approach does not provide the economic structure needed by a commercial company like TetraScience to support such a complex and resource-intensive endeavor.

Open source also poses additional challenges at the enterprise level. Without a dedicated team and commercial support, it becomes difficult to ensure consistent quality aligned with user requirement specifications, predictable release timelines, timely support, and upgrades. Open-source projects lack the “forcing functions” for battle testing, continuous improvement, and validation needed for GxP-compliant use cases. Moreover, establishing formal relationships with vendors demands substantial, long-term effort. This commitment is essential to access the specialized knowledge required for reliably replatforming and engineering complex scientific data, including updates and maintenance.

In contrast, we’ve adopted a selective, white-glove community approach.

Within the TetraScience community, biopharma members can generate or consume data in the IDS format, build their own schemas, use the taxonomy library, and leverage TetraScience’s materialized ontological views (Gold layer) or even create their own. The transformation logic behind these Gold layers—TetraFlow pipelines—is available to the wider TetraScience ecosystem.

We have made substantial investments in customer training and enablement. We also plan to expand our team of Sciborgs within customer organizations and build a rigorous review process for community contributions. This approach ensures strong, long-lasting support for our community.

Allotrope Simple Model (ASM)

For organizations interested in the Allotrope Simple Model (ASM), TetraScience provides a flexible, two-step pathway using IDS. This allows customers to adopt either the IDS or ASM format and even pivot between them depending on the use case. Since Allotrope recently incorporated JSON schema into ASM, TetraScience can efficiently map IDS transformations to ASM, facilitating compatibility.

Here’s a bit of background.

TetraScience pioneered the concept of the Intermediate Data Schema as a stepping stone toward the Allotrope model, customers' internal data standards, or customized data views—hence the term “intermediate.” While we initially aimed to adopt Allotrope Data Format (ADF) standards, such as HDF5 or RDF, two main challenges prevented this:

  • Format complexity: We advocated for simplifications, but these efforts did not gain traction.
  • Vendor adoption: Instrument manufacturers generally lack incentives to support data standardization, as their portfolios rely on closed ecosystems that are difficult to integrate. Many vendors maintain proprietary formats to differentiate their software offerings and ensure customer lock-in.

While the Allotrope Data Model organizes data by instrument type or lab technique—a useful approach for harmonizing data across vendors—it can lead to data loss or over-harmonization. For instance, chromatography systems from Agilent, ThermoFisher, Shimadzu, and Waters each contain unique data fields, making strict standardization impractical.

In contrast, IDS is designed to capture all relevant information from each endpoint faithfully, accommodating instrument-specific details. Using common schema components, we achieve a similar level of harmonization while retaining flexibility.

Over the past few years, Allotrope has incorporated JSON schema into the ASM, allowing TetraScience to programmatically map IDS to ASM, simplifying the transformation.
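Conceptually, that mapping can be as simple as a schema-driven, field-by-field transformation between two JSON shapes. The sketch below is purely illustrative; neither the IDS nor the ASM field names shown are the real specifications.

```python
# Illustrative only: a field-level transform from an IDS-shaped document to an ASM-like
# structure. Both are JSON, which is what makes a programmatic mapping straightforward.
def ids_to_asm(ids_doc: dict) -> dict:
    """Sketch of an IDS -> ASM transformation; all field names here are hypothetical."""
    return {
        "device system document": {
            "device identifier": ids_doc["system"]["model"],
            "vendor": ids_doc["system"]["vendor"],
        },
        "measurement documents": [
            {"sample identifier": r["sample_id"], "value": r["value"], "unit": r["unit"]}
            for r in ids_doc["results"]
        ],
    }

example_ids = {
    "system": {"vendor": "ExampleVendor", "model": "ExampleModel"},
    "results": [{"sample_id": "SAMPLE-001", "value": 0.52, "unit": "AU"}],
}
print(ids_to_asm(example_ids))
```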

As a result, we recommend a two-step approach to ASM: first, engineer raw data into IDS, then transform IDS into ASM. This approach offers several key benefits:

  • Scalable infrastructure: Access the world’s largest, fastest-growing, purpose-built library of components for data replatforming and engineering.
  • Advanced querying capabilities: Query ASM data within the lakehouse architecture using analytics and AI compute engines, such as Snowflake or Databricks.
  • High flexibility: Choose IDS or ASM for downstream applications based on use-case requirements.

Conclusion

Engineering scientific data at an enterprise level is a complex and ongoing journey—one that TetraScience is committed to leading. Through our unique approach to schema, taxonomy, and ontology development, we empower biopharma companies to fully utilize their scientific data for analytics and AI.

Ready to learn more about the Tetra IDS? Read this detailed blog post.