
Engineering the World's Scientific Data: Schema, Taxonomy, Ontology, and the Tetra Way

October 17, 2024

Spin Wang
Founder, President, and CTO

Scientific data is one of the most complex and fragmented datasets in the world. Properly engineering all this data is a decades-long journey. It requires industry-wide collaboration and deep commitment from pure-play data companies like TetraScience, whose mission is to liberate scientific data for automation, analytics, and AI applications. 

We’re just getting started. 

At TetraScience, we take a bottom-up approach to scientific data engineering, focusing on the specific needs of biopharma companies. We start by addressing immediate priorities, then incrementally unlock more value from the customer's data, ultimately scaling this process to the entire enterprise.

This blog post shares our unique approach to data engineering for biopharmas at an enterprise scale.

Replatforming Before Engineering

Before scientific data can be engineered, it first needs to be un-siloed from a plethora of endpoints across the industry. This “data replatforming” step involves centralizing data in the cloud and tagging it with relevant metadata to establish connections with other datasets and provide sufficient lineage to understand the data’s provenance.

Scientific Data Journey

Schema

After replatforming, a unique schema needs to be applied to each kind of scientific data. Intermediate Data Schemas (IDSs), designed by TetraScience, capture and structure essential scientific information from the data source (e.g., instruments, analysis software, informatics applications, and contract organizations). The Tetra Scientific Data and AI Cloud transforms replatformed data into an open, vendor-agnostic JSON file format according to the IDS. This conversion allows data from heterogeneous sources, in proprietary or incompatible formats, to be used in a modern data stack.
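To make this concrete, here is a minimal, purely illustrative sketch of what an IDS-shaped JSON document might look like for a plate reader run. The field names below are hypothetical and simplified; the actual IDS definitions are documented at ids.tetrascience.com.

```python
import json

# Hypothetical, simplified sketch of an IDS-shaped document for a plate reader run.
# Field names are illustrative only; real IDS definitions live at ids.tetrascience.com.
ids_document = {
    "@idsType": "plate-reader",        # kind of scientific data this schema describes (assumed)
    "@idsVersion": "v1.0.0",           # schema version, so downstream code knows what to expect
    "system": {"vendor": "ExampleVendor", "model": "ExampleModel", "serial_number": "SN-1234"},
    "users": [{"name": "jdoe", "type": "operator"}],
    "samples": [{"id": "SAMPLE-001", "batch": "BATCH-42"}],
    "results": [
        {"sample_id": "SAMPLE-001", "measurement": "absorbance", "value": 0.52, "unit": "AU"}
    ],
}

# Vendor-agnostic JSON, ready to be queried or loaded into a modern data stack.
print(json.dumps(ids_document, indent=2))
```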

When TetraScience first began the journey of helping biopharma companies harness the value of their scientific data, customers came to us with a straightforward request: enable us to extract actionable data from our instruments and contract organizations (CxOs). However, replatforming scientific data from hundreds of thousands of silos and defining clear schemas is a formidable task with no shortcuts or magic solutions, and many challenges stand in the way.

Challenges with instrument data:

  • Data is often stored in binary files with no documentation available.
  • Sometimes the data resides in text sections of the files. While this data can be parsed, its structure frequently changes with each instrument update, model variation, or software adjustment.
  • When a Software Development Kit (SDK) is available, which is rare, it’s usually designed for instrument control rather than data access.

Challenges with CRO/CDMO/CMO data:

  • CRO/CDMO/CMOs typically deliver unstructured data as email attachments: PDF reports (often over 100 pages) or spreadsheets.
  • Biopharma companies must invest considerable effort to process and clean this data before it can be used for analysis.

Taxonomy

Taxonomy plays a pivotal role in harmonizing data schemas and accelerating their creation. We build schemas using shared components in an object-oriented approach, ensuring consistency across similar data sources while maintaining the flexibility to accurately reflect each endpoint’s data. This expanding set of components is documented at ids.tetrascience.com, where customers can leverage them to construct their own schemas. Each component element has a precise definition, collectively forming a de facto taxonomy.

A common taxonomy (e.g., for terms like users and samples) makes it easy to search and analyze data across datasets. It also sets the foundation for building "Gold layer" datasets or ontologies, which support advanced analytics and AI applications, as discussed in the next section.
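As a rough illustration of the component idea, the sketch below shows how two schemas could reuse the same "user" and "sample" definitions so those terms mean exactly the same thing across datasets. The names and structure are invented for illustration and are not the actual TetraScience component library.

```python
# Hypothetical sketch: composing schemas from shared components so that "user" and
# "sample" mean the same thing across data sources. Not the actual TetraScience library.
shared_components = {
    "user": {
        "type": "object",
        "properties": {"name": {"type": "string"}, "type": {"type": "string"}},
    },
    "sample": {
        "type": "object",
        "properties": {"id": {"type": "string"}, "batch": {"type": "string"}},
    },
}

def make_instrument_schema(result_fields: dict) -> dict:
    """Build an instrument-specific schema from shared components plus its own result fields."""
    return {
        "type": "object",
        "properties": {
            "users": {"type": "array", "items": shared_components["user"]},
            "samples": {"type": "array", "items": shared_components["sample"]},
            "results": {"type": "array", "items": {"type": "object", "properties": result_fields}},
        },
    }

# Two different instrument schemas, consistent where they overlap, flexible where they differ.
plate_reader_schema = make_instrument_schema({"absorbance": {"type": "number"}})
chromatography_schema = make_instrument_schema({"peak_area": {"type": "number"}})
```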

Our schemas and taxonomies are not static; they continuously evolve. As we build more schemas, patterns emerge, highlighting areas for improvement and refinement. You can learn more about this process in our blog post.

Our taxonomy also incorporates controlled vocabularies for key schema fields, enabling companies to map terms according to their own preferences, given that vocabulary differs across organizations.
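A simplified sketch of how such a mapping could work is shown below; the vocabulary and organization-specific synonyms are invented for illustration.

```python
# Hypothetical controlled vocabulary for one schema field, plus a per-organization mapping layer.
CONTROLLED_VOCABULARY = {"sample_role": {"standard", "control", "blank"}}

# One company's internal wording mapped onto the controlled terms (illustrative).
ORG_SYNONYMS = {"reference": "standard", "qc": "control", "empty_well": "blank"}

def normalize_sample_role(raw_value: str) -> str:
    """Map an organization-specific term to the controlled vocabulary, if possible."""
    value = ORG_SYNONYMS.get(raw_value.lower(), raw_value.lower())
    if value not in CONTROLLED_VOCABULARY["sample_role"]:
        raise ValueError(f"Unrecognized sample role: {raw_value!r}")
    return value

print(normalize_sample_role("QC"))  # -> "control"
```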

Ontology

TetraScience’s lakehouse architecture and Delta tables enable us—and our customers—to create Gold layer datasets, following the Medallion architecture, that support advanced analytics and AI. “Views” can be constructed across schemas to capture the relationship and hierarchy of various data fields within specific scientific workflows or use cases, optimizing the data for consumption. These views become the starting point for our ontology.
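As a rough sketch of what such a view could look like, the snippet below joins two hypothetical Silver (IDS-shaped) tables into a workflow-oriented Gold view. The table and column names are illustrative and are not part of any actual Tetra ontology.

```python
# Illustrative sketch: building a workflow-oriented Gold "view" across two Silver tables.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gold_view_sketch").getOrCreate()

# Stand-ins for Silver-layer tables produced by Raw-to-IDS pipelines (hypothetical data).
spark.createDataFrame(
    [("SAMPLE-001", "2024-10-01T10:00:00", 1532.4)],
    ["sample_id", "injection_time", "peak_area"],
).createOrReplaceTempView("silver_chromatography_results")

spark.createDataFrame(
    [("SAMPLE-001", "BATCH-42", "PROJECT-7")],
    ["sample_id", "batch", "project"],
).createOrReplaceTempView("silver_sample_registry")

# The Gold layer captures the relationships that matter for a specific workflow.
gold_view = spark.sql("""
    SELECT r.sample_id, r.injection_time, r.peak_area, s.batch, s.project
    FROM silver_chromatography_results AS r
    JOIN silver_sample_registry AS s
      ON r.sample_id = s.sample_id
""")
gold_view.show()
```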

For example, TetraScience is actively testing ontologies with early adopters that map to scientific workflows, such as:

  • High-throughput lead screening
  • Cell and gene therapy (CGT) manufacturing 
  • Bioprocess development and optimization 
  • Quality testing 
  • Pre-clinical ADME/Tox studies

We will introduce more advanced ontologies as we iterate with our customers. Each ontology will include a transformation mapping from common endpoint schemas and focus on practical applications for key analytical questions and data-related activities.

Medallion Architecture 

Let’s look at the process of data replatforming and engineering in the Tetra Scientific Data and AI Cloud, using the Medallion architecture as our framework (a brief code sketch of these stages follows the list):

  1. Bronze layer: Data is replatformed.
  2. Silver layer: Raw-to-IDS pipelines engineer the data, generating schematized datasets aligned with standardized taxonomies.
  3. Gold layer: TetraFlow pipelines generate materialized ontological views of datasets to support analytics and AI applications.
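To make the three stages concrete, here is a minimal, purely conceptual sketch. The function names and data shapes are invented for illustration and are not the actual Tetra pipeline APIs.

```python
# Conceptual sketch of the three Medallion stages as plain functions (names are illustrative).
def bronze_replatform(raw_file_bytes: bytes, metadata: dict) -> dict:
    """Bronze: land the raw file in the cloud with lineage metadata attached."""
    return {"raw": raw_file_bytes, "metadata": metadata}

def silver_raw_to_ids(bronze_record: dict) -> dict:
    """Silver: parse the raw file and emit an IDS-shaped, schematized JSON document."""
    # Parsing logic is endpoint-specific; here we just pretend it produced structured fields.
    return {"@idsType": "example-instrument", "results": [], "metadata": bronze_record["metadata"]}

def gold_tetraflow_view(ids_documents: list[dict]) -> list[dict]:
    """Gold: reshape IDS documents into a workflow-oriented, analytics-ready view."""
    return [{"ids_type": d["@idsType"], "n_results": len(d["results"])} for d in ids_documents]

bronze = bronze_replatform(b"\x00\x01", {"instrument": "ExampleModel", "lab": "Site A"})
silver = silver_raw_to_ids(bronze)
gold = gold_tetraflow_view([silver])
```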

One example of an analytics application that leverages this architecture is Chromatography Insights. Explore our blog post or video to learn how it works and the valuable scientific benefits it offers. 

Scientific Data Engineering in the Tetra Lakehouse

Tetra Community Approach

As leaders in scientific data engineering, we have long recognized the importance of community collaboration. To support this, we invested in a comprehensive, enterprise-scale data engineering toolchain. Built on open frameworks like JSON, Python, and Streamlit, this toolchain empowers customers to take a self-service approach—whether by creating new data engineering components or contributing to our growing library.

In 2024, we welcomed our first customer contributors into the TetraScience community. Next year, we plan to expand the group with at least 10 more customers who have completed training.

Why Open Source Falls Short

Our approach is not open source, and here’s why:

Replatforming and engineering scientific data at an enterprise scale for highly regulated industries requires a unique combination of specialized skills, including:

  • In-depth knowledge of endpoint systems
  • Advanced data engineering expertise
  • The ability to identify scientifically relevant metadata
  • Familiarity with complex scientific workflows
  • Robust support for regulatory compliance

Relying on common sense or tools like ChatGPT barely scratches the surface of these complexities. Engineering scientific data demands significant capital investment and specialized knowledge—resources that are challenging to sustain within an open-source model. An open-source approach does not provide the economic structure needed by a commercial company like TetraScience to support such a complex and resource-intensive endeavor.

Open source also poses additional challenges at the enterprise level. Without a dedicated team and commercial support, it becomes difficult to ensure consistent quality aligned with user requirement specifications, predictable release timelines, timely support, and upgrades. Open-source projects lack the “forcing functions” for battle testing, continuous improvement, and validation needed for GxP-compliant use cases. Moreover, establishing formal relationships with vendors demands substantial, long-term effort. This commitment is essential to access the specialized knowledge required for reliably replatforming and engineering complex scientific data, including updates and maintenance.

In contrast, we’ve adopted a selective, white-glove community approach.

Within the TetraScience community, biopharma members can generate or consume data in the IDS format, build their own schemas, use the taxonomy library, and leverage TetraScience’s materialized ontological views (Gold layer) or even create their own. The transformation logic behind these Gold layers—TetraFlow pipelines—is available to the wider TetraScience ecosystem.

We have made substantial investments in customer training and enablement. We also plan to expand our team of Sciborgs within customer organizations and build a rigorous review process for community contributions. This approach ensures strong, long-lasting support for our community.

Allotrope Simple Model (ASM)

For organizations interested in the Allotrope Simple Model (ASM), TetraScience provides a flexible, two-step pathway using IDS. This allows customers to adopt either the IDS or ASM format and even pivot between them depending on the use case. Since Allotrope recently incorporated JSON schema into ASM, TetraScience can efficiently map IDS transformations to ASM, facilitating compatibility.

Here’s a bit of background.

TetraScience pioneered the concept of the Intermediate Data Schema as a stepping stone toward the Allotrope model, customers' internal data standards, or customized data views—hence the term “intermediate.” While we initially aimed to adopt Allotrope Data Format (ADF) standards, such as HDF5 or RDF, two main challenges prevented this:

  • Format complexity: We advocated for simplifications, but these efforts did not gain traction.
  • Vendor adoption: Instrument manufacturers generally lack incentives to support data standardization, as their portfolios rely on closed ecosystems that are difficult to integrate. Many vendors maintain proprietary formats to differentiate their software offerings and ensure customer lock-in.

While the Allotrope Data Model organizes data by instrument type or lab technique—a useful approach for harmonizing data across vendors—it can lead to data loss or over-harmonization. For instance, chromatography systems from Agilent, ThermoFisher, Shimadzu, and Waters each contain unique data fields, making strict standardization impractical.

In contrast, IDS is designed to capture all relevant information from each endpoint faithfully, accommodating instrument-specific details. Using common schema components, we achieve a similar level of harmonization while retaining flexibility.

Over the past few years, Allotrope has incorporated JSON schema into the ASM, allowing TetraScience to programmatically map IDS to ASM, simplifying the transformation.
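Conceptually, that mapping can be as simple as a schema-driven, field-by-field transformation between two JSON shapes. The sketch below is purely illustrative; neither the IDS nor the ASM field names shown are the real specifications.

```python
# Illustrative only: a field-level transform from an IDS-shaped document to an ASM-like
# structure. Both are JSON, which is what makes a programmatic mapping straightforward.
def ids_to_asm(ids_doc: dict) -> dict:
    """Sketch of an IDS -> ASM transformation; all field names here are hypothetical."""
    return {
        "device system document": {
            "device identifier": ids_doc["system"]["model"],
            "vendor": ids_doc["system"]["vendor"],
        },
        "measurement documents": [
            {"sample identifier": r["sample_id"], "value": r["value"], "unit": r["unit"]}
            for r in ids_doc["results"]
        ],
    }

example_ids = {
    "system": {"vendor": "ExampleVendor", "model": "ExampleModel"},
    "results": [{"sample_id": "SAMPLE-001", "value": 0.52, "unit": "AU"}],
}
print(ids_to_asm(example_ids))
```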

As a result, we recommend a two-step approach to ASM: first, engineer raw data into IDS, then transform IDS into ASM. This approach offers several key benefits:

  • Scalable infrastructure: Access the world’s largest, fastest-growing, purpose-built library of components for data replatforming and engineering.
  • Advanced querying capabilities: Query ASM data within the lakehouse architecture using analytics and AI compute engines, such as Snowflake or Databricks.
  • High flexibility: Choose IDS or ASM for downstream applications based on use-case requirements.

Conclusion

Engineering scientific data at an enterprise level is a complex and ongoing journey—one that TetraScience is committed to leading. Through our unique approach to schema, taxonomy, and ontology development, we empower biopharma companies to fully utilize their scientific data for analytics and AI.

Ready to learn more about the Tetra IDS? Read this detailed blog post.