
A complete toolchain for creating AI-ready scientific data at enterprise scale

June 25, 2024

Spin Wang
Co-Founder and CTO

Scientific data is one of the most devilishly complex categories of data. It presents daunting challenges to life sciences companies that want to collect, organize, and convert their data into valuable insights through AI-driven use cases. One big reason is that scientific data is locked in tens of millions of silos and proprietary data formats. 

Ending the silo nightmare is why TetraScience explicitly chose to build a scientific data platform that’s vendor-neutral, endpoint-agnostic, and data-centric. Our entire business model treats data as a product to be liberated and used freely by our customers to build valuable analytics- and AI-powered use cases. 

Guided by the scientific value we create for customers, TetraScience has steadily invested more than any other company in building a complete toolchain for managing scientific data. This toolchain covers all activities related to your scientific data, including ingestion, processing, harmonization, and analytics. We have amassed the world’s largest, fastest-growing, purpose-built library of software components for unlocking value from scientific data. The Tetra library includes data integration components, such as agents and connectors, and data engineering components, including composable schemas and data apps for scientists to perform analyses. 

I want to share a little about how this toolchain gives biopharmaceutical organizations the flexibility and extensibility they need to assemble and engineer scientific data for analytics and AI. I'll also explain why we're now ready to engage members of our customer community with dual expertise in data and science (we call these rare talents Sciborgs) to contribute to the Tetra library we're developing with this toolchain.

Please note that such a toolchain is explicitly designed for organizations that are:

  1. Currently facing, or expecting to face, significant challenges with their scientific data on their digital and AI journey.
  2. Serious about establishing a sustainable and scalable enterprise data foundation for an increasing number of scientific use cases instead of a one-off solution focused on single application endpoints for single datasets.

Not every organization needs this level of sophistication, flexibility, and extensibility today, but most will in the future.

The scientific data toolchain

Here are the critical frameworks included in the toolchain. Used together, they provide all the flexibility and extensibility an organization needs to achieve end-to-end data management and to launch analytics solutions for its data/AI and scientific IT teams.

| Data journey stage | Tetra framework | Customer benefits |
|---|---|---|
| Data integration (ingestion or publishing to third parties) | Self-service connector (aka Pluggable Connector) | Allows TetraScience and customers to deliver integrations as dockerized containers with built-in state management, resilience, logging, error handling, and more. |
| Data processing | Self-service pipeline (SSP) | Allows TetraScience and customers to create flexible, customizable data processing logic for data transformation, calculation, validation, harmonization, and other everyday data processing activities. |
| Data harmonization | Self-service IDS (SS-IDS) | Allows customers to create an IDS (Intermediate Data Schema) for their specialized datasets. Can be built on the Tetra Data Schema Library to reuse the standard schema components TetraScience uses internally. |
| Data quality | Self-service data app (SS-DA) in the Tetra Data and AI Workspace | Allows customers to rapidly iterate on and release their Python-based data apps and to introduce a scientist-in-the-loop approach for quality checks, data review, annotation, and more. |
| Data analytics | SS-DA in the Tetra Data and AI Workspace (e.g., FPLC Data Explorer) | Allows customers to rapidly iterate on and release Python-based data apps that visualize their data and power analytics dashboards, with the same scientist-in-the-loop approach. |
| Data federation | Lakehouse | Adoption of the Delta Lake architecture allows customers' scientific data to be shared with an enterprise data mesh that also contains non-scientific data. |
| Data governance | Role-based access control (RBAC) and attribute-based access control (ABAC) | Combined with data contextualization, allows data, AI, and scientific IT teams to segment data access based on data attributes. |
| Templates and best practices | TetraConnect Hub | A vital resource where scientists, data scientists, data engineers, and IT specialists engage in community-wide conversations about harnessing the power of their scientific data and share best practices for creating value with the Tetra Scientific Data and AI Cloud™. |
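
To make the self-service pipeline and IDS frameworks more concrete, here is a minimal, hypothetical sketch of what a harmonization step could look like in Python. The field names, record shape, and function are illustrative assumptions for this post only; they are not the actual TetraScience SDK or Tetra Data Schema definitions.

```python
# Hypothetical sketch: harmonize one raw instrument reading into a
# schema-conformant record. Field names and validation rules here are
# assumptions for illustration, not the real Tetra Data Schema or SDK.
from datetime import datetime, timezone
from typing import Any

REQUIRED_FIELDS = ("sample_id", "instrument", "value", "unit")


def harmonize_reading(raw: dict[str, Any]) -> dict[str, Any]:
    """Map a raw instrument reading into a harmonized, IDS-like record."""
    record = {
        "sample_id": str(raw["sampleId"]),
        "instrument": raw.get("instrumentName", "unknown"),
        "value": float(raw["result"]),
        "unit": raw.get("units", "mAU"),
        "measured_at": raw.get(
            "timestamp", datetime.now(timezone.utc).isoformat()
        ),
    }
    # Fail fast so malformed records never reach downstream analytics.
    missing = [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]
    if missing:
        raise ValueError(f"Record is missing required fields: {missing}")
    return record


if __name__ == "__main__":
    raw_reading = {"sampleId": "S-0042", "instrumentName": "FPLC-01", "result": "12.7"}
    print(harmonize_reading(raw_reading))
```

In practice, a self-service pipeline would apply a step like this to every file a connector ingests, validate the output against the corresponding IDS, and land the harmonized records in the lakehouse for analytics and AI.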

Ready for Sciborgs to engage

Until now, TetraScience has been the primary driver in building the Tetra library. Using our toolchain, we have created thousands of data replatforming components, schemas, apps, and documentation artifacts.

Customers have previously expressed interest in contributing to, accelerating, and extending the Tetra library of components. Although we appreciated their motivation and alignment with our goals, we had to decline those requests. We recognized the need for a ramp-up period during which we took full responsibility for the library's quality and ground through the challenges of building the necessary tooling and infrastructure. Essentially, we understood that if we couldn't do this ourselves, a community wouldn't be able to contribute meaningfully and systematically.

We have passed the initial ramp-up phase, and our platform and library are now ready for deeper collaboration with our customers. We are now engaging select Sciborgs (experts in science, data, and technology) from the community because we have reached several key maturity milestones:

  1. Testing and monitoring framework: TetraScience has invested heavily in infrastructure and tooling over the past few years. This includes a comprehensive testing procedure that covers data integrity checks, data mapping tests (field-level comparison), system end-to-end checks, live instrument testing, performance and longevity testing, upgrades and backward compatibility checks, horizontal scaling tests for the agent, error handling and monitoring checks, backup and recovery checks, and functional checks for all features.
  2. Launching the Tetra Data and AI Workspace: The workspace lets our customers create, preview, and release data apps quickly and easily. Customers have also used our TetraScience SDK extensively to build data pipelines. TetraScience will officially open self-service data apps to customers in 2025.
  3. Emergence of internal best practices: Based on years of learning and implementation, we have accumulated a critical mass of best practices. These include optimizing the horizontal scaling of data acquisition from enterprise chromatography data systems, ready-to-use ELN/LIMS schemas for standard assays, and code templates for data mapping (a simplified sketch of such a template follows this list).
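
As a taste of what those data mapping templates cover, here is a simplified, hypothetical sketch of a declarative mapping applied to one row of chromatography output. The template structure, column names, and helper function are assumptions for illustration; they are not TetraScience's internal template format.

```python
# Hypothetical sketch: a declarative data-mapping template. Each entry maps
# a target schema field to a (source column, type converter) pair, and a
# small helper applies the template to one source row. The structure and
# names are illustrative assumptions, not TetraScience's internal format.
from typing import Any, Callable

FPLC_MAPPING: dict[str, tuple[str, Callable[[Any], Any]]] = {
    "sample_id": ("Sample ID", str),
    "column_volume_ml": ("CV (mL)", float),
    "peak_area": ("Area", float),
}


def apply_mapping(
    row: dict[str, Any],
    mapping: dict[str, tuple[str, Callable[[Any], Any]]],
) -> dict[str, Any]:
    """Apply a mapping template to one source row, converting types as it goes."""
    return {
        target: convert(row[source])
        for target, (source, convert) in mapping.items()
        if source in row
    }


if __name__ == "__main__":
    source_row = {"Sample ID": "S-0042", "CV (mL)": "1.0", "Area": "356.2"}
    print(apply_mapping(source_row, FPLC_MAPPING))
```

Keeping the mapping declarative makes templates like this easy to review, reuse across similar instruments, and extend without touching the engine that applies them.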

Interested in contributing? Email your TetraScience point of contact to get started. Our team will provide you with deeper access to our library of data apps, connectors, engineering scripts, harmonization schemas, and best practices guides.