Blog

From warehouse to waterfront: Why scientific data needs the lakehouse lifestyle

July 10, 2024

Vinay Joshi

The data problem

The biopharma industry relies heavily on data to drive innovations in drug discovery and development. The 4 Vs of big data summarize the characteristics of incoming data and the requirements for effective analytics and AI/ML:

  • Volume: The large amounts of data required for AI/ML necessitate scalable storage and powerful processing capabilities. For example, many microscopy applications create large volumes of imagery. 
  • Velocity: Data can be produced at a rapid pace, requiring systems that can ingest, analyze, and act on it quickly. For instance, handling information related to batch characteristics in product supply chains demands swift data processing.
  • Variety: Scientific workflows encompass diverse data types, including structured data from databases, semi-structured data like JSON and XML, and unstructured data such as text, images, and documents. For example, bioprocessing involves data from multiple sources, including bioreactors, chromatography data systems, plate readers, electronic lab notebooks, and laboratory information management systems.
  • Veracity: For AI/ML and trusted analytics, it is crucial that data is clean, harmonized, accurate, reliable, and of verifiable quality. For instance, ensuring data integrity is essential in quality control (QC) processes.

Limitations of warehouse and data lake architectures

To deal with the 4 Vs, data teams typically combine a data warehouse for working with structured data and a data lake for storing and using semi-structured and unstructured data. This model, however, has serious limitations.

Data duplication and silos

In this model, data is inherently siloed by type: structured data lives in the warehouse, while semi-structured and unstructured data live in the lake. Often, some attributes from the semi-structured JSON data in the data lake need to be made available in the data warehouse, and some columns from the data warehouse need to be accessible for analytics on the data lake. Moving and copying data between these silos is complex and time-consuming, leading to inconsistencies and delays in data availability. This fragmentation leaves the organization without a single source of truth for all its data.

Dataset discovery 

Since the data is scattered across the lake and warehouse, discovering schemas and data is difficult. This further compounds the duplication problem, with each sub-organization taking the easy way out and creating its own “customized copy” of the original data that can’t be kept in sync. 

Cost and complexity

Traditional high-performance data warehouses can be extremely expensive. Maintaining both a data warehouse and a data lake requires significant administrative effort and resources, increasing the complexity and cost of the data infrastructure.

Data governance and security

While data warehouses have robust governance and security infrastructures, these governance rules do not extend to the data lake. This discrepancy makes it difficult to enforce unified policies across these silos, leading to security and compliance nightmares. 

Performance and scalability

Traditional data warehouses struggle to scale as data volumes reach petabyte levels and beyond. Even with distributed query engines, data lakes face performance challenges due to the variety of data formats, such as JSON, CSV, and XML. These formats are not optimized for on-disk, network, and in-memory use and querying.

Evolving shape of data

As the structure of incoming data changes, propagating these changes to the traditional data warehouse is often difficult or impossible, especially when downstream analytics applications and business intelligence (BI) tools depend on fixed database table structures. In short, schema evolution is virtually impossible, and simultaneously combining, aggregating, and processing data from multiple schemas and datasets is an uphill task. 

Data sharing and collaboration

In a world where collaboration is essential within and outside organizations, how do you share and collaborate on data residing within a combination of a warehouse and a lake? It becomes a cumbersome task that requires many workarounds.

The list goes on

This combination also suffers from a lack of data lineage tracking, no ACID (atomicity, consistency, isolation, durability) guarantees on the data lake, heavy dependence on cloud-native and proprietary formats, data versioning issues, schema validation on the data lake only at read time, and, in general, delayed, inaccurate, and untrustworthy insights.


Guiding questions

Let's explore some questions to envision a solution to the warehouse and data lake dilemma: 

  • What if we had one unified storage format to store and access structured, unstructured, and semi-structured data?
  • What if this storage format was efficient, performant for querying, contained schema and metadata about the underlying data, supported evolving schemas and shapes of data, had ACID capabilities like traditional databases, and did not have vendor lock-in?
  • What if there was a data lineage mechanism that could span tables and files, pinpointing how data was transformed to arrive at a particular interesting attribute value in a BI report?
  • What if these datasets were discoverable just like standard database tables and views, with a single unified governance mechanism across tables, cloud storage, and files?
  • What if high-performance compute and SQL engines of the user’s choice could be utilized to process, query, and gain insights into all this data?

The lakehouse architecture

The lakehouse architecture promises positive answers to all of these questions. The pivotal components of a lakehouse architecture include:

  • Open data and storage format: The data should live in a binary storage format that is query-efficient, supports schema evolution, provides ACID guarantees, is inexpensive to store, and is open source.
  • Data catalog: A data catalog provides lineage, security, governance, and stewardship across cloud storage, cloud artifacts, and tables.

The task of providing clean, organized, and high-quality data for various use cases is achieved through a layered data processing architecture. It transforms raw data (bronze) into refined, cleaned, and harmonized data (silver), and then further into aggregated datasets (gold) as needed. This processing architecture is referred to as the “Medallion” architecture. Thus,

Lakehouse Architecture = Open Storage Format (Delta, Iceberg, Hudi) + Catalog + Medallion Architecture
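
To make the Medallion flow concrete, here is a minimal PySpark sketch, assuming a Spark session configured for Delta Lake via the delta-spark package; the storage paths, table names, and columns are hypothetical illustrations, not Tetra Data schemas.

```python
# Minimal Medallion-style flow with PySpark and Delta Lake (delta-spark).
# Storage paths, table names, and column names are hypothetical illustrations.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder.appName("medallion-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Bronze: land raw instrument output (e.g., JSON) as-is in a Delta table.
raw = spark.read.json("s3://bucket/raw/plate-reader/")
raw.write.format("delta").mode("append").save("s3://bucket/bronze/plate_reader")

# Silver: clean and harmonize (drop malformed rows, normalize types).
bronze = spark.read.format("delta").load("s3://bucket/bronze/plate_reader")
silver = (
    bronze.dropna(subset=["sample_id", "absorbance"])
          .withColumn("absorbance", F.col("absorbance").cast("double"))
)
silver.write.format("delta").mode("overwrite").save("s3://bucket/silver/plate_reader")

# Gold: aggregate for analytics (mean absorbance per sample).
gold = silver.groupBy("sample_id").agg(F.avg("absorbance").alias("mean_absorbance"))
gold.write.format("delta").mode("overwrite").save("s3://bucket/gold/plate_reader_summary")
```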

Lakehouse with the Tetra Scientific Data and AI Cloud

At TetraScience, we have embarked on enabling this next-generation lakehouse architecture. We have adopted the ubiquitous Delta format as our open table storage format of choice. Embracing the paradigm of openness, our platform will provide the components needed for external and third-party query engines to use Delta tables residing within the platform.

We have also created “Tetra Flow”, a configuration-based data transformation engine. Tetra Flow allows arbitrary transformation of one or more source datasets into one or more target schemas using the familiar SQL language. It promises extensibility, enabling you to customize transformation steps or write custom sources, processing steps, or target schemas when needed.
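
Tetra Flow's actual configuration format isn't reproduced here; purely as a generic illustration of the configuration-driven idea, a single SQL transformation step could be described and executed roughly like the following sketch (the keys, table names, and SQL are hypothetical).

```python
# Generic illustration of a configuration-driven SQL transformation step.
# This is NOT Tetra Flow's actual configuration format; the keys, table
# names, and SQL below are hypothetical.
from pyspark.sql import SparkSession

step = {
    "sources": ["silver.plate_reader", "silver.sample_registry"],
    "sql": """
        SELECT r.sample_id,
               s.project_code,
               AVG(r.absorbance) AS mean_absorbance
        FROM silver.plate_reader r
        JOIN silver.sample_registry s
          ON r.sample_id = s.sample_id
        GROUP BY r.sample_id, s.project_code
    """,
    "target": "gold.absorbance_by_project",
}

def run_step(spark: SparkSession, cfg: dict) -> None:
    """Run one declarative step: SQL in, Delta table out.

    Assumes the source tables and the target database are already
    registered in the catalog.
    """
    result = spark.sql(cfg["sql"])
    result.write.format("delta").mode("overwrite").saveAsTable(cfg["target"])
```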


Customer benefits with lakehouse and Tetra Flow

Lightning-fast query performance

Use your preferred SQL query engine, whether it's Athena, Redshift, Snowflake, or Databricks. Delta tables organize the underlying data for fast, efficient querying.
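
As one illustration of this engine flexibility, a Delta table can also be read entirely outside of Spark, for example with the open-source deltalake (delta-rs) Python package; the path below is a hypothetical placeholder.

```python
# Reading a Delta table outside of Spark with the open-source `deltalake`
# (delta-rs) Python package. The path is a hypothetical placeholder and
# assumes cloud credentials are available in the environment.
from deltalake import DeltaTable

dt = DeltaTable("s3://bucket/gold/plate_reader_summary")

# Full table as a pandas DataFrame.
df = dt.to_pandas()
print(df.head())

# Column pruning keeps reads light on wide tables.
subset = dt.to_pandas(columns=["sample_id", "mean_absorbance"])
```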

Data collaboration and sharing

Delta tables can be shared using the open-source, zero-copy Delta Sharing protocol, allowing secure data sharing within and across organizations. Many BI tools, like Tableau and Power BI, already support Delta Sharing. Popular programming languages such as Python, Java, and Scala have APIs to use data shared via Delta Sharing.
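
Here is a minimal sketch using the open-source delta-sharing Python connector, assuming a provider-issued profile file; the share, schema, and table names are hypothetical.

```python
# Consuming a shared Delta table with the open-source `delta-sharing`
# Python connector. The profile file and share/schema/table names are
# hypothetical placeholders.
import delta_sharing

# A profile file, issued by the data provider, holds the sharing endpoint
# URL and a bearer token.
profile = "config.share"
table_url = f"{profile}#my_share.bioprocess.batch_records"

# Load the shared table directly into pandas, without copying it into a
# local warehouse first.
df = delta_sharing.load_as_pandas(table_url)
print(df.shape)
```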

Create custom schemas, ontologies, and taxonomies

Tetra Flow allows pipelines to read multiple data sources (now stored as Delta tables), run a series of transformations including SQL joins, filters, and aggregations, and then save the output into one or more target datasets. This capability, combined with Delta Sharing, enables the joining of internal and external datasets to create custom ontologies and taxonomies.
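
A rough sketch of that combination, reusing the hypothetical names from the previous examples: read an internal Delta table, load a partner dataset via Delta Sharing, and join them into a single analysis-ready view.

```python
# Sketch: combine an internal Delta table with an externally shared dataset
# into one analysis-ready view. All paths and names are hypothetical.
import delta_sharing
from deltalake import DeltaTable

internal = DeltaTable("s3://bucket/gold/plate_reader_summary").to_pandas()
external = delta_sharing.load_as_pandas("config.share#partner.bioprocess.batch_records")

# Link internal assay results to external batch metadata, e.g., as the basis
# for a custom ontology or taxonomy.
combined = internal.merge(external, on="sample_id", how="left")
```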

Schema evolution support

No more versioned tables. Delta tables support schema evolution and modification of table column structure (following some rules, of course) without having to create multiple versions of tables. Every time an attribute is added to instrument data, a new column will be created in the existing schema and table without the need to create yet another set of versioned tables.
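
A minimal sketch of what this looks like with Delta Lake's mergeSchema write option, assuming a Spark session already configured for Delta (as in the earlier sketch); the paths and columns are hypothetical.

```python
# Appending a batch that carries a new attribute to an existing Delta table.
# With mergeSchema enabled, Delta adds the new column to the table's schema
# instead of forcing a new versioned table. Assumes a Spark session already
# configured for Delta Lake; paths and columns are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-evolution-sketch").getOrCreate()

# The latest batch now includes an extra "temperature" column.
new_batch = spark.read.json("s3://bucket/raw/plate-reader/2024-07/")

(new_batch.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")   # evolve the table schema on write
    .save("s3://bucket/bronze/plate_reader"))
```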

Reduced storage costs

Tetra Data will be stored in Delta tables. The underlying storage format is Parquet, which is binary, compressed, columnar, and open source. As we implement this architecture, Tetra Data (IDS JSON) will be transformed directly into Delta tables so that customers benefit sooner.

Easy schema discovery and data retrieval

No more searching for isolated JSON files or data. All Tetra Data and any custom-created schemas will be registered into a catalog, visible in the familiar setting of databases and tables that analysts, scientists, and engineers are accustomed to.

Governance, security, and lineage

A lakehouse-compliant catalog will unlock unified governance and security mechanisms across all data and track the lineage of data as it transitions from raw to harmonized to aggregated datasets.

Artificial intelligence and machine learning

Tetra Data will now be clean, include metadata, and be stored in Delta tables. This harmonized, feature-rich data is an ideal substrate for unleashing the full power of AI/ML. With Tetra Data, users can reclaim 70 percent of their time previously spent on data preparation for AI/ML and focus more on developing models.

Large language model (LLM) ready

The intent is to capture as much contextual information about the data as possible and store it in Delta tables. This context is a prerequisite for building meaningful LLM applications and minimizing hallucinations.

Conclusion

The lakehouse architecture offers a powerful solution to biopharma's data challenges, providing a unified storage format, robust data catalog, and efficient data processing through the Medallion architecture. At TetraScience, our adoption of Delta tables and the development of Tetra Flow enable seamless data querying, collaboration, and transformation. This approach ensures clean, harmonized data that is ideal for AI/ML applications and simplifies governance and security. By embracing the lakehouse architecture, we empower our customers to unlock new opportunities for innovation and efficiency in drug discovery and development.