Introduction
In 2007, Nokia ruled the mobile phone market as a category leader. The company’s text-messaging, video-enabled flip phones were a worldwide phenomenon. Then Nokia suddenly faced a serious threat: the iPhone. Apple’s service-modeled platform let many application developers share and coexist in an open, digital landscape. Within six years, Apple outsold Nokia five to one, and the iPhone became the fastest-adopted device in history.
We foresee a similar shift in the life sciences. Legacy, silo-prone, endpoint-specific data storage and management systems will be overtaken by an open, vendor-agnostic, cloud-native, data-centric approach.
In this blog, we will compare the traditional scientific data management system (SDMS) to the modern scientific data cloud, and show how a scientific data cloud provides transformational data management by addressing the emerging challenges within many life science organizations.
What is an SDMS?
An SDMS is like a filing cabinet. It captures, catalogs, and archives all versions of each data file from scientific instruments (such as high-performance liquid chromatography systems, mass spectrometers, flow cytometers, sequencers, etc.) and scientific applications (such as LIMS, ELNs, analysis software, and AI applications). An SDMS can provide access to data, facilitate compliance with regulatory requirements, and enable simple workflows around data management.
Essentially, an SDMS is a file server and data store. There is no question that SDMS solutions are an improvement over the ’90s-era practice of maintaining quality documents as paper reports. Unfortunately, for biopharma organizations that are trying to enter the digital and AI world, SDMS solutions often become data graveyards. Once data is tossed into the SDMS, it is difficult to reuse and quickly forgotten. According to extensive customer interviews we’ve conducted with biopharma organizations, the primary reason SDMS solutions are kept online is compliance.
What is a scientific data cloud?
A scientific data cloud is a modern, end-to-end data solution that unlocks the value of scientific data. It facilitates the bidirectional flow of scientific data between scientific instruments, collaborators, informatics applications, analytical applications, and AI applications.
Every life science organization's scientific data will inevitably go through a data maturation journey. A scientific data cloud’s job is to facilitate the assembly of scientific data and move that data up the pyramid toward maturation. To do so, it must provide a full-stack data experience by collecting data from disparate sources, engineering the data into a harmonized, vendor-agnostic format, contextualizing data, and enabling data for analytics and AI/ML.
Your journey with scientific data - Unlock value at each level
SDMS solutions, by comparison, are traditionally limited to simple data integrations and storage, which confines them to layer 1 of this pyramid. As a result, SDMS solutions become dead ends, preventing the seamless movement of data “up the pyramid.” Additionally, contextualization through metadata extraction is often limited or absent within an SDMS, making data hard to search and retrieve.
The scientific data cloud embraces design and architectural principles that differ fundamentally from those of an SDMS.
First, a scientific data cloud is purpose-built for scientific data, scientific workflows, and use cases such as analytics and AI. For example, a scientific data cloud:
- Supplies a means to integrate with the entire ecosystem of scientific instruments
- Possesses data engineering components (schema/taxonomy and transformations) that are tailored to scientific data and function within science-based informatics applications, analytical applications, and AI applications
- Provides GxP capabilities so life science organizations can leverage their data within validated workflows
A scientific data cloud must be purpose built for facilitating the data journey, not simply the assembly of horizontal technical components. It must be tailor-made for scientific data. Moreover, a scientific data cloud is more than just software. It should incorporate the expertise of data and science specialists who are biased toward achieving business outcomes.
Second, a scientific data cloud is vendor-agnostic. It is not attached to, or owned by, any endpoint system. It remains neutral to avoid putting constraints on the customer’s choice of data sources and data targets. By comparison, all the SDMS providers are either instrument manufacturers or informatics application providers. This places extreme structural limitations on an SDMS solution’s ability to integrate scientific data from an entire ecosystem, limiting true data liquidity.
Third, a scientific data cloud converts raw scientific data to analytics- and AI-ready data. It delivers this engineered scientific data to different applications, such as data applications and AI/ML tools, and enables multi-modal data consumption. Only data that is purposefully harmonized and contextualized for scientific workflows and applications can be compared and used for advanced analytics and AI. Data that doesn’t have these characteristics will deliver substandard results and outcomes.
Lastly, a scientific data cloud is designed to facilitate and enable cross-organization collaboration. Innovation in today’s life science environments depends on the seamless flow of data between and among scientific teams, biopharmaceutical companies, CxOs, and the larger scientific ecosystem.
The advantages of a scientific data cloud
Since all biopharma scientific data inevitably goes through the four layers of the scientific data journey, let’s explore the advantages provided by a scientific data cloud over a traditional SDMS in each of the four layers.
Layer 1: Flexible data integration and processing
SDMS solutions are typically designed to store files exported from instrument control or processing software. However, labs are full of data sources that do not produce actual files, including:
- Complex instrument control or data systems (such as CDS solutions, historians, etc.): These require programmatic integration with the endpoint’s SDK or API.
- LIMS/ELN and informatics applications: These require programmatic integration as well so users can access important information such as experiment design, sample descriptions, and test requests.
The challenge is that heterogeneous scientific data needs to be integrated if life science organizations want to achieve highly customized, automated data flows or build toward AI capabilities. To address this challenge, a scientific data cloud provides an adaptable, configurable data integration framework via modularized data pipelines and connector frameworks, which allows all scientific data to be replatformed from its individual silos.
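To make the idea of a connector framework concrete, here is a minimal sketch in Python. It is an illustration only, not the API of any particular product: the `Connector`, `FileConnector`, `ApiConnector`, and `run_pipeline` names are hypothetical, and the API call is replaced by a stand-in function. The point it shows is that file-based and API-based sources can sit behind one interface, so a pipeline does not care where a record came from.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List, Protocol


class Connector(Protocol):
    """Anything that can yield records from one data source (hypothetical interface)."""

    def fetch(self) -> Iterable[dict]: ...


@dataclass
class FileConnector:
    """Wraps files exported by instrument software (here: in-memory text)."""

    files: Dict[str, str]  # assumption: file name -> file content

    def fetch(self) -> Iterator[dict]:
        for name, content in self.files.items():
            yield {"source": "file", "name": name, "payload": content}


@dataclass
class ApiConnector:
    """Wraps a source with no file export, such as a CDS or LIMS reached via SDK/API."""

    client: Callable[[], List[dict]]  # assumption: stand-in for a real SDK/API call

    def fetch(self) -> Iterator[dict]:
        for row in self.client():
            yield {"source": "api", "name": row["id"], "payload": row}


def run_pipeline(connectors: Iterable[Connector]) -> List[dict]:
    """Pull every source through the same entry point into one collection."""
    collected: List[dict] = []
    for connector in connectors:
        collected.extend(connector.fetch())
    return collected


records = run_pipeline([
    FileConnector({"run_001.csv": "time,signal\n0.0,1.2\n"}),
    ApiConnector(lambda: [{"id": "SAMPLE-7", "test": "purity"}]),
])
for record in records:
    print(record["source"], record["name"])
```

Adding a new kind of source in this sketch means writing one more class with a `fetch` method; the pipeline itself stays unchanged.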
Separately, on layer 1, a scientific data cloud provides configurable data processing that enables metadata extraction from the data set, proper contextualization with user-defined terms or tags, and data integrity checks. With these capabilities, metadata can be applied based on the file path, file content, and information stored in third-party applications. This rich metadata in turn enables highly configurable attribute-based access control. Data processing also initiates the submission of the data to downstream applications.
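The following Python sketch illustrates the three processing steps described above under stated assumptions: a made-up directory layout (`<site>/<instrument>/<project>/<file>`) for path-based metadata, a SHA-256 checksum as the integrity check, and a simple attribute-based access rule. All function names and the layout are hypothetical.

```python
import hashlib
from pathlib import PurePosixPath


def extract_path_metadata(path: str) -> dict:
    """Derive tags from an assumed layout: <site>/<instrument>/<project>/<file>."""
    parts = PurePosixPath(path).parts
    if len(parts) != 4:
        raise ValueError(f"unexpected path layout: {path}")
    site, instrument, project, filename = parts
    return {"site": site, "instrument": instrument,
            "project": project, "filename": filename}


def sha256_of(content: bytes) -> str:
    """Checksum recorded at ingest so later copies can be verified against it."""
    return hashlib.sha256(content).hexdigest()


def can_read(user_attrs: dict, file_meta: dict,
             required: tuple = ("site", "project")) -> bool:
    """Attribute-based check: the user must hold the file's value for every required attribute."""
    return all(file_meta[key] in user_attrs.get(key, set()) for key in required)


content = b"time,signal\n0.0,1.2\n"
meta = extract_path_metadata("boston/hplc-03/project-a/run_001.csv")
meta["sha256"] = sha256_of(content)

analyst = {"site": {"boston"}, "project": {"project-a", "project-b"}}
visitor = {"site": {"boston"}, "project": {"project-c"}}

print(meta["instrument"], can_read(analyst, meta), can_read(visitor, meta))
```

Because access is decided from metadata rather than from folder permissions, the same rule keeps working when files are moved or re-ingested, as long as the tags travel with the data.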
A scientific data cloud is adaptable to custom requirements as well. A user can configure data processing based on their own business logic, reprocess the files using another data pipeline, and create their own customized data extraction by creating custom parsers and other scripts (e.g., in Python) in a self-service fashion. They can also merge data from other data sets, dynamically perform quality control or verification of the data sets, and integrate the data into other informatics applications, such as ELN/LIMS applications.
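As an illustration of what a self-service custom parser with a quality-control step might look like, here is a short Python sketch. The export format (`# key: value` header lines followed by a CSV table), the `parse_plate_reader` and `qc_flags` names, and the valid absorbance range are all assumptions made for this example.

```python
import csv
import io


def parse_plate_reader(text: str) -> dict:
    """Parse a hypothetical export: '# key: value' header lines, then a CSV table."""
    header, table_lines = {}, []
    for line in text.splitlines():
        if line.startswith("#"):
            key, _, value = line[1:].partition(":")
            header[key.strip()] = value.strip()
        elif line.strip():
            table_lines.append(line)
    rows = [
        {"well": row["well"], "absorbance": float(row["absorbance"])}
        for row in csv.DictReader(io.StringIO("\n".join(table_lines)))
    ]
    return {"header": header, "rows": rows}


def qc_flags(parsed: dict, low: float = 0.0, high: float = 4.0) -> list:
    """Return wells whose reading falls outside an assumed valid range."""
    return [r["well"] for r in parsed["rows"] if not low <= r["absorbance"] <= high]


raw = """# instrument: reader-12
# operator: jdoe
well,absorbance
A1,0.52
A2,4.70
A3,1.08
"""
parsed = parse_plate_reader(raw)
flagged = qc_flags(parsed)
print(parsed["header"], flagged)
```

In practice the thresholds and header fields would come from the lab's own business logic; the sketch only shows that parsing and verification can live in the same small, user-maintained script.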
In short, a scientific data cloud provides the functionality for data to flow freely, with high liquidity and flexibility, and to make this data findable. This flow and accessibility is just as important as data capture and data storage. It reinvigorates the siloed, stale experimental data, and helps enable fully automated workflows across the value chain.
Simply put, traditional SDMS solutions cannot accommodate the types of sophisticated data flows required by modern life science organizations. As a result, data flows grind to a halt and data becomes static.
Layer 2: Scientific use case driven data engineering
The static nature of many SDMS solutions makes them closed systems. An SDMS can bring files together in one place for archive purposes, but it fails to turn files into actionable data, prepared for basic data science and analytics. These limitations are due to the lack of a critical capability in traditional SDMS solutions: data engineering. As a result, traditional SDMS solutions are often considered “data graveyards.”
Some of the most important characteristics of a scientific data cloud are its ability to convert raw data to an analytics- and AI-ready format, and to provide an interface that enables data applications, analytics, and AI workloads to consume this data. This is achieved by an Intermediate Data Schema (IDS).
A scientific data cloud contains numerous IDS and detailed documentation of the required data transformations
A well-designed IDS captures scientifically meaningful, relevant information from vendor-specific or proprietary file formats produced by a data source. It is designed in collaboration with instrument manufacturers, scientists, and data scientists. This depth of knowledge enables the scientific data cloud to engineer numerous proprietary data formats into harmonized, open, and vendor-agnostic formats, while simultaneously harmonizing taxonomies and ontologies of metadata. This data is inherently FAIR (meeting the principles of being findable, accessible, interoperable, and reusable), whereas most data within a typical SDMS can achieve only limited interoperability and/or reusability.
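The harmonization idea can be sketched in a few lines of Python. This is not an actual IDS definition: the two vendor payloads, their field names, and the `chromatography-peak/v1` schema label are invented for illustration. The sketch shows two differently shaped vendor outputs being mapped to one structure with consistent field names and units, so they can be compared directly.

```python
def from_vendor_a(raw: dict) -> dict:
    """Vendor A (hypothetical) reports retention time in minutes."""
    return {
        "schema": "chromatography-peak/v1",  # assumed schema label
        "sample_id": raw["SampleName"],
        "peaks": [
            {"retention_time_s": p["RT_min"] * 60.0, "area": float(p["Area"])}
            for p in raw["Peaks"]
        ],
    }


def from_vendor_b(raw: dict) -> dict:
    """Vendor B (hypothetical) reports retention time in seconds under other names."""
    return {
        "schema": "chromatography-peak/v1",
        "sample_id": raw["sample"]["id"],
        "peaks": [
            {"retention_time_s": float(p["rt_sec"]), "area": float(p["peak_area"])}
            for p in raw["results"]
        ],
    }


vendor_a_raw = {"SampleName": "S-001", "Peaks": [{"RT_min": 2.5, "Area": 1200}]}
vendor_b_raw = {"sample": {"id": "S-002"},
                "results": [{"rt_sec": 151, "peak_area": "1180.5"}]}

harmonized = [from_vendor_a(vendor_a_raw), from_vendor_b(vendor_b_raw)]

# Once harmonized, records from both vendors can be queried the same way.
for record in harmonized:
    first_peak = record["peaks"][0]
    print(record["sample_id"], first_peak["retention_time_s"], first_peak["area"])
```

Downstream analytics then depend only on the harmonized field names, not on which instrument vendor produced the file.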
By leveraging an open and data science–friendly IDS while cataloging, indexing, and partitioning data to support API-based queries, a scientific data cloud enables biopharma companies to combine data with some of the most popular search and distributed query frameworks. Doing so ensures that the engineered data can be queried, accessed, and consumed by common analytics and AI tools.
The IDS and scientific data cloud are designed to be “data-first,” driven by the underlying scientific workflows rather than limited by the temporary, point-in-time schema of a particular application. The scientific data cloud should extract as much information from the raw data as possible. It should maintain a stable, consistent set of schemas, a clean taxonomy, and use case–oriented ontologies so that analytics and AI can be built on a comprehensive and stable data structure.
An SDMS does not have any of these advanced engineering capabilities and therefore cannot help life science organizations progress up the data pyramid. This limitation is a major roadblock for organizations relying on an SDMS for data management because raw data is not engineered. Thus it is never ready to be queried or analyzed without significant curation and transformation efforts.
Layer 3: Universal access to scientific data and apps
Some SDMS solutions might provide a preview for one individual run or file. However, these visualizations are not optimized for aggregated insights from various data sources/vendors or supporting trending or clustering. The visualizations are file based, not data based.
By contrast, a scientific data cloud provides universal access to all data applications through a common data interface, allowing users to work with the centralized, engineered data that’s been assembled from numerous sources across the organization. Users can select the tool of their choice, including best-in-class tools such as Jupyter Notebook, R Shiny, and Spotfire.
The scientific data cloud also leverages state-of-the-art data lake, lakehouse, warehouse, and governance architecture to support scalable analytics workloads.
Scientific data cloud-powered data apps translate raw data into actionable insights quickly
A scientific data cloud should also provide a flexible data application framework where visualization or analytics applications can be published and shared across the community. Such data applications can include the commercially available analysis software or data applications built by end users based on their unique workflow.
This framework is crucial because of the diversity of scientific data and of the analyses scientists want to perform. Analytics and visualizations cannot be static; they have to support community collaboration, sharing, and rapid iteration.
With traditional SDMS solutions, analyzing data is typically isolated from the SDMS: Scientists need to retrieve the files from the SDMS and then perform their analysis. Analysis results may or may not be entered back into the SDMS since it’s a tedious manual process. The result is lost lineage and context. The scientific data cloud provides a holistic and cloud-native data app infrastructure where lineage and context are tracked automatically.
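One way to picture automatic lineage tracking is a store in which every derived data set records the identifiers of its inputs. The Python sketch below is a simplified illustration under that assumption; the `LineageStore` class, its methods, and the step names are hypothetical and do not describe any specific product's implementation.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class LineageStore:
    """Keeps each data set together with the IDs of the data sets it was derived from."""

    records: Dict[str, dict] = field(default_factory=dict)

    def register(self, payload: dict, parents: Tuple[str, ...] = (),
                 step: str = "ingest") -> str:
        # Content-derived ID, so identical payloads map to the same record.
        body = json.dumps(payload, sort_keys=True).encode()
        record_id = hashlib.sha256(body).hexdigest()[:12]
        self.records[record_id] = {"payload": payload,
                                   "parents": list(parents), "step": step}
        return record_id

    def ancestry(self, record_id: str) -> List[str]:
        """Walk parent links back to the raw inputs, nearest ancestor first."""
        seen: List[str] = []
        queue = list(self.records[record_id]["parents"])
        while queue:
            current = queue.pop(0)
            if current not in seen:
                seen.append(current)
                queue.extend(self.records[current]["parents"])
        return seen


store = LineageStore()
raw_id = store.register({"file": "run_001.raw"})
ids_id = store.register({"peaks": [150.0]}, parents=(raw_id,), step="harmonize")
result_id = store.register({"mean_rt_s": 150.0}, parents=(ids_id,), step="analysis")

for ancestor in store.ancestry(result_id):
    print(ancestor, store.records[ancestor]["step"])
```

Because the analysis result is registered with its parents at the moment it is produced, no manual upload step is needed to preserve the link back to the raw file.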
Layer 4: Enabling scientific AI via a hybrid skill set
Across the biopharma industry, the journey toward the automated lab, high-value engineered data, and analytics- and AI-driven scientific insights is underway, but it remains a work in progress for almost every company. Change management is essential. Establishing the trust of scientists requires a deep understanding of scientific use cases, how data is generated within workflows, and how that data is consumed, as well as an unwavering focus on meaningful scientific and business outcomes.
This challenge is especially obvious in layer 4 of the data pyramid, the AI layer, where cutting-edge technologies are used to drive groundbreaking scientific outcomes. Organizations can be intimidated by unknowns and slowed by inertia. They might leverage AI as a hammer looking for a nail without building the necessary foundation first.
A key characteristic of a scientific data cloud is that it is more than just software. It also provides experts with skills at the intersection of science, data, and business outcomes. Ideally they are embedded within scientific and IT teams within biopharma companies to maximize the impact.
A scientific data cloud also includes a variety of essential tools—such as survey templates for scientists, training courses, and templates for common scientific use cases—to give biopharmas the best chance for success.
Summary
The scientific data journey is immutable: To capitalize on AI and analytics, life science organizations must produce more mature, AI-ready data. Leaders will quickly recognize that a traditional SDMS, the data equivalent of a flip phone, simply doesn’t include the features they need today, and it certainly won’t deliver the features they need in 5 to 10 years. Even if your initial requirement is simply to collect and centralize all your scientific data in one place, locking data into an SDMS means locking it into the first layer of the data pyramid, permanently.
It’s time to reconsider your SDMS strategy and move to an innovative and future-proof scientific data cloud.
By doing so, you can position your company for maximum flexibility, enable impactful AI/ML with data from more connected instruments and devices, and benefit from best-of-breed technologies to fully leverage the data within your organization today and in the future.
Reach out to your TetraScience experts to see how the Tetra Scientific Data Cloud can help your organization accelerate scientific discovery and empower your scientific team with harmonized, AI-enabled data in the cloud.