Life Sciences’ Data Problem, and Why “Do-it-Yourself” Doesn’t Work
Biopharma professionals are on a mission to accelerate discovery and improve human life, exploiting rapidly evolving technologies for analytics, AI/ML, and automation to shorten time to market for new therapeutics. This need has driven a rapid, industry-wide paradigm shift in how scientific data is understood and valued:
- Stakeholders throughout biopharma organizations, from bench and data scientists to R&D IT professionals, manufacturing and compliance specialists, and executives, now recognize that data quantity, quality, validity, and accessibility are major competitive assets
- Beneficiaries beyond bench scientists: Data scientists, tech transfer, external collaborators, procurement, operations, strategic and business management, and many other stakeholders must now be considered data consumers and producers. Their ability to improve scientific and business outcomes depends on being able to access high-quality data easily and to return high-quality data to the groups and systems that originate it
- Meanwhile, organizations are automating and replatforming dataflows to the cloud to enhance access, protect against data loss, leverage elastic compute/storage, and trade capital expenditure for operational expenditure, amongst other benefits
Mastering data changes the game. Biopharma organizations labor to make data work harder, seeking to speed scientific insight, enable new applications, and improve business outcomes. Practical uses for aggregated data surround us: in fundamental science, lab automation, resource management, quality control, and compliance and oversight (for further examples, see our blog about use cases for harmonized data from Waters Empower Data Science Link (EDSL)). Deep learning holds out the promise of discovering new value hidden in data. Applications for data analytics, including AI/ML, range from predictive maintenance on lab and manufacturing equipment to discovering novel ligands within huge small-molecule datasets and predicting their biological effects.
The Snarl of Scientific Data Sources, Targets, and Workflows
Managing data across a whole biopharma organization is, however, a daunting challenge. Life sciences R&D and manufacturing notoriously suffer from fragmentation and data silos — with valuable data produced (and also trapped) in myriad locations, formats, systems, workflows, and organizational domains. How can biopharma begin to cope with this complexity?
We find it helpful to think of each workflow in terms of a minimum quantum of organizational and logical effort. To gain scientific or business benefit, you need to find and move information from where it’s created or resides (a data source) to a system or application that can usefully consume it (a data target).
Some common data sources include:
- Lab instruments and instrument control software
- Informatics applications, like Electronic Lab Notebooks (ELNs) and Lab Information Management Systems (LIMS)
- Contract Research Organizations (CROs) and Contract Development Manufacturing Organizations (CDMOs)
- Sensors and facility monitoring systems
- SaaS systems for general data usage (Egnyte, Gmail, Box)
Data targets, on the other hand, are systems that consume the data to deliver insights, conclusions, or reports. For example:
- Data science-oriented applications and tools, including visualization and analytics tools like Spotfire, and AI/ML tools and platforms such as Streamlit, Amazon SageMaker, H2O.ai, Alteryx, and others
- Lab informatics systems, including LIMS, Manufacturing Execution Systems (MES), and ELNs, as well as instrument control software like Chromatography Data Systems (CDS) and lab robotics automation systems. Note that certain workflows also treat these data targets as data sources
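To make the source-to-target framing above more concrete, here is a minimal, illustrative Python sketch of how one such "quantum" of dataflow might be described. The class, field names, and example flows are hypothetical, chosen only to show the shape of the problem; they do not come from any particular product or standard.

```python
# Illustrative only: describing one source -> target dataflow as a simple record.
from dataclasses import dataclass


@dataclass
class Dataflow:
    source: str      # where the data is created or resides (e.g., a CDS or a CRO)
    target: str      # the system or application that usefully consumes it
    data_type: str   # what is being moved (e.g., chromatogram results)
    trigger: str     # what initiates the move (e.g., "instrument run complete")


# Hypothetical examples of the kinds of source -> target pairs listed above:
flows = [
    Dataflow("Chromatography Data System", "Spotfire dashboard",
             "chromatogram results", "instrument run complete"),
    Dataflow("CRO data package", "ELN", "assay results", "partner data delivery"),
]
```

Every pairing in the lists above implies at least one such flow, and usually several, which is where the complexity described below begins.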
Integrating these sources and targets is seldom simple (we comment on some of the reasons for this complexity below). In many organizations, moving data from sources to targets remains partly or wholly a manual process. As a result:
- Scientists’ and data scientists’ time is wasted on error-prone, manual data collection and transcription, which reduces the time available for analysis and insight
- Meanwhile, pressure to collaborate, distribute research, and speed discovery introduces further challenges in data sharing and validation, often resulting in simplistic procedures that rob data of context and long-term utility, and may compromise data integrity and compliance
Seeking to automate some of these transactions, biopharma IT teams often struggle to build point-to-point integrations connecting data sources and targets, frequently discovering that their work is fragile, inflexible, and difficult to maintain. We’ve written several blogs (e.g., Data Plumbing for the Digital Lab, What is a True Data Integration Anyway?, and How TetraScience Approaches the Challenge of Scaling True Scientific Data Integrations) on the complexities of integration building, inadequacies of point-to-point approaches, and requirements for engineering fully-productized, maintainable integrations.
“Lego” Solution Assembly: More Complexity
This frustrating experience with pure point-to-point integrations leads many biopharma IT organizations to consider a different solution: building a centralized data repository (i.e., a data lake or data warehouse) and using it to mediate connections between data sources and targets. A common approach is to assemble such a solution from the plethora of available open-source and proprietary, industry-agnostic components and services for data collection, storage, transformation, and other functions.
A recent article from Bessemer Ventures describes versions of such a componentized architecture. Note that none of these components are purpose-built for scientific data.
The Problem with “Do-it-Yourself”
Our experience solving these challenges with 20+ global biopharma companies has convinced us that this approach has two major problems:
Problem #1: Organizational Spread
None of the components of a cloud data platform (data lakes/warehouses, integration platforms, batch processing and data pipeline systems, query engines, clustering systems, monitoring and observability tools) are simple or “just work” out of the box. Most are complex: significant configuration effort, experimentation, and best-practices knowledge are required to make each component do its job in context and to make all the components work well together. You also need to develop substantial external tooling (which requires specially trained headcount to create, maintain, and operate) to make updates, scaling, and other lifecycle management operationally efficient and safe in production.
Specialized (but non-strategic) skills required. To execute, you’ll need to assemble teams with specialized skills, each managing perhaps just one vendor, component, or subsystem of the complete solution, plus software architects and project managers to orchestrate their efforts. You'll also need expertise in GxP-compliant software design, data integrity, and security, and a cadre of data engineers to work with scientists, create integrations and workflows, and help prepare stored data for use by automation, analytics, and AI/ML.
You’ll need these teams long term, since you’ll be responsible for evolving, scaling, and maintaining the full solution stack plus a growing number of integrations. While these experts are critical to the timeliness and success of your project, they’re also a cost center: focused on running, integrating, and scaling your platform, but outside the critical path of extracting value from scientific data and helping you do business faster and better.
A data science manager at a global biopharma organization comments, “This organizational spread creates bottlenecks, slows down operations, and in turn, delays data usage. The additional need to ensure that a data ingestion pipeline is GxP-validated further increases this problem — in fact, it might even add an additional organizational unit to the mix!”
Focus on the big picture becomes compromised. Meanwhile, as teams around the project grow, cost and time pressures push for quick delivery of a minimum viable product. Focusing on low-hanging fruit and immediate requirements can easily lead to a partial solution that doesn't scale well, doesn't generalize to many use cases, and may prove unmaintainable.
Two practical use cases illustrate the need for a feature-rich, life sciences data-focused, end-to-end solution:
- In high-throughput screening (HTS) workflows, robotic automation generates a massive amount of data. These data need to be automatically collected, harmonized, labeled, and sent to screening analytics tools in order to configure the robots for the next set of experiments.
- In late-stage development and manufacturing, labs are constantly checking the quality of batches and the performance of their methods. Harmonizing these data enables analytics to compare method parameters, batches, and trends over time, flagging anomalies and potentially yielding important insights into batch quality and system suitability.
In both examples, merely collecting the data, storing it, or transforming it is not enough. To yield benefits, these key operations need to be architected, implemented, tracked, and surfaced holistically, targeting the end-to-end flow of those particular data sets.
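To make the first example more concrete, here is a minimal, illustrative Python sketch of one collect-harmonize-label-publish pass over a single HTS result file. The raw CSV layout, the column names ("Well ID", "RFU"), and the JSON drop-folder handoff are assumptions made only for illustration; a production dataflow would also need scheduling, validation, error handling, and audit trails.

```python
# Illustrative sketch only: one end-to-end pass for a single (hypothetical) HTS result file.
import csv
import json
from datetime import datetime, timezone
from pathlib import Path


def collect(raw_path: Path) -> list[dict]:
    """Read a raw, instrument-specific CSV export from the plate reader."""
    with raw_path.open(newline="") as f:
        return list(csv.DictReader(f))


def harmonize(rows: list[dict]) -> list[dict]:
    """Map vendor-specific column names onto a common, analysis-friendly schema."""
    return [
        {
            "well": row["Well ID"],       # vendor-specific column -> common field (assumed name)
            "signal": float(row["RFU"]),  # normalize the numeric type
        }
        for row in rows
    ]


def label(records: list[dict], run_id: str, instrument: str) -> dict:
    """Attach experiment context so downstream screening analytics can interpret the data."""
    return {
        "run_id": run_id,
        "instrument": instrument,
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "records": records,
    }


def publish(payload: dict, out_dir: Path) -> Path:
    """Hand off to the analytics tool; here simply a JSON file in a shared drop folder."""
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{payload['run_id']}.json"
    out_path.write_text(json.dumps(payload, indent=2))
    return out_path
```

The point is not the specific code: it's that collection, harmonization, labeling, and publishing have to be designed as one flow, not as disconnected utilities.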
Data is stripped of context, limiting utility. While data collected without context may be meaningful to scientists who recognize the file, such data will be useless for search/query, post-facto analytics, and data science at scale. Metadata providing scientific context, instrument details, environmental state, and other information must be captured and added soon after data is ingested (a minimal sketch of such ingest-time enrichment appears below). If this doesn't happen:
- It can be difficult to appropriately enrich (or sometimes, even parse) vendor-specific or vendor-proprietary data
- Data integrity issues — common for experimental data and when working with external partners — may be missed
- A significant fraction of total data cannot be used easily by data scientists because it lacks fundamental information about how it was created and what it means
For more detail, see Executive Conversations: Evolving R&D with Siping “Spin” Wang, President and CTO of TetraScience | Amazon Web Services and Move Beyond Data Management.
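As an illustration of what ingest-time enrichment might look like, here is a minimal Python sketch that wraps a raw instrument file with contextual metadata at the moment it is collected. The field names, context values, and checksum choice are assumptions for illustration; real systems would draw this context from instruments, ELN/LIMS entries, and scientists themselves, and would enforce it through validation rather than convention.

```python
# Illustrative only: attaching scientific context to a raw file at ingestion time.
# All field names and context values are hypothetical.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def enrich_at_ingestion(raw_file: Path, context: dict) -> dict:
    """Wrap a raw instrument file with the metadata that keeps it searchable and usable later."""
    data = raw_file.read_bytes()
    return {
        "file_name": raw_file.name,
        "sha256": hashlib.sha256(data).hexdigest(),  # supports later integrity checks
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        # Scientific and operational context, supplied by the scientist, the
        # instrument scheduler, or an upstream system such as an ELN or LIMS:
        "instrument_id": context.get("instrument_id"),
        "method": context.get("method"),
        "sample_id": context.get("sample_id"),
        "project": context.get("project"),
        "operator": context.get("operator"),
    }


if __name__ == "__main__":
    # Stand-in for a real instrument export, created here so the sketch runs end to end.
    dummy = Path("example_result.raw")
    dummy.write_bytes(b"raw instrument bytes")
    record = enrich_at_ingestion(
        dummy,
        {"instrument_id": "HPLC-07", "method": "assay-v3", "sample_id": "S-1234",
         "project": "stability-study", "operator": "j.doe"},
    )
    print(json.dumps(record, indent=2))
```

Without this kind of context captured up front, the failure modes in the list above follow almost automatically.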
Problem #2: Impedance Mismatch
The “impedance mismatch” between industry-agnostic, “horizontal” data solutions and biopharma workflows is amplified by the complexity of the life sciences domain.
Scientific workflows are complex. A single small-molecule or biologics workflow can comprise dozens of sequential phases, each with many iterative steps that consume and produce data. As workflows proceed, they fork, duplicate, and may transition among multiple organizations with different researchers, instruments, and protocols.
Biopharma has myriad instruments and software systems per user, producing and consuming complex, diverse, and often proprietary file and data types. Distributed research (e.g., collaboration with CROs and CDMOs) adds new sources, formats, standards, and validation requirements to every workflow. This additional complexity leaves research data locked within a huge number of systems and formats, each of which must be understood in order to validate, enhance, consume, or reuse the data: a daunting task, to say the least.
Building effective integrations is extremely difficult. If a life sciences organization builds its own data platform using horizontal components, such as Mulesoft, Pentaho, Boomi, Databricks, or Snowflake, it inevitably also needs to build and maintain all the integrations required to pull data from or push data to instruments, informatics applications, CROs/CDMOs, and other data sources and targets. This last-mile integration challenge is a never-ending exercise in which the work of creating and maintaining fully serviceable integrations exceeds the capacity of biopharma IT organizations and distracts from other, more strategic and scientifically important work. For a closer look at technical and organizational requirements for engineering life sciences integrations, see our blog: What is a True Data Integration, Anyway?
Two strategies are often considered for managing integration development and maintenance workload:
- Outsourcing to consulting companies as professional services projects. Integrations produced this way typically take a long time to build, and almost invariably become one-off solutions that require significant ongoing investment to maintain.
- Handing off to vendors of an important data source/target (e.g., a LIMS or ELN) as “customization” or professional services work. Such efforts often produce rigid, vendor-specific, point-to-point integrations that become obsolete when changes occur, or that end up locking data into that particular vendor’s offering.
Neither approach treats connectivity or integration as a first-class business or product priority, which is why these do-it-yourself projects so often bog down the organization and fail to deliver ROI.
Towards a Solution for Scientific Data
In our next installment, we’ll discuss four critical requirements for untangling life sciences’ complex challenges around data and show how fulfilling these requirements enables a solution that:
- Delivers benefits quickly, helping speed the replatforming of scientific data to the cloud and enabling rapid implementation of high-value scientific dataflow use cases
- Scales out efficiently, enabling IT to plan and resource effectively, and freeing scientists and data scientists to refocus on scientific innovation instead of non-strategic, technical wheel spinning
An effective data cloud platform for managing scientific data requires more than just IT/cloud know-how and coding smarts. It requires a partnership with an organization that has deep life sciences understanding, a disciplined process for integration building, and a commitment to open collaboration with a broad ecosystem of partners.