Blog

Data Readiness in AI

May 7, 2024

Dr. Daniela Pedersen
Ph.D., TetraScience VP of Product Marketing

This blog post reflects some of the key thoughts captured during the panel discussion “Data readiness for AI” at Bio-IT World Conference & Expo 2024 with delegates from TetraScience, Bayer, AbbVie, Sanofi, Merelogic, and DataRobot*.

While acquiring and governing data are key to using data for AI, the human component cannot be neglected and is here to stay. Organizations need to work better and smarter with data and focus on the outcomes.

Purpose and goals

Biopharmaceutical organizations are in the business of developing therapies and AI is poised to help. One of the many examples is indication expansion or drug repurposing, that is, finding new indications for an approved drug.

Sometimes, AI is applied to answer a very specific question. For instance, how can the formulation for a specific drug be optimized? It can also be used to design experiments for broadly scoped exploratory work or a stack of different assays in research. Regardless of the application, the key is to prepare data in a way that aligns with the desired outcome.

To take the right approach, companies should consider the business value of applying a certain technology and the expected ROI. Some technologies require significant implementation efforts and change management while the ROI is only marginal.

Data and metadata

The success of artificial intelligence (AI) models relies heavily on the volume and diversity of training data, which ensures the models are robust and reliable. This is no different in life science. In-house experimental data is often insufficient to answer certain scientific questions. Incorporating external data can provide the required diversity to fill the gaps. Teams can combine data from internal sources (scientists and labs), from partners like contract research organizations (CROs), and from external sources such as biobanks.

Setting overly strict standards for data management often impedes data usage instead of optimizing it. The critical factor is how the data is organized, which differs for each organization. The data's organization must be independent of the systems that create it and driven by the format required to achieve the desired outcome.

Data quality is critical. High-quality data is traceable from its origin with full data lineage and available in a shareable format. To help achieve this, organizations can digitize their data, set up pragmatic governance, and create simple decision-making workflows.
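As a minimal sketch of what "traceable with full data lineage" can mean in practice, the snippet below attaches a provenance log to a dataset so every processing step remains visible. The class, field names, and step descriptions are hypothetical illustrations, not a specific TetraScience or panelist implementation.

```python
# A minimal sketch of recording data lineage, using hypothetical field
# names. Each processing step appends a provenance entry, so the final
# dataset remains traceable back to its origin.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Dataset:
    name: str
    values: list
    lineage: list = field(default_factory=list)

    def record_step(self, step: str, actor: str) -> None:
        """Append a provenance entry for one processing step."""
        self.lineage.append({
            "step": step,
            "actor": actor,
            "at": datetime.now(timezone.utc).isoformat(),
        })

ds = Dataset("assay_results", values=[0.12, 0.34])
ds.record_step("exported from instrument", actor="lab_scientist")
ds.record_step("normalized and QC-checked", actor="data_engineer")

for entry in ds.lineage:
    print(entry["step"], "by", entry["actor"])
```

In a real pipeline the same idea is usually enforced by the data platform rather than hand-rolled, but the principle is identical: the lineage travels with the data in a shareable format.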

Another aspect is the metadata. Without the right metadata, data cannot be effectively used for AI, or the results may be inaccurate. Data scientists often manage metadata in Excel sheets or similar tools. But when the data (and metadata) is used by different teams, metadata must be trackable and harmonized. Otherwise, it can become a big stumbling block that causes months-long delays.
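To make the harmonization point concrete, here is a small sketch of mapping team-specific metadata keys onto one controlled vocabulary. The field names and mapping are invented for illustration; real harmonization typically relies on an agreed ontology or a data platform, not an ad hoc script.

```python
# A minimal sketch of metadata harmonization, assuming hypothetical
# field names. Different teams often label the same concept differently
# ("instr", "Instrument", "instrument_id"); mapping them to one
# controlled vocabulary keeps the combined dataset usable across teams.

# Hypothetical mapping from source-specific keys to harmonized keys.
FIELD_MAP = {
    "instr": "instrument_id",
    "Instrument": "instrument_id",
    "sample": "sample_id",
    "Sample ID": "sample_id",
    "date": "run_date",
    "Run Date": "run_date",
}

def harmonize(record: dict) -> dict:
    """Rename known metadata keys to the controlled vocabulary,
    keeping already-harmonized or unknown keys unchanged."""
    return {FIELD_MAP.get(key, key): value for key, value in record.items()}

# Two records exported by different teams:
lab_a = {"instr": "HPLC-07", "Sample ID": "S-123", "date": "2024-05-01"}
lab_b = {"Instrument": "HPLC-07", "sample": "S-456", "Run Date": "2024-05-02"}

print(harmonize(lab_a))  # {'instrument_id': 'HPLC-07', 'sample_id': 'S-123', 'run_date': '2024-05-01'}
print(harmonize(lab_b))
```

Without such a shared vocabulary, every downstream consumer must rediscover these mappings, which is exactly the stumbling block described above.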

Data generators and data consumers 

Teams typically consist of three cohorts: scientists as the data generators, data scientists as data consumers or end users, and IT experts who support the system. Often, there is little synergy and collaboration between these groups, as each has different priorities. Data engineers from IT are closest to the system and are responsible for the technical aspects. Bench scientists focus on the execution of their wet lab experiments and typically don't consider data reuse. Data scientists often find the data generated in labs to be inconsistent, messy, and difficult to consume. This disconnect leads to misalignment, creating a hurdle to collaboration. In these cases, organizations struggle to gain immediate or medium-term value from scientific data, including its usage for AI.

It is important for teams to avoid getting frustrated and instead try to understand each other’s constraints and work together to find solutions. While lab leaders can orchestrate and drive this collaboration, changes in leadership can disrupt the process.

Scientific AI aims to accelerate and improve scientific outcomes, and that goal points to the solution: science and scientists should take the lead. They can define the scientific use cases that guide data abstraction and implementation strategies.

Question from a wet lab scientist:

The scientist should be in the center, but they are often not knowledgeable enough to speak confidently with IT and data scientists. What can be done to give them more confidence without a data science background?

Answer from the panel:

While upskilling teams on the fundamentals of AI helps, "copilots" who know the science and the data can support them further. These copilots are "trilingual": they know the domain, the data, and the technology. They can help scientists understand the business context of a model, and being able to conceptualize it enables scientists to challenge the models.

People and procedures

People have diverse backgrounds and experiences. While some are tech-savvy, others trust paper more than electronic documents. Lab scientists may struggle to adapt to the continual and sometimes unsuccessful introduction of new technology and tools. But agility in working with data and technology is a must for successful teams. For that reason, organizations need to find ways to overcome this reluctance and encourage adoption among users.

Also, when moving to workflows that use AI, a cultural shift needs to happen. Scientists designing experiments need to learn how to leverage AI/ML to make their projects more successful.

Therefore, it is critical to change the data culture in the organization. This can be achieved by involving end users in the data strategy early on and by collectively exploring ways to combine data domains for AI. Such efforts will help blend scientific workflows and satisfy the needs of AI.

Data and domains

Awareness of data and its consistency is paramount, especially when moving between domains. Each domain has its specific way of working, making it difficult to agree on data-related procedures across different domains. Therefore, it is critical to think collectively about how to bring data from different domains together for AI.

Workflows in chemistry, manufacturing, and controls (CMC) are more repetitive than those upstream (e.g. high-throughput screening in early discovery), where consistency is more difficult to achieve. Research is often conducted in silos, making it difficult to know what data is available. Also, most organizations lack a clear data policy, which leads to inconsistent data. Thus, data governance and the strategies that incentivize adherence to these rules must be tailored for different stages and labs. This is crucial to shift mindsets and behaviors toward fostering the data consistency that AI requires. 

Final thoughts

Behind every data point is a patient who hopes to overcome their illness. Are we heading toward a future where AI will single-handedly provide us with cures? No. We need good data quality and consistency, good metadata, good lineage, good data governance, and thoughtful science (from empowered scientists) to enable human-driven Scientific AI. 

To learn more about how to satisfy data-related requirements, read the Tetra Scientific Data and AI Cloud solution brief.

*Acknowledgements: the following experts have contributed the content on which this blog post is based: Gian Prakash (AbbVie), Jay Schuren (DataRobot), Jesse Johnson (Merelogic), Santha Ramakrishnan (Bayer), Shameer Khader (Sanofi), and Spin Wang (TetraScience).