Blog

Bridging network gaps: Connecting on-premises scientific data to cloud data platforms

July 9, 2024

Justin Pront
Sr. Director, Product

There’s huge potential in using scientific data for data analytics, AI, and lab data automation use cases. A critical first step to unlocking the power of scientific data is replatforming it from where it’s been created to a cloud-based data platform. It can then be further engineered or used in specific scientific analysis applications from leading vendors such as FlowJo or even ones created by TetraScience. While this may initially appear as a mere data integration activity—taking heterogeneous data files and transforming them into a schematized format—the first complexity to tackle is overcoming network connectivity from a customer’s lab to a SaaS (software as a service) cloud environment. A lab network is typically secured by protection mechanisms, such as firewalls, to ensure data security and prevent operational disruption. Overcoming these hurdles in a manner that enables a smooth flow of data between the lab and the cloud, while still meeting the security needs of an enterprise biopharma company, is a balancing act.

At TetraScience, we’ve seen many of these hurdles delay customers from reaping the benefits of accessing their scientific data by months. With our experience in replatforming data for many of the world's 50 largest biopharma companies, we’ve implemented key architecture patterns and processes to ensure both the security needed by our customers and the flexibility to adapt to their varying IT, network, and security policies.

From lab to cloud

The largest proportion of scientific data is generated directly in a physical lab, either by scientists or by a lab instrument. The data spans from simple values, such as those from a laboratory balance, to complex datasets like a full genomic sequence from a next-generation sequencer. At either end of the spectrum, what remains the same is that there’s data from a measurement of physical matter generated in a scientific facility, be it an R&D lab or a manufacturing plant.

In every case, it is critical to protect this data against unauthorized external access, regardless of where it was generated or where it resides, because:

  • Efficiency of scientific operations is business critical and millions of dollars can be lost by delays in bringing a therapeutic to market
  • Scientific data is a “crown jewel” and a competitive advantage for companies discovering, developing, and manufacturing therapeutics

To ensure data security, customers implement multiple layers of network protection and policies. Network isolation and firewalls typically exist at two different levels:

  • Corporate level (company firewall)
  • Operational technology level (“instrument/lab” network)

Scientific data can reside at multiple levels and in a plethora of different locations that are difficult to secure.

| Scientific data | Examples of source systems | System network location |
|---|---|---|
| Instrument and control data | Chromatography data system (CDS), plate reader, flow cytometer, microscope | Lab networks |
| Experiment context and design | Electronic lab notebook (ELN), laboratory information management system (LIMS), laboratory execution system (LES), manufacturing execution system (MES) | Either within the company firewall or in the cloud, often within a virtual private cloud (VPC) |
| Analysis results | Instrument data processing software, data analysis software (including Excel) | Within the company firewall (e.g., network drives) |
| External collaborator data | CRO assays, CDMO records (e.g., PDFs) | External entity |

Rather than hosting and managing a data platform yourself, you can move to a SaaS solution that operates in the cloud on infrastructure managed by a professional SaaS company. Keep in mind that if the software platform runs inside your network, on infrastructure or cloud resources that you own and manage, it’s not really SaaS. With a true SaaS solution, you won’t have to worry about ongoing capital expenditures on technology infrastructure or the additional staff needed to manage it. You will, however, still have to connect the solution to where your scientific data is generated—and secure that connection. That move comes with some complexities.

Complexities

Multiple internal networks 

Network connectivity between the cloud and a company’s private networks cannot be simplified to a single “point-to-point” connection. In addition to dealing with corporate and lab networks, SaaS solutions are often deployed within a virtual private cloud (VPC) that is connected to a customer’s own cloud infrastructure and VPCs. Securely connecting this many networks is challenging.

Segregated team organization 

Lab IT, scientists, and data teams who benefit from leveraging scientific data are organizationally separated from the security and network teams who maintain the network infrastructure and govern the policies. This structure is necessary to ensure consistency across a company and data security. However, it means that any request to set up connectivity and open up a company’s network can take days, weeks, or sometimes months to approve and implement. The teams impacted by the delays cannot remove these hurdles.

Key benefits

Accessing scientific data

The most important benefit of overcoming network connectivity with a cloud-based data platform is that scientists, data engineers, data scientists, and other stakeholders in the organization have access to their scientific data. Not only is this the first step in getting value from your data, it is also the most vital.

We’ve seen countless customers move from transferring data off lab instruments with USB sticks and overcrowded network drives to FAIR data that’s available via search in a UI, an API, or SQL. The connectivity is typically established once and can then be scaled to any data source in the network (corporate or lab), amplifying the benefits.

Reducing total cost of ownership

By using a SaaS data platform to replatform scientific data, a customer’s infrastructure teams are relieved from managing and maintaining the software and infrastructure, including upgrades and security patches. This reduces the total cost of ownership (TCO) of the solution. 

Our learnings 

At TetraScience, we've gained valuable insights from deploying solutions at many of the top 50 global biopharma companies. This experience has taught us how to architect a SaaS product that works with the complexities of our customers’ networks. Here are some of the key lessons we learned and the architectural patterns we employ.

Directionality is critical

When working with a cloud-based data platform, it’s not just about the connection between the instruments and software to the data platform. The direction of the data flow is also important. Companies’ policies can differ and require different solutions when dealing with the following dimensions:

  • Data and interfaces from your network to the SaaS provider (outbound)
  • Data and interfaces from the SaaS provider to software in your network (inbound)
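
Outbound-only connectivity is usually the easier of the two to get approved, because an agent inside the lab network initiates every connection and the firewall needs no inbound rules. Here is a minimal Python sketch of that polling pattern; the platform URL, endpoint, and response format are hypothetical illustrations, not TetraScience’s actual API:

```python
import json
import urllib.request

# Hypothetical endpoint: in a real deployment this would be the SaaS
# platform's command API, reached over outbound HTTPS only.
PLATFORM_URL = "https://platform.example.com/api/v1/agent/commands"

def build_poll_request(agent_id: str, token: str) -> urllib.request.Request:
    """Build the outbound polling request.

    Because the agent initiates every connection, the lab firewall only
    needs to allow outbound HTTPS (port 443) to the platform -- no
    inbound rules are required.
    """
    return urllib.request.Request(
        f"{PLATFORM_URL}?agent={agent_id}",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )

def handle_commands(payload: str) -> list[str]:
    """Parse the platform's (hypothetical) JSON response into upload tasks."""
    commands = json.loads(payload)
    return [c["file_path"] for c in commands if c.get("action") == "upload"]
```

An agent built this way simply loops: poll, upload any requested files over the same outbound channel, sleep, repeat.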

Multi-tenant data isolation is vital

When selecting a cloud-based data platform, it’s important to verify during the selection process that your data is logically and physically segregated from other customers’ data in the vendor-hosted solution. At TetraScience, we ensure that data from different customers are not co-hosted and that access is segregated at the interface (UI/API), processing, storage, and network levels in all of our multi-tenant deployments.
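
As a toy illustration of what logical isolation at the storage layer can look like (the key scheme and function names are hypothetical, not how the Tetra Data Platform is implemented), every object key can be namespaced by tenant, with reads refused outside the caller’s namespace. Real deployments layer this with separate accounts, access policies, and network segregation:

```python
def tenant_key(tenant_id: str, path: str) -> str:
    """Namespace an object key under its tenant (hypothetical scheme)."""
    return f"tenants/{tenant_id}/{path.lstrip('/')}"

def authorize_read(tenant_id: str, requested_key: str) -> bool:
    """Allow access only to keys inside the caller's tenant prefix."""
    return requested_key.startswith(f"tenants/{tenant_id}/")
```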

No architecture will meet all customer needs

We’ve found that while each customer’s corporate network policies follow basic tenets to ensure security, they vary enough that no single solution can meet every customer’s needs. To overcome this, we’ve employed multiple solutions and approaches to make it easy for scientific data to flow between the different data sources of a customer and the Tetra Data Platform. These include:

  • Outbound flow from proxies to the Tetra Data Platform
  • Supporting proxies as part of our cloud-orchestrated pluggable connectors (outbound)
  • Supporting multiple standard architecture patterns for secure network connections
    • Secure connections via AWS PrivateLink
    • API gateways 
  • Developing the Tetra Hub to facilitate a single point of network connectivity between the customer’s network and the Tetra Data Platform
  • Working with standalone secure proxies to ensure a limited, known connectivity scope 
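
As one concrete example of the proxy-based patterns above, a connector can route all of its outbound traffic through a single standalone proxy, so the firewall needs only one egress rule. A Python sketch using the standard library (the proxy host is a placeholder; in practice it comes from the site’s network team, often via the standard HTTPS_PROXY environment variable):

```python
import os
import urllib.request

# Placeholder proxy address -- replace with your site's actual proxy.
DEFAULT_PROXY = os.environ.get("HTTPS_PROXY", "http://proxy.lab.example.com:3128")

def make_opener(proxy: str = DEFAULT_PROXY) -> urllib.request.OpenerDirector:
    """Build an opener that routes all outbound HTTP(S) through one proxy.

    With this in place, the lab firewall only needs a single allow rule:
    lab hosts -> proxy -> the data platform's public endpoint.
    """
    handler = urllib.request.ProxyHandler({"https": proxy, "http": proxy})
    return urllib.request.build_opener(handler)
```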

Below is a representative diagram of a deployment that employs several of these methods.

At TetraScience, we’ve bridged the network gap between the cloud and the lab to help enterprise biopharma companies transform their raw scientific data into analytics- and AI-ready datasets. We’re excited to share our insights and expertise with you.