
File-Based Scientific Data and the Tetra File-Log Agent

February 14, 2025

Managing scientific data efficiently requires more than just capturing raw instrument and analytical outputs. Context, organization, and a structured approach to storage and retrieval are essential for sustainable, validated data management.

TetraScience Agents facilitate this process by centralizing, contextualizing, and engineering data within the Tetra Scientific Data and AI Cloud™. The Tetra File-Log Agent (FLA) is a robust tool designed to streamline data synchronization, ensuring compliance with FAIR principles while eliminating ad hoc file management from scientific workflows.

The FLA is highly configurable and optimized for various instruments and use cases. The video below demonstrates how the agent ingests data and showcases key features, including:

  • File uploading with progress monitoring
  • Contextual labeling for enhanced searchability within the Tetra Data Platform (TDP)
  • Automated file management, eliminating manual archiving and deletion
  • Bidirectional movement of files between TDP and on-premises systems
  • Checksum verification to ensure data integrity (see the sketch after this list)
  • Scheduler monitoring to handle system outages
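
One of these features, checksum verification, boils down to hashing the local file and comparing the result against the checksum recorded for the uploaded copy. The snippet below is a minimal, hypothetical sketch of that idea in Python; it is not the FLA's actual implementation, and the remote checksum lookup is assumed to happen elsewhere.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large instrument files never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_upload(local_file: Path, reported_checksum: str) -> bool:
    """Compare the local file's hash against the checksum reported for the uploaded copy."""
    return sha256_of(local_file) == reported_checksum
```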

Benchmarks and Use Case Recommendations

A single instance of the File-Log Agent can:

  • Monitor 100+ million files across hundreds of instrument folders
  • Manage terabytes of instrument data daily
  • Enable near real-time change detection

TetraScience has optimized the FLA for many real-world scenarios:

Instrument Data Onboarding

Many scientific instruments produce data in the form of files, often stored locally or on network drives. The benchmarks below illustrate initial upload times across different configurations.

File Size | Number of Files | CPU | RAM  | Upload Threads | Total Upload Time* | Upload Speed (Mbps)*
3 MB      | 100,000         | 4   | 16GB | 6 (default)    | 7.5 hr             | 89.8
3 MB      | 100,000         | 4   | 16GB | 12             | 4.3 hr             | 155.05
3 MB      | 100,000         | 4   | 16GB | 20             | 3.0 hr             | 221.83
3 MB      | 100,000         | 4   | 16GB | 30             | 2.5 hr             | 266.9
3 MB      | 100,000         | 8   | 32GB | 48             | 1.5 hr             | 455.7
1.2 GB    | 100             | 8   | 32GB | 48             | 30 min             | 491.02
1.2 GB    | 100             | 8   | 32GB | 56             | 30 min             | 487.13

*Scales linearly with increases in thread count, compute resources, and network bandwidth
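
As a quick sanity check on these figures (assuming decimal megabytes and ignoring per-file overhead), the first row's throughput can be reproduced from the file count, file size, and elapsed time:

```python
files, size_mb, hours = 100_000, 3, 7.5           # first row of the table above
total_megabits = files * size_mb * 8              # payload size in megabits
print(round(total_megabits / (hours * 3600), 1))  # ~88.9 Mbps, close to the reported 89.8
```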

Lab Data Automation

As soon as an instrument generates a file, the FLA detects and scans it, initiating a series of operations such as uploading to the cloud, contextualization, and verification. This process enables data engineering, ensuring files are properly formatted for downstream systems like an ELN or LIMS. Low latency is key.

To optimize performance, the FLA prioritizes the most recently created and modified files, preventing historical data ingestion from delaying access to fresh data. Additionally, it intelligently manages large files and deeply nested folders, ensuring they do not block the timely ingestion of other configured scan paths.
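
One way to picture this recency-first behavior is simply ordering the candidate files by modification time before they are queued for upload. The sketch below illustrates that ordering only; it is not the agent's internal logic, and the example path is made up.

```python
from pathlib import Path

def newest_first(scan_path: str) -> list[Path]:
    """Order files so the most recently modified are uploaded before older, historical data."""
    files = [p for p in Path(scan_path).rglob("*") if p.is_file()]
    return sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)

# Hypothetical usage: upload_queue = newest_first(r"D:\instrument-data\plate-reader")
```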

Real-Time File Monitoring

The FLA continuously scans all configured paths for changes while also capturing real-time file creation and modification events from most file systems. By uploading these files independently of the scheduled scan process, the agent ensures consistent latency regardless of the volume of files being processed.
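
File-system change events of this kind are exposed by the operating system; the sketch below uses the third-party Python watchdog package purely to illustrate event-driven detection running alongside a periodic scan. It is an assumption-laden illustration, not how the FLA itself is implemented, and the monitored path is hypothetical.

```python
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

class UploadOnChange(FileSystemEventHandler):
    """Queue files for upload as soon as the OS reports a create or modify event."""

    def on_created(self, event):
        if not event.is_directory:
            print(f"queue new file for upload: {event.src_path}")

    def on_modified(self, event):
        if not event.is_directory:
            print(f"re-queue changed file: {event.src_path}")

# "/instruments/hplc-01" is a made-up example path.
observer = Observer()
observer.schedule(UploadOnChange(), path="/instruments/hplc-01", recursive=True)
observer.start()
try:
    while True:  # a periodic full scan would still run in parallel as a safety net
        time.sleep(1)
finally:
    observer.stop()
    observer.join()
```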

The chart below presents data from a test where new files were continuously added to monitored folders already containing 200,000 files. During the test, 10 new files per minute were created, scanned, and uploaded.

  • The left y-axis (blue line) represents latency, measured as the time difference between a file’s last modification and when it becomes available for search on the Tetra Data Platform.
  • The right y-axis (red line) tracks the number of new files added over time.

Despite the increasing file volume, the FLA maintained consistent average latency, staying below one minute throughout the test.

Scanning and Prioritization

While event handling efficiently manages new file creation, there are times when events may be unavailable—such as during system maintenance, network interruptions, or when adding a new path. To address these scenarios, the FLA’s continuous scanning ensures all files are processed fairly, preventing any single folder from monopolizing upload bandwidth, regardless of file count or size.

The results below, gathered from a synthetic test sorted by upload time, demonstrate this prioritization. Thousands of files were scanned and uploaded from multiple numbered paths. The highlighted paths correspond to folder “1”, while files from other folders were interleaved based on the agent’s path prioritization logic.

For example, when scanning a path with 300,000 files alongside a path with 10 files, the FLA prioritizes the 10 older files first, ensuring they are uploaded before processing the larger batch. This approach prevents small but important files from being delayed by high-volume data ingestion.
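
A simple mental model for this behavior is round-robin interleaving across scan paths, taking the oldest pending file from each path in turn so that a small folder is never starved behind a huge one. The sketch below captures that scheduling idea only; it is an illustrative approximation, not the FLA's actual prioritization algorithm.

```python
from itertools import cycle
from pathlib import Path

def interleave_oldest_first(scan_paths: list[str]) -> list[Path]:
    """Order pending files by visiting each scan path in turn, oldest file first within each path."""
    # One queue per path, each sorted so its oldest file is taken first.
    queues = [
        sorted((p for p in Path(root).rglob("*") if p.is_file()),
               key=lambda p: p.stat().st_mtime)
        for root in scan_paths
    ]
    order = []
    for i in cycle(range(len(queues))):
        if not any(queues):  # stop once every queue is drained
            break
        if queues[i]:
            order.append(queues[i].pop(0))
    return order
```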

Upload Event Notifications

The FLA generates events that track each file’s progress, including initial detection, upload status, and any errors encountered. These events integrate seamlessly with downstream monitoring systems and other tools, enabling alerts when files become available or require attention.

For a brief overview of this feature, watch this video:

High-Content Imaging

High-content imaging (HCI) generates hundreds of thousands of image files daily (see figure below), leading to rapidly increasing local storage demands. Moreover, these files are often stored in deeply nested directory structures—with ten or more levels of hierarchy (e.g., division, therapeutic area, candidate/product, condition, instrument, month, and date). This complexity makes large-scale file transfers challenging.

With the FLA, each directory can be monitored for upload to cloud storage within TDP. Once uploaded, local copies can be automatically archived and deleted, freeing up storage and eliminating the need for manual data clean-up.
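
Conceptually, this clean-up step amounts to confirming the upload and then archiving or deleting the local copy. The snippet below sketches only the local half of that flow; the preceding upload verification is assumed to have happened, and the archive location is a placeholder.

```python
import shutil
from pathlib import Path

def clean_up_local(file_path: Path, archive_dir: Path | None = None) -> None:
    """Assumes the upload has already been verified; then archive or delete the local copy."""
    if archive_dir is not None:
        archive_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(file_path), archive_dir / file_path.name)  # keep a local archive copy
    else:
        file_path.unlink()  # free local storage once the file is safely in TDP
```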

Benchmark results demonstrate that a single FLA can efficiently manage HCI instruments that generate 333,000 files per day (each 3MB) within a 10-level nested folder structure.
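
For context, that volume works out to roughly 1 TB of new image data per day (333,000 files × 3 MB, assuming decimal units).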

Next-Generation Sequencing 

Next-generation sequencing (NGS) produces large to extremely large files, ranging from gigabytes to terabytes depending on sequencing depth. Raw BCL files (organized per cycle) are automatically converted into FASTQ files (organized per read) and typically demultiplexed at this stage. Each sequencing run generates a new directory of demultiplexed FASTQ files, with one file per sample.

With multiple GB per sample, data storage can quickly become a challenge. However, the FLA automatically deletes local copies once they are successfully uploaded to TDP, significantly reducing local storage requirements.

Storage needs for sequencing facilities vary widely, whether for a pharmaceutical sequencing core or an ensemble of multiple sequencing labs. A single FLA instance can efficiently ingest sequencing data from various instruments within a single day. See the table below for detailed performance benchmarks across different instrument models.

Instrument   | Est. FASTQ files per run | Total data size per run (GB) | Est. size per FASTQ (GB) | Ingestion time (hr)
NovaSeq 6000 | 150                      | 3960.0                       | 26.40                    | 18.33
NextSeq 2000 | 720                      | 356.4                        | 0.50                     | 1.65
MiSeq        | 40                       | 9.9                          | 0.25                     | 0.046
iSeq 100     | 96                       | 0.7                          | 0.01                     | 0.003
MiSeq i100   | 80                       | 19.8                         | 0.25                     | 0.092
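
As a rough cross-check, the NovaSeq 6000 row (3,960 GB uploaded in 18.33 hours, assuming decimal gigabytes) corresponds to an effective rate of roughly 480 Mbps, in line with the upload speeds measured in the instrument onboarding benchmarks above.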

Closing Remarks

The Tetra File-Log Agent is a powerful tool for scientific data management, eliminating manual file handling while enhancing data integrity, workflow efficiency, and scalability. Its versatile capabilities support a wide range of use cases across the biopharma value chain.