
File-Based Scientific Data and the Tetra File-Log Agent

February 14, 2025

Managing scientific data efficiently requires more than just capturing raw instrument and analytical outputs. Context, organization, and a structured approach to storage and retrieval are essential for sustainable, validated data management.

TetraScience Agents facilitate this process by centralizing, contextualizing, and engineering data within the Tetra Scientific Data and AI Cloud™. The Tetra File-Log Agent (FLA) is a robust tool designed to streamline data synchronization, ensuring compliance with FAIR principles while eliminating ad hoc file management from scientific workflows.

The FLA is highly configurable and optimized for various instruments and use cases. The video below demonstrates how the agent ingests data and showcases key features, including:

  • File uploading with progress monitoring
  • Contextual labeling for enhanced searchability within the Tetra Data Platform (TDP)
  • Automated file management, eliminating manual archiving and deletion
  • Bidirectional movement of files between TDP and on-premises systems
  • Checksum verification to ensure data integrity (see the sketch after this list)
  • Scheduler monitoring to handle system outages
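
One of these features, checksum verification, boils down to hashing the local file and comparing the result against the checksum recorded for the uploaded copy. The snippet below is a minimal, hypothetical sketch of that idea in Python; it is not the FLA's actual implementation, and the remote checksum lookup is assumed to happen elsewhere.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large instrument files never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_upload(local_file: Path, reported_checksum: str) -> bool:
    """Compare the local file's hash against the checksum reported for the uploaded copy."""
    return sha256_of(local_file) == reported_checksum
```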

Benchmarks and Use Case Recommendations

A single instance of the File-Log Agent can:

  • Monitor 100+ million files across hundreds of instrument folders
  • Manage terabytes of instrument data daily
  • Enable near real-time change detection

TetraScience has optimized the FLA for many real-world scenarios:

Instrument Data Onboarding

Many scientific instruments produce data in the form of files, often stored locally or on network drives. The benchmarks below illustrate initial upload times across different configurations.

File Size | Number of Files | CPU | RAM  | Upload Threads | Total Upload Time* | Upload Speed (Mbps)*
3 MB      | 100,000         | 4   | 16GB | 6 (default)    | 7.5 hr             | 89.8
3 MB      | 100,000         | 4   | 16GB | 12             | 4.3 hr             | 155.05
3 MB      | 100,000         | 4   | 16GB | 20             | 3.0 hr             | 221.83
3 MB      | 100,000         | 4   | 16GB | 30             | 2.5 hr             | 266.9
3 MB      | 100,000         | 8   | 32GB | 48             | 1.5 hr             | 455.7
1.2 GB    | 100             | 8   | 32GB | 48             | 30 min             | 491.02
1.2 GB    | 100             | 8   | 32GB | 56             | 30 min             | 487.13

*Scales linearly with increases in thread count, compute resources, and network bandwidth
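
As a quick sanity check on these figures (assuming decimal megabytes and ignoring per-file overhead), the first row's throughput can be reproduced from the file count, file size, and elapsed time:

```python
files, size_mb, hours = 100_000, 3, 7.5           # first row of the table above
total_megabits = files * size_mb * 8              # payload size in megabits
print(round(total_megabits / (hours * 3600), 1))  # ~88.9 Mbps, close to the reported 89.8
```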

Lab Data Automation

As soon as an instrument generates a file, the FLA detects and scans it, initiating a series of operations such as uploading to the cloud, contextualization, and verification. This process enables data engineering, ensuring files are properly formatted for downstream systems like an ELN or LIMS. Low latency is key.

To optimize performance, the FLA prioritizes the most recently created and modified files, preventing historical data ingestion from delaying access to fresh data. Additionally, it intelligently manages large files and deeply nested folders, ensuring they do not block the timely ingestion of other configured scan paths.
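
One way to picture this recency-first behavior is simply ordering the candidate files by modification time before they are queued for upload. The sketch below illustrates that ordering only; it is not the agent's internal logic, and the example path is made up.

```python
from pathlib import Path

def newest_first(scan_path: str) -> list[Path]:
    """Order files so the most recently modified are uploaded before older, historical data."""
    files = [p for p in Path(scan_path).rglob("*") if p.is_file()]
    return sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)

# Hypothetical usage: upload_queue = newest_first(r"D:\instrument-data\plate-reader")
```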

Real-Time File Monitoring

The FLA continuously scans all configured paths for changes while also capturing real-time file creation and modification events from most file systems. By uploading these files independently of the scheduled scan process, the agent ensures consistent latency regardless of the volume of files being processed.
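
File-system change events of this kind are exposed by the operating system; the sketch below uses the third-party Python watchdog package purely to illustrate event-driven detection running alongside a periodic scan. It is an assumption-laden illustration, not how the FLA itself is implemented, and the monitored path is hypothetical.

```python
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

class UploadOnChange(FileSystemEventHandler):
    """Queue files for upload as soon as the OS reports a create or modify event."""

    def on_created(self, event):
        if not event.is_directory:
            print(f"queue new file for upload: {event.src_path}")

    def on_modified(self, event):
        if not event.is_directory:
            print(f"re-queue changed file: {event.src_path}")

# "/instruments/hplc-01" is a made-up example path.
observer = Observer()
observer.schedule(UploadOnChange(), path="/instruments/hplc-01", recursive=True)
observer.start()
try:
    while True:  # a periodic full scan would still run in parallel as a safety net
        time.sleep(1)
finally:
    observer.stop()
    observer.join()
```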

The chart below presents data from a test where new files were continuously added to monitored folders already containing 200,000 files. During the test, 10 new files per minute were created, scanned, and uploaded.

  • The left y-axis (blue line) represents latency, measured as the time difference between a file’s last modification and when it becomes available for search on the Tetra Data Platform.
  • The right y-axis (red line) tracks the number of new files added over time.

Despite the increasing file volume, the FLA maintained consistent average latency, staying below one minute throughout the test.

Scanning and Prioritization

While event handling efficiently manages new file creation, there are times when events may be unavailable—such as during system maintenance, network interruptions, or when adding a new path. To address these scenarios, the FLA’s continuous scanning ensures all files are processed fairly, preventing any single folder from monopolizing upload bandwidth, regardless of file count or size.

The results below, gathered from a synthetic test sorted by upload time, demonstrate this prioritization. Thousands of files were scanned and uploaded from multiple numbered paths. The highlighted paths correspond to folder “1”, while files from other folders were interleaved based on the agent’s path prioritization logic.

For example, when scanning a path with 300,000 files alongside a path with 10 files, the FLA prioritizes the 10 older files first, ensuring they are uploaded before processing the larger batch. This approach prevents small but important files from being delayed by high-volume data ingestion.
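
A simple mental model for this behavior is round-robin interleaving across scan paths, taking the oldest pending file from each path in turn so that a small folder is never starved behind a huge one. The sketch below captures that scheduling idea only; it is an illustrative approximation, not the FLA's actual prioritization algorithm.

```python
from itertools import cycle
from pathlib import Path

def interleave_oldest_first(scan_paths: list[str]) -> list[Path]:
    """Order pending files by visiting each scan path in turn, oldest file first within each path."""
    # One queue per path, each sorted so its oldest file is taken first.
    queues = [
        sorted((p for p in Path(root).rglob("*") if p.is_file()),
               key=lambda p: p.stat().st_mtime)
        for root in scan_paths
    ]
    order = []
    for i in cycle(range(len(queues))):
        if not any(queues):  # stop once every queue is drained
            break
        if queues[i]:
            order.append(queues[i].pop(0))
    return order
```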

Upload Event Notifications

The FLA generates events that track each file’s progress, including initial detection, upload status, and any errors encountered. These events integrate seamlessly with downstream monitoring systems and other tools, enabling alerts when files become available or require attention.

For a brief overview of this feature, watch this video:

High-Content Imaging

High-content imaging (HCI) generates hundreds of thousands of image files daily (see figure below), leading to rapidly increasing local storage demands. Moreover, these files are often stored in deeply nested directory structures—with ten or more levels of hierarchy (e.g., division, therapeutic area, candidate/product, condition, instrument, month, and date). This complexity makes large-scale file transfers challenging.

With the FLA, each directory can be monitored for upload to cloud storage within TDP. Once uploaded, local copies can be automatically archived and deleted, freeing up storage and eliminating the need for manual data clean-up.
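
Conceptually, this clean-up step amounts to confirming the upload and then archiving or deleting the local copy. The snippet below sketches only the local half of that flow; the preceding upload verification is assumed to have happened, and the archive location is a placeholder.

```python
import shutil
from pathlib import Path

def clean_up_local(file_path: Path, archive_dir: Path | None = None) -> None:
    """Assumes the upload has already been verified; then archive or delete the local copy."""
    if archive_dir is not None:
        archive_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(file_path), archive_dir / file_path.name)  # keep a local archive copy
    else:
        file_path.unlink()  # free local storage once the file is safely in TDP
```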

Benchmark results demonstrate that a single FLA can efficiently manage HCI instruments that generate 333,000 files per day (each 3MB) within a 10-level nested folder structure.
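
For context, that volume works out to roughly 1 TB of new image data per day (333,000 files × 3 MB, assuming decimal units).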

Next-Generation Sequencing 

Next-generation sequencing (NGS) produces large to extremely large files, ranging from gigabytes to terabytes depending on sequencing depth. Raw BCL files (organized per cycle) are automatically converted into FASTQ files (organized per read) and typically demultiplexed at this stage. Each sequencing run generates a new directory of demultiplexed FASTQ files, with one file per sample.

With multiple GB per sample, data storage can quickly become a challenge. However, the FLA automatically deletes local copies once they are successfully uploaded to TDP, significantly reducing local storage requirements.

Storage needs for sequencing facilities vary widely, whether for a pharmaceutical sequencing core or an ensemble of multiple sequencing labs. A single FLA instance can efficiently ingest sequencing data from various instruments within a single day. See the table below for detailed performance benchmarks across different instrument models.

Instrument   | Est. FASTQ files per run | Total data size per run (GB) | Est. size per FASTQ (GB) | Ingestion time (hr)
NovaSeq 6000 | 150                      | 3960.0                       | 26.40                    | 18.33
NextSeq 2000 | 720                      | 356.4                        | 0.50                     | 1.65
MiSeq        | 40                       | 9.9                          | 0.25                     | 0.046
iSeq 100     | 96                       | 0.7                          | 0.01                     | 0.003
MiSeq i100   | 80                       | 19.8                         | 0.25                     | 0.092
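
As a rough cross-check, the NovaSeq 6000 row (3,960 GB uploaded in 18.33 hours, assuming decimal gigabytes) corresponds to an effective rate of roughly 480 Mbps, in line with the upload speeds measured in the instrument onboarding benchmarks above.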

Closing Remarks

The Tetra File-Log Agent is a powerful tool for scientific data management, eliminating manual file handling while enhancing data integrity, workflow efficiency, and scalability. Its versatile capabilities support a wide range of use cases across the biopharma value chain.