Even More Rapid Retrieval from Very Large Files with Rust


In a previous post, Rapid Data Retrieval from Very Large Files, Eric Danielson discussed how Ginkgo’s software engineers are enabling fast access to massive sequence datasets. As a bioinformatician, I rely on file formats such as fasta and gff to work with command-line tools and biological sequence-processing libraries. However, these flat-file, semi-structured formats don’t scale to hundreds of millions of sequences. File compression makes data storage tractable, but finding and extracting a specific sequence at an arbitrary position among millions in the file has been a bottleneck.

Please read Eric’s full article, but the gist is that block gzip (BGZip), from the samtools/htslib project, is the highest-performing tool for getting (compressed) data out of a standard file store. BGZip’s “trick” is to concatenate multiple gzipped files together, where a single gzip block is no more than 64KB. A secondary index file, with a .gzi extension, keeps track of the locations of the blocks with respect to the compressed and uncompressed bytes. BGZipped data is itself a valid gzip archive and thus can be (de)compressed with any gzip library.
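The concatenation trick is easy to verify with any gzip library. Here is a toy illustration in Python (this is just plain gzip member concatenation, not BGZip itself):

```python
import gzip

# Compress two chunks independently, as BGZip does with its <=64KB blocks
block1 = gzip.compress(b">seq1\nMKV\n")
block2 = gzip.compress(b">seq2\nMLA\n")

# Concatenating the members still yields a valid gzip stream,
# and any gzip decompressor will read straight through both members
combined = block1 + block2
print(gzip.decompress(combined))  # b'>seq1\nMKV\n>seq2\nMLA\n'
```

BGZip adds the part plain gzip lacks: the .gzi index that maps uncompressed offsets to block boundaries, so a reader can skip directly to the member it needs.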

Using these tools, we can retrieve a sequence somewhere in the middle of a huge, bgzipped fasta file (provided we already know the position of its first uncompressed byte), skipping directly to the relevant block – even if it’s at the end of the file – and only decompressing the chunks that we actually need.

In practice, we usually won’t know ahead of time where a particular sequence is located in the file, but rather just its name. The fasta format specifies that names and other metadata are stored in the headers of the fasta file; each header line starts with a “>” character, which marks the start of a new record. The actual sequence follows on the next line(s). Eric’s post also discussed creating a 3-column, tab-separated location file that is especially useful for fasta files. This tracks the sequence name (i.e. pulled from the fasta header), the uncompressed start position and the size (in bytes) of the full record. We’ll use a .loc extension for this file.

We’ve used this file format in Ginkgo’s metagenomic data workflows together with htslib’s bgzip command line tool, which has options for specifying the position and number of characters of the desired place in the file.

So why did we want to improve on this already-excellent solution? Well, the bgzip command line tool can only extract one contiguous part of the file (one start position and one size). We want to be able to retrieve potentially thousands of sequences from our fasta files in one shot.

We had been handling this in the simplest way possible: a bash script that loops through all the sequences/positions and calls bgzip. But it turns out that this is pretty inefficient (I’ll get to the specific timing results in a bit). Similarly, we’d been building the location file by un-gzipping the library fasta file and using a simple python loop to record each sequence name and position (which doesn’t exploit the block gzip structure at all).

Ideally, we would implement our BGZip reading tools with the native C library from htslib. But this has drawbacks – mostly in maintenance costs, since we don’t have any C developers in our group. Kudos to samtools for open-sourcing this codebase, but it doesn’t meet our needs.

This is where Rust and the excellent noodles crate, a collection of bioinformatics I/O libraries, come in.

In the last few years, the Rust programming language has been taking the bioinformatics community by storm. It offers thread-safe, low-level access to all your bits and bytes, which makes it a perfect fit here, since we want fast reads from bgzipped files with as small a memory footprint as possible.

Using Rust and noodles, we can create small, safe command line utilities for accessing our compressed data. And while Rust has a steeper learning curve than scripting languages like python, it’s not nearly as daunting as C for this use case. Moreover, noodles provides parallel and asynchronous processing plugins for BGZip formats, which give us, for example, simultaneous parsing of independent data blocks.

Reading Sequences in BGZF-compressed fasta Files

Firstly let’s set up some test data which we’ll use going forward to obtain performance metrics. We’ll be using amino acid sequences from the same soil metagenomic library that Eric talked about in his last post. It contains 150 million sequences and the fasta file is 235 gigabytes when uncompressed.

However, for this post, we don’t need such a large file just for benchmarking. These methods are all linear in computational complexity, so I’ll just grab a subset of the total:

$ grep -c ">" ca29_soil2_aa_sub.fasta

Let’s use bgzip to compress (the -c flag) and index (the -r flag) the fasta file:

$ bgzip -c ca29_soil2_aa_sub.fasta > ca29_soil2_aa_sub.fasta.gz
$ bgzip -r ca29_soil2_aa_sub.fasta.gz

The last command creates a binary index file called ca29_soil2_aa_sub.fasta.gz.gzi in our working directory.

In our standard setup, we needed to decompress the gzip file (since we do not store uncompressed files) to get the locations of the fasta headers and the lengths of the sequences (tracking the raw number of characters, including the header line). We can do that with a relatively simple python script that keeps track of the current byte position of each new header and writes out a tab-separated locations file:
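The original script isn’t reproduced here, but a minimal sketch of such an indexer might look like the following (a reconstruction, not the exact production script; it writes name, uncompressed start position and record size, tab-separated, to an adjacent .loc file):

```python
# fasta_indexer.py (sketch): walk the uncompressed fasta once, tracking the
# byte position of each ">" header and the total size of each record
import sys

def index_fasta(path):
    pos = 0        # current byte offset in the uncompressed file
    name = None    # name of the record we are currently inside
    start = 0      # byte offset where that record began
    size = 0       # running size of that record, in bytes
    with open(path, "rb") as fasta, open(path + ".loc", "w") as loc:
        for line in fasta:
            if line.startswith(b">"):
                if name is not None:
                    loc.write(f"{name}\t{start}\t{size}\n")
                name = line[1:].split()[0].decode()  # header up to first whitespace
                start = pos
                size = 0
            size += len(line)
            pos += len(line)
        if name is not None:  # flush the final record
            loc.write(f"{name}\t{start}\t{size}\n")

if __name__ == "__main__" and len(sys.argv) > 1:
    index_fasta(sys.argv[1])
```

Note that the record size includes the header line itself, matching the .loc format described above.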

Let’s run that now and also keep track of the execution time.

$ time ( 
   zcat < ca29_soil2_aa_sub.fasta.gz > ca29_soil2_aa_sub.fasta &&\
   python fasta_indexer.py ca29_soil2_aa_sub.fasta 
)
real    1m10.992s
user    0m31.566s
sys 0m3.770s

time is a system command available in UNIX systems that tells us how long the program takes to run. The output we care about is “real” time, which is what the human user is experiencing.

The script fasta_indexer.py made a location file called ca29_soil2_aa_sub.fasta.loc. Let’s take a peek at it. Column 1 is the header name, Column 2 is the uncompressed start position of that record and Column 3 is its size in bytes:

$ head -n 2 ca29_soil2_aa_sub.fasta.loc && echo "..." && tail -n 2 ca29_soil2_aa_sub.fasta.loc
ca29_soil2_c51cf9d4-216c-4963-8f9b-65ee6b197a07_1   0   116
ca29_soil2_c51cf9d4-216c-4963-8f9b-65ee6b197a07_4   116 448
...
ca29_soil2_99a091d5-f352-4e66-8a50-3c48d2070870_19  324714018   187
ca29_soil2_e966a3b0-d674-47bb-b8b4-6db071c3e3f4_5   324714205   456

For timing, let’s just say our sequences are randomly located in the file, so we can subsample ca29_soil2_aa_sub.fasta.loc for the simulation. 3000 sequences should be the right size for this demo:

$ shuf -n 3000 ca29_soil2_aa_sub.fasta.loc | sort -k2n > query_loc.txt

Now let’s run the bgzip extraction code, accessing each sequence via the -b/-s (offset/size) command line options from within a bash loop:

$ time (
  while IFS= read -r loc_id; do
    ## converts loc file to bgzip arguments
    offset=$( echo "$loc_id" | cut -f2  )
    size=$( echo "$loc_id" | cut -f3  )
    bgzip -b $offset -s $size ca29_soil2_aa_sub.fasta.gz >> ca29_query.fasta
  done < query_loc.txt
)
real    0m31.408s
user    0m10.102s
sys 0m14.933s

Can we do better than a bash loop by rewriting our BGZip reader to process all of our desired sequence locations in one go?

Let’s give it a try: It turns out that the IndexedReader from the noodles-bgzf sub-crate. Here we simply need to adapt the example code for seeking an uncompressed position into a loop that does the same thing. We’ll call this program bgzread:

// bgzread

use std::{
    env,
    fs::File,
    io::{self, BufRead, BufReader, Read, Seek, SeekFrom, Write},
};

use noodles_bgzf as bgzf;

fn main() -> io::Result<()> {
    // Gather positional cli arguments for the input files
    let mut args = env::args().skip(1);
    let src = args.next().expect("missing src");
    let loc = args.next().expect("missing location");

    // Set up the readers for the compressed fasta and location files
    let mut reader = bgzf::indexed_reader::Builder::default().build_from_path(src)?;
    let loc_reader = File::open(loc).map(BufReader::new)?;

    // We'll be writing the final data to standard out
    let mut writer = io::stdout().lock();

    // Loop through the location reader to get positions
    for line in loc_reader.lines() {
        let line = line?;

        // The position and length are the last two whitespace-separated
        // fields, so a leading name column (as in our .loc files) is ignored
        let mut fields = line.split_whitespace().rev();
        let len: usize = fields.next().expect("missing length").parse().expect("invalid length");
        let pos: u64 = fields.next().expect("missing position").parse().expect("invalid position");

        // Seek to the uncompressed start position and read 'len' bytes into the buffer
        reader.seek(SeekFrom::Start(pos))?;
        let mut buf = vec![0; len];
        reader.read_exact(&mut buf)?;

        // Write the outputs
        writer.write_all(&buf)?;
    }

    Ok(())
}

Dare I claim that this is a pretty darn simple program? We have to do a little bit of explicit typing for the position and len variables, and all the ? and expect statements are a bit mysterious. But basically, this is how the Rust language gets developers to be explicit about what types of data we’re passing around and to handle potential errors. In this example, we’ll just throw the built-in panic message if, say, the user-specified input file doesn’t exist.

Full disclosure that this isn’t the exact bgzread script we have in production at Ginkgo – we’ve got some customized error handling and other file buffering tricks. But this will be good enough for the demo.

The other nice part about Rust is that the cargo packaging system compiles our program, taking care of linking libraries and dependencies. We can just ship the standalone binary to our users.

Now that I’ve covered the basics, we can get to the good stuff:

$ time bgzread ca29_soil2_aa_sub.fasta.gz query_loc.txt > ca29_query.fasta
real    0m4.057s
user    0m1.165s
sys 0m0.158s

Based on the above time results, re-implementing the loop in Rust gives us a nearly 8x speedup on our demo sequences (31.4s down to 4.1s)!
We’ll get to why this is so much faster a bit later. For now I’ll just say that it’s really nice to have control over all the BGZF classes, which are natively implemented by noodles-bgzf.

You’ll notice that we’re also using BufReader so that we can dump decompressed, buffered outputs to stdout so that we never have to hold much data in memory.

I’ll also mention that this bgzip reader can also be made multi-threaded (which I’m not showing in the example code). Since the gzip blocks are independent, we can potentially access and decompress multiple blocks simultaneously. I won’t run that here, since the datasets are small enough that the parallelization overhead isn’t worth it. Later, we will see an example of thread-safe file access that we get in Rust.

What is the performance of bgzread in the wild? We have a Nextflow pipeline called nextflow-fasta-selector to look up sequences from their names (you don’t need to know the locations ahead of time). In our test case, we wanted to look up 20k sequences from the full ca29 contig library:

Nextflow pipeline screenshot of 20k sequence retrieval

Astonishingly, this results in a 14.8x speedup, and we are reading 240x less data!
The bgzip+bash loop is not quite as bad as it looks here, since in production we can split the 20k sequences into chunks of around 500-1k sequences and batch-process them. However, it is clearly preferable not to have to set up a separate bgzip CLI call for every seek + read/inflate operation we need to do.

Understanding GZI Files

Under the hood, BGZF uses another file to track the positions of the independently-compressed gzip blocks: the gzip index, or GZI file, which uses the .gzi extension. It consists of the compressed and uncompressed offsets of the blocks, stored as 64-bit unsigned integers (starting with the second block, of course; the first block always starts at (0, 0)). But don’t bother trying to read this file directly, fellow human: it’s in a binary format.
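That said, the binary layout is simple enough for a machine to decode in a few lines. Here is a sketch in Python; it assumes htslib’s gzi layout, in which a little-endian u64 count of entries is followed by that many (compressed offset, uncompressed offset) u64 pairs:

```python
import struct

def read_gzi(path):
    """Return a list of (compressed_offset, uncompressed_offset) block pairs."""
    with open(path, "rb") as f:
        # Header: number of index entries, as a little-endian u64
        (n_entries,) = struct.unpack("<Q", f.read(8))
        entries = [(0, 0)]  # the first block implicitly starts at (0, 0)
        for _ in range(n_entries):
            # Each entry: compressed offset, then uncompressed offset (u64 LE)
            entries.append(struct.unpack("<QQ", f.read(16)))
    return entries
```

This is the same table that both bgzip and noodles load before they can translate an uncompressed position into a block to decompress.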

Earlier in this post, I made the claim that part of what we gain by using noodles is not just that Rust loops are faster than bash loops, but that by writing our own program, we only have to read the GZI file once. Our program gets the gzip block positions, and then we can look up our sequences iteratively within those blocks. The bgzip CLI, by contrast, has to read the GZI file independently every time we ask it for a position.

Let’s take a look at how long each program is spending reading this file. We can do this using strace, a tool available in Linux. Since both bgzip’s C routines and Rust’s noodles library are at the end of the day using the system calls to read files, strace can tell us when those calls are happening.

For strace, we’ll alias a command that traces the openat, read and close file operations; the -ttt flag gives us timestamps. Finally, we’ll redirect the log (which would normally go to stderr) into a file:

$ alias stracet="strace -ttt -o >(cat >strace.log) -e trace=openat,close,read"

Now let’s call this command for a single bgzip run, for an arbitrary sequence and parse the log file to see where the gzi is getting open and read:

$ stracet bgzip -b 112093709 -s 1149 ca29_soil2_aa_sub.fasta.gz > seq.fa
$ sed -rn '/openat\(.*\.gzi\"/,/close/p' strace.log
1677366815.387446 openat(AT_FDCWD, "ca29_soil2_aa_sub.fasta.gz.gzi", O_RDONLY) = 5
1677366815.388736 read(5, "n\23\0\0\0\0\0\0$\221\0\0\0\0\0\0\0\377\0\0\0\0\0\0^\"\1\0\0\0\0\0"..., 4096) = 4096
1677366815.389304 read(5, "\0\0\377\0\0\0\0\0\311\2\220\0\0\0\0\0\0\377\377\0\0\0\0\0\7\221\220\0\0\0\0\0"..., 4096) = 4096
1677366815.389667 read(5, "\0\0\376\1\0\0\0\0h\215\37\1\0\0\0\0\0\377\376\1\0\0\0\0\370\34 \1\0\0\0\0"..., 4096) = 4096
1677366815.389938 read(5, "\0\0\375\2\0\0\0\0\201$\257\1\0\0\0\0\0\377\375\2\0\0\0\0\233\266\257\1\0\0\0\0"..., 4096) = 4096
1677366815.390206 read(5, "\0\0\374\3\0\0\0\0c\233>\2\0\0\0\0\0\377\374\3\0\0\0\0\27*?\2\0\0\0\0"..., 4096) = 4096
1677366815.390468 read(5, "\0\0\373\4\0\0\0\0/\331\315\2\0\0\0\0\0\377\373\4\0\0\0\0>g\316\2\0\0\0\0"..., 4096) = 4096
1677366815.390690 read(5, "\0\0\372\5\0\0\0\0hZ]\3\0\0\0\0\0\377\372\5\0\0\0\0Z\351]\3\0\0\0\0"..., 4096) = 4096
1677366815.391017 read(5, "\0\0\371\6\0\0\0\0y\376\354\3\0\0\0\0\0\377\371\6\0\0\0\0U\216\355\3\0\0\0\0"..., 4096) = 4096
1677366815.391344 read(5, "\0\0\370\7\0\0\0\0\236\230|\4\0\0\0\0\0\377\370\7\0\0\0\0\270*}\4\0\0\0\0"..., 4096) = 4096
1677366815.391668 read(5, "\0\0\367\10\0\0\0\0\311\34\f\5\0\0\0\0\0\377\367\10\0\0\0\0\335\254\f\5\0\0\0\0"..., 4096) = 4096
1677366815.391938 read(5, "\0\0\366\t\0\0\0\0\206\217\233\5\0\0\0\0\0\377\366\t\0\0\0\0\347\35\234\5\0\0\0\0"..., 4096) = 4096
1677366815.392226 read(5, "\0\0\365\n\0\0\0\0\212\16+\6\0\0\0\0\0\377\365\n\0\0\0\0A\237+\6\0\0\0\0"..., 4096) = 4096
1677366815.392492 read(5, "\0\0\364\v\0\0\0\0r\246\272\6\0\0\0\0\0\377\364\v\0\0\0\0'6\273\6\0\0\0\0"..., 4096) = 4096
1677366815.392724 read(5, "\0\0\363\f\0\0\0\0\276\37J\7\0\0\0\0\0\377\363\f\0\0\0\0\217\257J\7\0\0\0\0"..., 4096) = 4096
1677366815.393001 read(5, "\0\0\362\r\0\0\0\0z\231\331\7\0\0\0\0\0\377\362\r\0\0\0\0\f,\332\7\0\0\0\0"..., 4096) = 4096
1677366815.393291 read(5, "\0\0\361\16\0\0\0\0\270\\i\10\0\0\0\0\0\377\361\16\0\0\0\0\355\354i\10\0\0\0\0"..., 4096) = 4096
1677366815.393542 read(5, "\0\0\360\17\0\0\0\0.\311\370\10\0\0\0\0\0\377\360\17\0\0\0\0\226V\371\10\0\0\0\0"..., 4096) = 4096
1677366815.393788 read(5, "\0\0\357\20\0\0\0\0\210g\210\t\0\0\0\0\0\377\357\20\0\0\0\0\330\370\210\t\0\0\0\0"..., 4096) = 4096
1677366815.394110 read(5, "\0\0\356\21\0\0\0\0\f\342\27\n\0\0\0\0\0\377\356\21\0\0\0\0\23t\30\n\0\0\0\0"..., 4096) = 4096
1677366815.394317 read(5, "\0\0\355\22\0\0\0\0~u\247\n\0\0\0\0\0\377\355\22\0\0\0\0d\4\250\n\0\0\0\0"..., 4096) = 1768
1677366815.394678 close(5)              = 0

We can see the gzi file being opened, the bytes being read into the bgzip program’s buffer, and the file finally being closed. Each operation is preceded by a UTC timestamp that we can use to calculate how much time (in seconds) the program spends operating on the file:

$ gzitime () {
   timestamps=( `sed -rn '/openat\(.*\.gzi\"/,/close/p' $1 |
    awk -v ORS=" " 'NR == 1 { print $1; } END { print $1; }'` )
   bc <<< "${timestamps[1]} - ${timestamps[0]}"
}

A short explanation of the above bash function: we open the log file and find all the lines between the opening and closing of the gzi file, as before. Then we use awk to grab the first and last timestamps and compute the time between those operations (which covers all the file reads) via the basic calculator (bc) shell command.

$ gzitime strace.log

7.2 milliseconds doesn’t seem like a lot, but do that again and again over 3000 independent runs and it adds up to more than 20 seconds, which explains what our bash-looped bgzip program is spending over half of its time doing.

Compare this to our Rust program:

$ stracet bgzread ca29_soil2_aa_sub.fasta.gz <( echo "112093709 1149" ) > seq.fa
$ gzitime strace.log

Not only does this read the index a bit faster than bgzip; it also only runs once!

Indexing the Locations of Sequences in Compressed fasta Files

Clearly, we’ve already gotten some significant and actionable performance gains by implementing our own tool in Rust. This is useful, but it turns out that generating the location file for the header positions in a block-gzipped fasta file is an even bigger bottleneck than reading locations.

Could we also turn that python script into a Rust command line tool?

I won’t show the whole program (so please don’t try to compile this at home!) but here is a snippet to give you a flavor:

// bgzindex snippet
while let Ok(num) = reader.read_line(&mut line) {
    if num == 0 {
        break; // EOF
    }

    // Suppressed code: updating block_i //
    // Get current block offset from global index //
    block_offset = bgzf_index[block_i];

    if line.starts_with('>') {
        if size > 0 {
            // Write the size of the previously found record
            write_size(&mut writer, size);
        }

        // We found a new record so reset the size counter
        size = 0;

        // Write the current record name
        write_name(&mut writer, &line[1..(num - 1)]); // subset to trim out '>' & newline

        // Write the current position
        let virtual_pos =
            block_offset + (reader.virtual_position().uncompressed() as u64) - (num as u64);
        write_pos(&mut writer, virtual_pos);
    }

    size += num;
    line.clear(); // Clear the line buffer for the next line
}

In this program, we need to keep track of three pieces of data: the sequence name (from the fasta header); the start position of the uncompressed sequence record (pos); and the size (the number of uncompressed bytes following the start position). We have three convenience functions whose bodies I’m not showing – write_name, write_pos and write_size – but these simply handle writing the output in the correct three-column format.

To summarize the program, whenever we encounter a line that starts with a > character we know we’ve hit a new record. So, we’ll write out the size of the previous record, if one exists yet. Then, we’ll write out the new name – the whole header line except the start token and the final newline. The size (the number of uncompressed characters) of each sequence record is also easy – we just need to keep a running sum of the sizes of all the lines associated with that record.

Now the tricky part: we need to calculate the start position. In the python version of this code, we uncompressed the entire file first, so we could just use one global position tracker. In the Rust implementation, however, we want to take advantage of the block format. So we’ll use the BGZip index (stored in the gzi file) to calculate a global offset: the uncompressed start of the current block (block_offset) plus the reader’s virtual_position within that block. We just need to keep track of which block we’re in – that’s the block_i indexing variable – and use a local copy of bgzf_index, a vector of the uncompressed block start positions read directly from the gzi file.

Finally, since the reader.read_line function leaves our cursor at the end of the line, we need to subtract the line’s length, num, to get the start position.
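To make the arithmetic concrete, here is the same calculation with a few toy numbers in Python (block_starts stands in for the uncompressed offsets loaded from the gzi file; the other values are made up for illustration):

```python
# Uncompressed start offsets of the gzip blocks, as read from the .gzi file
block_starts = [0, 65280, 130560]

block_i = 1          # we are somewhere inside the second block
within_block = 542   # reader's uncompressed offset just *after* reading the header line
line_len = 117       # length of the header line we just read, newline included

# Global uncompressed start position of the header we just consumed
record_start = block_starts[block_i] + within_block - line_len
print(record_start)  # 65705
```

Because each term is local to the current block, no global cursor needs to be shared between threads.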

Let’s see how this program does:

$ time bgzindex ca29_soil2_aa_sub.fasta.gz ca29_soil2_aa_sub.fasta.loc
real    0m14.650s
user    0m2.745s
sys 0m1.328s

That’s a very nice speedup over the Python script (which took over a minute).

We can even try to parallelize the indexing over multiple threads:

$ time bgzindex ca29_soil2_aa_sub.fasta.gz ca29_soil2_aa_sub.fasta.loc 2
real    0m12.395s
user    0m2.976s
sys 0m0.966s

That’s a 6x speedup all told! Additional cores don’t seem to help much beyond two, however.

And of course the location file looks the same as before:

$ head -n 2 ca29_soil2_aa_sub.fasta.loc && echo "..." && tail -n 2 ca29_soil2_aa_sub.fasta.loc
ca29_soil2_c51cf9d4-216c-4963-8f9b-65ee6b197a07_1   0   116
ca29_soil2_c51cf9d4-216c-4963-8f9b-65ee6b197a07_4   116 448
...
ca29_soil2_99a091d5-f352-4e66-8a50-3c48d2070870_19  324714018   187
ca29_soil2_e966a3b0-d674-47bb-b8b4-6db071c3e3f4_5   324714205   456

Our new bgzindex has some details that let it take advantage of multithreading. Primarily, we use the gzi (block gzip index) file to keep track of the reader’s current virtual position relative to the current block. This adds a bit of overhead up front (we need to read and store the block index data), but it means that we always know our global position from within the block we are reading.

This is important because:

  • We can ditch the global position tracker needed in the Python version of the script, which frees us up to do multithreaded gzip block decompression.
  • We avoid annoying the Rust borrow checker by trying to do something like changing the value of the reader after it has been “moved” (e.g. read into the buffer).

Let’s try this out on a real-word hard case, indexing the locations of the amino acid sequences in the entire ca29 library.

Nextflow pipeline screenshot of indexing

A 4.5x speedup! We tried to parallelize as much as possible, so we overshot the mark on how many CPUs bgzindex needs (and overpaid for an instance in this case).

Concluding Remarks

Overall, I’ve been impressed at how easy it has been for me to get up running on Rust. Thanks to reliable, community-supported packages like noodles, I’ve been writing high-performing code after just a few months of learning the basics from “the book”.

I’ve been able to write code that is easy to read, simple to package up and distribute, and has allowed us to crunch through our massive datasets even faster and more reliably.

(Feature photo by Joshua Sortino on Unsplash)

Optimizing Enzyme Expression and Performance with Zymtronix

Optimizing enzymes for Zymtronix’s cell-free manufacturing platform

Today we are pleased to announce our partnership with Zymtronix, a developer of cell-free process technologies. Together we aim to optimize enzymes used in Zymtronix’s proprietary cell-free platform for the production of important ingredients in food, agriculture, cosmetics and pharmaceuticals.

Enzymatic biocatalysis is a powerful manufacturing technology that can enable the production of a wide range of chemicals and molecules. Zymtronix’s cell-free platform is designed to solve challenges associated with traditional biocatalysis and seeks to enable the production of a wide range of products with precision and productivity. By partnering with us to build and produce bioengineered market-ready enzymes, Zymtronix anticipates being able to extend its solutions into the pharma, nutrition, agriculture markets, among others.

Zymtronix aims to leverage Ginkgo Enzyme Services to discover, optimize, and produce enzymes

Ginkgo Enzyme Services offers partners end-to-end support for the discovery, optimization, and production of enzymes for diverse applications. Through the partnership, we will leverage our suite of enzyme services to engineer enzymes for Zymtronix’s applications using metagenomic enzyme discovery as well as improve enzyme expression and production host performance.

We’re thrilled to welcome Zymtronix to the platform and support their applications in sustainable ingredients and beyond. We’ve built out our platform to serve a wide variety of enzyme discovery, engineering, optimization and scale up efforts, and we’re so excited for the work to come in this partnership. Zymtronix’s cell-free biomanufacturing platform is pioneering solutions for various industries, and we’re eager to leverage our end-to-end capabilities and help expand its efforts in transforming the way enzymes are used.

“This partnership will greatly accelerate our work of bringing the precision and scalability of cell-free biomanufacturing and sustainable ingredients to market starting with alternatives to animal sources; Ginkgo is uniquely able to support us with both enzyme engineering and strain expression, helping us continue to accelerate commercialization,” said Stéphane Corgié, CEO-CTO and founder, Zymtronix. “We hope to extend this partnership in the future to facilitate the production of multiple end-market products.”

To learn more about Ginkgo Enzyme Services, please visit ginkgobioworks.com/enzyme-services/.

Find the full press release here along with all of the latest news from the Ginkgo team.

What will you grow with Ginkgo?

Rapid Data Retrieval from Very Large Files


A core challenge for bioinformatics in metagenomics is that many standard bioinformatics tools are not built to handle datasets the size of our metagenomic libraries. At the same time, the continued rapid evolution of the field and the development of new methods require that we maintain compatibility with the standard bioinformatics working environment – in other words, if we want to keep up with the state of the field, we must be able to provide command-line access to standard file types (FASTA, GFF, GBK, etc.) generated from our data. We also need to be able to subset our data quickly and easily. For example, while we may need or want to operate on the whole library when performing homology searches, once we’ve completed that search, we’ll often want to work with just a subset of the search results for further analysis. To summarize, we need our entire library accessible in a commonly accepted format, the ability to rapidly subset our data for additional downstream processing, and this full set of capabilities available within something that looks like a standard Linux command-line environment. Finally, experience has shown that the closer we hew to standard Linux-based tools, the easier it is for our scientists and bioinformaticians to work with our data, the lower the support burden, and the faster we can iterate on methods.

Plainly stated, we need a fast and easy way to store and extract subsets of data from extremely large text files. Let’s investigate some options.


We’ll be relying on the time command to give us the time a program takes to run, top for memory, and pv to provide us the speed at which a file is being read. time and top are standard Linux tools, and pv is available via the pv package in most distributions.


The specific problem we’re tackling today is providing fast access and retrieval of gene annotations from a GFF file covering an entire metagenomic library. The GFF format was chosen because it’s simple, supports multiple sequences in one file, allows us to store and retrieve just the gene annotations, is widely supported by standard tooling, and can be converted into other formats as needed.

The CA29 Library

We’ll be working with one library for the duration of this post – the CA29 soil bacterial library. We’ve chosen this one because it’s one of our largest libraries – if we’ve got a solution that works for CA29, we’ve got a solution that works for everything. If you’d like more information about our libraries, our recent paper Precision Discovery of Novel Inhibitors of Cancer Target HsMetAP1 from Vast Metagenomic Diversity includes details on their construction and use.

Some quick stats on the library, for reference:

  • Sequences: ~150M
  • Annotations: ~1.6B
  • Uncompressed Size: 235Gb
  • Compressed Size: 24Gb

Note that depending on the context below I may use either GiB or Gb, depending on what the tool I’m using puts out. In both cases, the term refers to 2^30, or 1024^3, bytes.


Our goal here is to be able to easily retrieve the gene calls for an individual sequence from our combined library. We’d like to do this quickly enough for it to be practical as part of other pipelines, which we’ll define as “less than a minute to retrieve the GFF for a sequence,” and we’d also like to keep the computational resources down to an arbitrarily-chosen limit of 2 CPUs and 4Gb of memory, as this will fit inside almost any process we run and stands a very low probability of bankrupting the company.



The standard tool for retrieval of subsets of data from GFF files is tabix, which is part of the htslib package. Tabix works by creating an index file that maps sequence IDs to their positions in the compressed GFF file, and then using this index to provide rapid lookup and retrieval by sequence ID or sequence ID and location. Tabix, however, suffers from two major limitations for our use case – first, as of the latest version, Tabix has a bug which causes it to crash if the total length of all of the sequence IDs in the GFF file it is indexing exceeds 2^31 bytes. We use UUIDs for sequence names, and so this imposes a limit of around 50 million sequences, which is on the low side for one of our libraries. Second, Tabix’s memory use is extremely high for large datasets – when searching a 12Gb compressed GFF file, Tabix requires 32Gb of memory.

Not all’s lost, though. Under the hood, Tabix uses another utility, BGZip, to facilitate its quick retrieval. BGZip turns out to be a real gem, both in its use and construction.


BGZip makes use of a quirk in the widely-used and supported gzip file format – that you can concatenate multiple gzipped files together and the result is still a valid gzip archive. BGZip takes this to an extreme – when compressing a file, it splits the file into small chunks, gzips them independently, and then generates an index file mapping locations from the uncompressed file into the compressed file.

[            long_file.gff             ]
[ part 1 ][ part 2 ][ part 3 ][ part 4 ]
        ↓ Compressed and Indexed ↓
[a][b][c][d] + { part 1: a; part 2: b; ...}

This gives BGZip two neat tricks. One, which we’ll get back to later, is that it can compress and decompress multiple chunks at a time, since each chunk is fixed size and independent of its surroundings. The second is that it can retrieve subsections of the compressed file almost instantly:

# Relative size of the gzipped and uncompressed files:
$ ls -alh
total 259G
drwxrwxr-x 2 ubuntu ubuntu 4.0K Oct  4 21:31 .
drwxr-xr-x 9 ubuntu root   4.0K Oct  4 21:25 ..
-rw-rw-r-- 1 ubuntu ubuntu 235G Oct  4 21:35 ca29.gff
-rwxr-xr-x 1 ubuntu ubuntu  24G Oct  4 21:29 ca29.gff.gz
-rwxr-xr-x 1 ubuntu ubuntu  59M Oct  4 21:29 ca29.gff.gz.gzi

# ('.gzi' is bgzip's index - note it's only 60M for our 24Gb file)

# Retrieving a specific section of the uncompressed file, using `tail` and `head`:

$ time tail -c +251262670684 ca29.gff | head -c 60
##sequence-region fffff5f2-0bcb-4560-8399-c86c632969b8 1 125
real	0m0.001s
user	0m0.002s
sys	0m0.000s

# Retrieving a specific section of the compressed file using bgzip - note
# the start and size are the same as the uncompressed reads above:

$ time bgzip -b 251262670684 -s 60 ca29.gff.gz
#sequence-region fffff5f2-0bcb-4560-8399-c86c632969b8 1 1257
real	0m0.054s
user	0m0.025s
sys	0m0.029s

# Retrieval time does not scale with the size of the chunk retrieved:

$ time bgzip -b 251262670683 -s 62000000 ca29.gff.gz > /dev/null

real	0m0.054s
user	0m0.019s
sys	0m0.035s

So we’ve got the first part of our solution – nearly instantaneous retrieval of a subset of an excessively large gzipped file based on start location and size. By adding a sequence location index next to our GFF file with the sequence ID, start location, and size of the associated entries, we can get almost instant sequence retrieval:

# Our location index file:

$ head ca29.gff.loc.tsv
seq_id	offset	length
00000020-3ae5-4305-8de8-1bfceaa81f1c	16	1276
00000035-1d98-4718-960c-e1c0f9fda8c9	1292	937
00000057-0ac5-467f-a37f-3e64cf9928e8	2229	937
0000005c-da49-428c-95e5-342515731bc0	3166	937
00000076-c423-45c5-86b1-52fed94afbbf	4103	937
0000008a-7735-43f7-95fa-64ab84372ca1	5040	1617
000000f9-d5ca-4723-876d-b9942873d8cb	6657	602
0000015b-02d1-4c22-8b2f-8e723961f5fb	7259	939
00000197-45d6-4d40-8ab6-102b3e0000e8	8198	937

# Grab a random sequence from the end:

$ tail -n50 ca29.gff.loc.tsv | head -n1
fffffb2f-878d-4b23-a82f-573fbc1fad5f	251262762041	935

# Retrieve the gene calls from the compressed GFF:

$ time bgzip -b 251262762041 -s 935 ca29.gff.gz
##sequence-region fffffb2f-878d-4b23-a82f-573fbc1fad5f 1 1093
fffffb2f-878d-4b23-a82f-573fbc1fad5f	annotation	remark	1	1093	.	.	.	molecule_type=DNA;topology=linear
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	region	1	1093	.	.	.	ID=fffffb2f-878d-4b23-a82f-573fbc1fad5f
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	CDS	1	87	.	-	0	ID=fffffb2f-878d-4b23-a82f-573fbc1fad5f_1;gene_id=fffffb2f-878d-4b23-a82f-573fbc1fad5f_1;gene_number=1;name=fffffb2f-878d-4b23-a82f-573fbc1fad5f_1
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	exon	1	87	.	-	.	gene_id=fffffb2f-878d-4b23-a82f-573fbc1fad5f_1;gene_number=1
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	CDS	212	1093	.	-	0	ID=fffffb2f-878d-4b23-a82f-573fbc1fad5f_2;gene_id=fffffb2f-878d-4b23-a82f-573fbc1fad5f_2;gene_number=2;name=fffffb2f-878d-4b23-a82f-573fbc1fad5f_2
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	exon	212	1093	.	-	.	gene_id=fffffb2f-878d-4b23-a82f-573fbc1fad5f_2;gene_number=2

real	0m0.059s
user	0m0.029s
sys	0m0.025s

Using grep, we can retrieve the locations for an individual sequence id:

$ grep fffffb2f-878d-4b23-a82f-573fbc1fad5f ca29.gff.loc.tsv
fffffb2f-878d-4b23-a82f-573fbc1fad5f	251262762041	935

$ bgzip -b 251262762041 -s 935 ca29.gff.gz
##sequence-region fffffb2f-878d-4b23-a82f-573fbc1fad5f 1 1093
fffffb2f-878d-4b23-a82f-573fbc1fad5f	annotation	remark	1	1093	.	.	.	molecule_type=DNA;topology=linear
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	region	1	1093	.	.	.	ID=fffffb2f-878d-4b23-a82f-573fbc1fad5f
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	CDS	1	87	.	-	0	ID=fffffb2f-878d-4b23-a82f-573fbc1fad5f_1;gene_id=fffffb2f-878d-4b23-a82f-573fbc1fad5f_1;gene_number=1;name=fffffb2f-878d-4b23-a82f-573fbc1fad5f_1
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	exon	1	87	.	-	.	gene_id=fffffb2f-878d-4b23-a82f-573fbc1fad5f_1;gene_number=1
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	CDS	212	1093	.	-	0	ID=fffffb2f-878d-4b23-a82f-573fbc1fad5f_2;gene_id=fffffb2f-878d-4b23-a82f-573fbc1fad5f_2;gene_number=2;name=fffffb2f-878d-4b23-a82f-573fbc1fad5f_2
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	exon	212	1093	.	-	.	gene_id=fffffb2f-878d-4b23-a82f-573fbc1fad5f_2;gene_number=2

We can even do this as a one-liner, if we’re clever:

# Isn't bash grand?
$ bgzip $( grep fffffb2f-878d-4b23-a82f-573fbc1fad5f ca29.gff.loc.tsv | awk '{ print "-b", $2, "-s", $3 }' ) ca29.gff.gz
##sequence-region fffffb2f-878d-4b23-a82f-573fbc1fad5f 1 1093
fffffb2f-878d-4b23-a82f-573fbc1fad5f	annotation	remark	1	1093	.	.	.	molecule_type=DNA;topology=linear
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	region	1	1093	.	.	.	ID=fffffb2f-878d-4b23-a82f-573fbc1fad5f
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	CDS	1	87	.	-	0	ID=fffffb2f-878d-4b23-a82f-573fbc1fad5f_1;gene_id=fffffb2f-878d-4b23-a82f-573fbc1fad5f_1;gene_number=1;name=fffffb2f-878d-4b23-a82f-573fbc1fad5f_1
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	exon	1	87	.	-	.	gene_id=fffffb2f-878d-4b23-a82f-573fbc1fad5f_1;gene_number=1
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	CDS	212	1093	.	-	0	ID=fffffb2f-878d-4b23-a82f-573fbc1fad5f_2;gene_id=fffffb2f-878d-4b23-a82f-573fbc1fad5f_2;gene_number=2;name=fffffb2f-878d-4b23-a82f-573fbc1fad5f_2
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	exon	212	1093	.	-	.	gene_id=fffffb2f-878d-4b23-a82f-573fbc1fad5f_2;gene_number=2

So, knowing nothing except the sequence ID we want to find and the library it’s in, we can quickly and efficiently grab all the gene calls for that sequence ID as a GFF file.
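That grep-plus-awk plumbing can be tucked into a tiny helper function – `loc_args` is our own name for this sketch, not part of any tool:

```shell
# Print the bgzip -b/-s arguments for a sequence ID, given a location TSV:
loc_args() {
    grep -F "$1" "$2" | awk '{ print "-b", $2, "-s", $3 }'
}

# Usage against the real files would look like:
#   bgzip $( loc_args fffffb2f-878d-4b23-a82f-573fbc1fad5f ca29.gff.loc.tsv ) ca29.gff.gz
```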

But there’s a catch – our location file is big! It contains one line for every sequence in the GFF file:

$ ls -alh ca29.gff.loc.tsv
-rw-rw-r-- 1 ubuntu ubuntu 7.8G Oct  4 22:02 ca29.gff.loc.tsv

So let’s dig into some options for dealing with this file.

Compressing the Index File

We now have an index file formatted as follows:

seq_id  offset  length
00000020-3ae5-4305-8de8-1bfceaa81f1c    16  1276
00000035-1d98-4718-960c-e1c0f9fda8c9    1292    937

Note this file is extremely dense – we have the sequence ID, offset, and length, and very little else. Consequently, the file doesn’t compress particularly well:

# Uncompressed:
$ du -hs ca29.gff.loc.tsv
7.8G	ca29.gff.loc.tsv

# Compressed:
$ du -hs ca29.gff.loc.tsv.gz
3.7G	ca29.gff.loc.tsv.gz

There’s just not a lot of slack to take out here. We could do something similar to BGZip and use a binary format, as opposed to just a standard ASCII file, and we might get some savings there, but then we’d need to build and distribute a toolchain for dealing with our proprietary index format, so that’s off the table for now.
Note also that the uncompressed index is comparable to the compressed GFF itself – 8Gb vs 24Gb, or about a third the size. We’d prefer to keep the compressed version. However, we quickly hit a problem: finding individual sequences from the gzipped file is a whole lot slower than the raw text format:

# Uncompressed:
$ time cat ca29.gff.loc.tsv | grep fffffb2f-878d-4b23-a82f-573fbc1fad5f
fffffb2f-878d-4b23-a82f-573fbc1fad5f	251262762041	935

real	0m4.004s
user	0m2.328s
sys	0m4.776s

# Compressed:
$ time zcat ca29.gff.loc.tsv.gz | grep fffffb2f-878d-4b23-a82f-573fbc1fad5f
fffffb2f-878d-4b23-a82f-573fbc1fad5f	251262762041	935

real	1m0.754s
user	1m0.947s
sys	0m5.050s

Yowza! A minute to look up a single sequence in our compressed location file? What’s going on?
Let’s pull out pv and see where our bottleneck is. pv is a lovely little tool that works roughly like cat, but will show how fast it’s passing data along. Linux pipes affect our speed, so we’ll throw cat in between pv and grep just to make things fair:

# Uncompressed:
$ time pv ca29.gff.loc.tsv | cat - | grep fffffb2f-878d-4b23-a82f-573fbc1fad5f
7.73GiB 0:00:03 [1.95GiB/s] [==========================================>] 100%
fffffb2f-878d-4b23-a82f-573fbc1fad5f	251262762041	935

real	0m3.745s
user	0m2.486s
sys	0m2.580s

# Compressed
$ time pv ca29.gff.loc.tsv.gz | zcat - | grep fffffb2f-878d-4b23-a82f-573fbc1fad5f
3.62GiB 0:01:01 [60.3MiB/s] [==========================================>] 100%
fffffb2f-878d-4b23-a82f-573fbc1fad5f	251262762041	935

real	1m1.473s
user	1m1.319s
sys	0m5.835s

Decompressing our compressed archive slows us down from nearly 2GiB/s to a piddling 60MiB/s. Our gzipped location file is half the size, but takes 30x as long to retrieve sequences from. That’s not great. The difference reduces a bit if we decide to search for multiple sequence IDs:

# 10 sequence IDs to search for
$ cat seq_ids.txt
fcdd2245-17b1-4a7a-9f67-1a34802aefd3
fce7cd9b-8cf9-41f7-a4a2-04a7814e41b6
fd55e51f-6773-4eb4-871d-4174e99547a9
fd581f3d-6ce0-4895-8b38-083c091a4729
fd7fae42-f746-4934-b71e-d738d8da31bc
fdd86a15-1848-4a2b-be05-f9d62a0ee538
feed9c3a-9831-4c42-81a9-d072f728c9f2
fef21bda-479c-445b-affd-d04761795174
ff9fa0a4-7042-45bc-bf70-c502a1e49522
ffc97fde-cf41-4440-b50b-86abad46f415

# Uncompressed:
$ time pv ca29.gff.loc.tsv | cat - | grep --file seq_ids.txt
fcdd2245-17b1-4a7a-9f67-1a34802aefd3	248192112609	943
fce7cd9b-8cf9-41f7-a4a2-04a7814e41b6	248232660866	597
fd55e51f-6773-4eb4-871d-4174e99547a9	248655071744	597
fd581f3d-6ce0-4895-8b38-083c091a4729	248663484363	1276
fd7fae42-f746-4934-b71e-d738d8da31bc	248815519201	2291
fdd86a15-1848-4a2b-be05-f9d62a0ee538	249153912987	937
feed9c3a-9831-4c42-81a9-d072f728c9f2	250214540699	597
fef21bda-479c-445b-affd-d04761795174	250231619577	599
ff9fa0a4-7042-45bc-bf70-c502a1e49522	250893965638	941
ffc97fde-cf41-4440-b50b-86abad46f415	251054447078	937
7.73GiB 0:00:10 [ 781MiB/s] [==========================================>] 100%

real	0m10.132s
user	0m8.352s
sys	0m6.810s

# Compressed:
$ time pv ca29.gff.loc.tsv.gz | zcat - | grep --file seq_ids.txt > /dev/null
3.62GiB 0:01:02 [58.8MiB/s] [==========================================>] 100%

real	1m2.997s
user	1m8.761s
sys	0m5.691s

Not a big difference for our compressed reads – that’s obviously bottlenecked on decompression – but our uncompressed reads drop to around 800MiB/s, and it takes about 10s to get our results. This is probably as fast as we’re going to get with the index file approach, so let’s keep this as a benchmark.

First, though, let’s try a very different approach.

Using Sqlite3

What if we went with a database instead? SQLite works fine as a command-line utility, and it should certainly be faster than grepping through a text file – and indeed it is:

# The SQLite DB:
$ sqlite3 ca29.gff.sqlite3
SQLite version 3.38.2 2022-03-26 13:51:10
Enter ".help" for usage hints.
sqlite> .schema
CREATE TABLE location (contig_id TEXT PRIMARY KEY, offset INTEGER, length INTEGER);
sqlite> select * from location limit 5;
00000020-3ae5-4305-8de8-1bfceaa81f1c|16|1276
00000035-1d98-4718-960c-e1c0f9fda8c9|1292|937
00000057-0ac5-467f-a37f-3e64cf9928e8|2229|937
0000005c-da49-428c-95e5-342515731bc0|3166|937
00000076-c423-45c5-86b1-52fed94afbbf|4103|937

# Single sequence retrieval is almost instantaneous
$ time sqlite3 -tabs ca29.gff.sqlite3 'select * from location where contig_id = "fef21bda-479c-445b-affd-d04761795174"'
fef21bda-479c-445b-affd-d04761795174	250231619577	599

real	0m0.002s
user	0m0.002s
sys	0m0.000s

# Multiple sequence retrieval takes about the same time
$ time sqlite3 -tabs ca29.gff.sqlite3 "select * from location where contig_id in ('fef21bda-479c-445b-affd-d04761795174','fdd86a15-1848-4a2b-be05-f9d62a0ee538','fcdd2245-17b1-4a7a-9f67-1a34802aefd3','ffc97fde-cf41-4440-b50b-86abad46f415','ff9fa0a4-7042-45bc-bf70-c502a1e49522','feed9c3a-9831-4c42-81a9-d072f728c9f2','fce7cd9b-8cf9-41f7-a4a2-04a7814e41b6','fd7fae42-f746-4934-b71e-d738d8da31bc','fd55e51f-6773-4eb4-871d-4174e99547a9','fd581f3d-6ce0-4895-8b38-083c091a4729')"
fcdd2245-17b1-4a7a-9f67-1a34802aefd3	248192112609	943
fce7cd9b-8cf9-41f7-a4a2-04a7814e41b6	248232660866	597
fd55e51f-6773-4eb4-871d-4174e99547a9	248655071744	597
fd581f3d-6ce0-4895-8b38-083c091a4729	248663484363	1276
fd7fae42-f746-4934-b71e-d738d8da31bc	248815519201	2291
fdd86a15-1848-4a2b-be05-f9d62a0ee538	249153912987	937
feed9c3a-9831-4c42-81a9-d072f728c9f2	250214540699	597
fef21bda-479c-445b-affd-d04761795174	250231619577	599
ff9fa0a4-7042-45bc-bf70-c502a1e49522	250893965638	941
ffc97fde-cf41-4440-b50b-86abad46f415	251054447078	937

real	0m0.002s
user	0m0.002s
sys	0m0.000s

Great! The SQLite3 database lookup is almost immediate! So what’s the problem? Well:

$ du -hs ca29.gff.sqlite3
16G	ca29.gff.sqlite3

The problem is the database is on the order of 2/3rds the size of the compressed file it’s searching. This might be a fine trade-off in some cases – after all, it’s providing nearly instantaneous lookups, which might be worth the storage space:

$ time bgzip $( sqlite3 -tabs ca29.gff.sqlite3 'select * from location where contig_id = "fffffb2f-878d-4b23-a82f-573fbc1fad5f"' | awk '{ print "-b", $2, "-s", $3 }' ) ca29.gff.gz
##sequence-region fffffb2f-878d-4b23-a82f-573fbc1fad5f 1 1093
fffffb2f-878d-4b23-a82f-573fbc1fad5f	annotation	remark	1	1093	.	.	.	molecule_type=DNA;topology=linear
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	region	1	1093	.	.	.	ID=fffffb2f-878d-4b23-a82f-573fbc1fad5f
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	CDS	1	87	.	-	0	ID=fffffb2f-878d-4b23-a82f-573fbc1fad5f_1;gene_id=fffffb2f-878d-4b23-a82f-573fbc1fad5f_1;gene_number=1;name=fffffb2f-878d-4b23-a82f-573fbc1fad5f_1
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	exon	1	87	.	-	.	gene_id=fffffb2f-878d-4b23-a82f-573fbc1fad5f_1;gene_number=1
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	CDS	212	1093	.	-	0	ID=fffffb2f-878d-4b23-a82f-573fbc1fad5f_2;gene_id=fffffb2f-878d-4b23-a82f-573fbc1fad5f_2;gene_number=2;name=fffffb2f-878d-4b23-a82f-573fbc1fad5f_2
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	exon	212	1093	.	-	.	gene_id=fffffb2f-878d-4b23-a82f-573fbc1fad5f_2;gene_number=2

real	0m0.058s
user	0m0.043s
sys	0m0.017s

But that’s an awfully big trade, especially since our previous best of just grep-ing the raw text file for a single result was around 4 seconds. We’re not using this for an inline service, so the difference between 4 seconds and “immediate” really isn’t much for us. Let’s see if we can get any better from our index file.

Back to the Filesystem

So to recap, we’ve got an 8Gb uncompressed TSV file we can compress down to 4Gb, but then we see search times on the order of a minute or so for a set of sequence IDs.

Once again, taking a look at our timings, it’s clear the decompression time is our bottleneck. Taking grep out of the picture entirely makes this really clear:

# No pipes at all, just dump the file as fast as we can
$ time pv ca29.gff.loc.tsv > /dev/null
7.73GiB 0:00:01 [7.61GiB/s] [==========================================>] 100%

real	0m1.017s
user	0m0.016s
sys	0m1.000s

# Pipe to cat - notice we cut our speed roughly in half right
# off the bat if we want to consume our data:
$ time pv ca29.gff.loc.tsv | cat - > /dev/null
7.73GiB 0:00:01 [4.38GiB/s] [==========================================>] 100%

real	0m1.765s
user	0m0.030s
sys	0m2.838s

# Decompressing the gzipped file is what's killing us, though:
$ time pv ca29.gff.loc.tsv.gz | zcat - > /dev/null
3.62GiB 0:00:59 [62.4MiB/s] [==========================================>] 100%

real	0m59.334s
user	0m57.868s
sys	0m1.973s

Let’s add grep back in, just so we’ve got a picture of where we’re trying to get:

# Grep for our set of 10 sequence IDs
$ time pv ca29.gff.loc.tsv | grep --file seq_ids.txt > /dev/null
7.73GiB 0:00:09 [ 832MiB/s] [==========================================>] 100%

real	0m9.511s
user	0m8.164s
sys	0m2.644s

So grep itself is limited to about 800MiB/s when searching for multiple patterns.

A Quick Diversion into Alternate Greps

There are a couple alternatives to grep that promise faster searches – let’s take a quick look at ag and ripgrep to see what they offer:

# RipGrep:
$ time pv ca29.gff.loc.tsv | rg --file ./seq_ids.txt  > /dev/null
7.73GiB 0:00:18 [ 422MiB/s] [==========================================>] 100%

real	0m18.735s
user	0m15.821s
sys	0m10.066s

# AG, the "silver searcher" - AG does not support a patterns file:
$ time pv ca29.gff.loc.tsv | ag -F 'fef21bda-479c-445b-affd-d04761795174'
fef21bda-479c-445b-affd-d04761795174	250231619577	599
7.73GiB 0:00:20 [ 380MiB/s] [==========================================>] 100%

real	0m20.781s
user	0m15.630s
sys	0m19.217s

Both tools are half the speed of grep, and ag doesn’t support supplying multiple queries in a file. Generally, both of these tools are built to rapidly search a large tree of files, whereas we’re asking them to search one big file – as always, us humble bioinformaticians are left out in the cold. We’re not going to get saved by a different search utility – it looks like we’re stuck with grep.

Back to BGZip

So our bottleneck is decompression – remember earlier when I mentioned BGZip’s two big tricks? The multi-threaded decompression didn’t matter for us much before, since we were seeking specifically what we needed, but now we’re searching the whole file – let’s see if BGZip’s got something up its sleeve for us. As a reminder, here’s standard zcat:

# Standard zcat:
$ time pv ca29.gff.loc.tsv.gz | zcat - > /dev/null
3.61GiB 0:01:03 [58.2MiB/s] [==========================================>] 100%

real	1m3.615s
user	1m2.497s
sys	0m1.620s

And here’s bgzip with no threading, after we recompress the locations file using bgzip:

$ time pv ca29.gff.loc.tsv.gz | bgzip -c -d - > /dev/null
3.62GiB 0:00:24 [ 153MiB/s] [==========================================>] 100%

real	0m24.102s
user	0m22.340s
sys	0m1.819s

Already we’re nearly three times as fast – because BGZip can interleave reading and decompression and make some better assumptions about the source data, we’re not bottlenecked quite as badly on decompression. But let’s see what happens when we actually use the parallel decompression:

# Two threads
$ time pv ca29.gff.loc.tsv.gz | bgzip --threads 2 -c -d > /dev/null
3.62GiB 0:00:11 [ 320MiB/s] [==========================================>] 100%

real	0m11.573s
user	0m23.952s
sys	0m2.653s

# Four threads:
$ time pv ca29.gff.loc.tsv.gz | bgzip --threads 4 -c -d > /dev/null
3.62GiB 0:00:05 [ 640MiB/s] [==========================================>] 100%

real	0m5.788s
user	0m23.714s
sys	0m2.449s

# Eight threads:
$ time pv ca29.gff.loc.tsv.gz | bgzip --threads 8 -c -d > /dev/null
3.62GiB 0:00:02 [1.23GiB/s] [==========================================>] 100%

real	0m2.953s
user	0m24.297s
sys	0m2.621s

# 16 threads:
$ time pv ca29.gff.loc.tsv.gz | bgzip --threads 16 -c -d > /dev/null
3.62GiB 0:00:02 [1.49GiB/s] [==========================================>] 100%

real	0m2.439s
user	0m29.208s
sys	0m2.514s

These are enormous speedups in decompression, especially compared to the naive zcat approach – BGZip is doing some really fantastic work with threading here. You can see an almost linear increase in read speed right up through 8 threads, although we start to drop off above that – coordination takes its toll, eventually.

Let’s see what this looks like combined with grep. We’ll use 4 threads, and we’ll take pv out of the picture:

$ time bgzip --threads 4 -c -d ca29.gff.loc.tsv.gz | grep --file seq_ids.txt > /dev/null

real	0m10.091s
user	0m33.097s
sys	0m5.670s

10 seconds – that’s roughly what we were getting with the uncompressed search. BGZip’s not as fast as just reading an uncompressed file, but it’s fast enough to keep up with grep – decompression is no longer our bottleneck. BGZip is fast enough that we can keep our compressed index file with no speed sacrifice.

A Couple More Grep Tricks

Two more notes on grep to close out our testing.

First, we’ve been searching for multiple results in our index file, since that’s a relatively normal use case for us, but that actually imposes a pretty steep penalty on grep:

$ time pv ca29.gff.loc.tsv | grep "fef21bda-479c-445b-affd-d04761795174" > /dev/null
7.73GiB 0:00:03 [1.96GiB/s] [==========================================>] 100%

real	0m3.944s
user	0m2.644s
sys	0m2.616s

Fortunately, the real world effects here are pretty limited – BGZip is still fast enough that, roughly, this doesn’t matter:

$ time bgzip --threads 8 -c -d ca29.gff.loc.tsv.gz | grep --file just_one_seq_id.txt > /dev/null

real	0m4.138s
user	0m29.531s
sys	0m5.396s

One change we can make, though, that does make a difference is telling grep when to stop. We’re looking for exactly one line per sequence, so we shouldn’t ever get more lines than we have in our input. The -m flag tells grep to stop after a given number of matches:

# Grab a sequence ID towards the front of our file:
$ head -n100 ca29.gff.loc.tsv | tail -n1
00000c54-06f6-438b-9474-c56d4ff130b2	196289	932

# Grep without a hint of our count:
$ time cat ca29.gff.loc.tsv | grep "00000c54-06f6-438b-9474-c56d4ff130b2" > /dev/null

real	0m3.134s
user	0m0.026s
sys	0m4.138s

# Tell grep how many results to expect:
$ time cat ca29.gff.loc.tsv | grep -m1 "00000c54-06f6-438b-9474-c56d4ff130b2" > /dev/null

real	0m0.002s
user	0m0.000s
sys	0m0.002s
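The early-exit effect is easy to reproduce on synthetic data – a small sketch using seq, nothing from our dataset:

```shell
# A million numbered lines; the match we want is near the top, on line 2:
seq 1000000 > nums.txt

# Without -m, grep keeps scanning to the end of the file even after matching:
grep '^2$' nums.txt
# → 2

# With -m1, grep stops reading the instant it finds the first match:
grep -m1 '^2$' nums.txt
# → 2
```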

We’ve been playing a bit of a trick with our timings so far, because we’ve been using the most pathological case we can – 10 sequence IDs all towards the end of the file, which means grep needs to search the entire file to find our results. In more realistic use cases, we’d have a more normally distributed set of sequence IDs, and adding the -m flag to our searches will make a much bigger difference. The takeaway here is that the 10 seconds we saw above could be considered our worst-case scenario – at worst, it will take 10 seconds to retrieve a set of results from our file. We can live with that.


Final Layout

Our final layout for our library includes four files:

  • ca29.gff.gz (24Gb) – The full BGZip-compressed GFF file
  • ca29.gff.gz.gzi (59Mb) – BGZip’s index for the GFF file
  • ca29.gff.loc.tsv.gz (3.7Gb) – The BGZip-compressed sequence location TSV file
  • ca29.gff.loc.tsv.gz.gzi (2.0Mb) – BGZip’s index for the location file

The total space for all four files is around 28Gb, or around 12% the size of our original uncompressed GFF library. Retrieving a single sequence can be performed in one line of bash with only the sequence ID and the library in a matter of seconds:

$ time bgzip $( bgzip -c -d --threads 4 ./ca29.gff.loc.tsv.gz | grep -m1 -F "78ef59ad-94b9-4944-9a3e-e1c171fc0df5" | awk '{ print "-b", $2, "-s", $3 }' ) ca29.gff.gz > 78ef59ad-94b9-4944-9a3e-e1c171fc0df5.gff

real	0m2.803s
user	0m12.415s
sys	0m2.399s

$ cat 78ef59ad-94b9-4944-9a3e-e1c171fc0df5.gff
##sequence-region 78ef59ad-94b9-4944-9a3e-e1c171fc0df5 1 1620
78ef59ad-94b9-4944-9a3e-e1c171fc0df5	annotation	remark	1	1620	.	.	.	molecule_type=DNA;topology=linear
78ef59ad-94b9-4944-9a3e-e1c171fc0df5	feature	region	1	1620	.	.	.	ID=78ef59ad-94b9-4944-9a3e-e1c171fc0df5
78ef59ad-94b9-4944-9a3e-e1c171fc0df5	feature	CDS	2	424	.	-	0	ID=78ef59ad-94b9-4944-9a3e-e1c171fc0df5_1;gene_id=78ef59ad-94b9-4944-9a3e-e1c171fc0df5_1;gene_number=1;name=78ef59ad-94b9-4944-9a3e-e1c171fc0df5_1
78ef59ad-94b9-4944-9a3e-e1c171fc0df5	feature	exon	2	424	.	-	.	gene_id=78ef59ad-94b9-4944-9a3e-e1c171fc0df5_1;gene_number=1
78ef59ad-94b9-4944-9a3e-e1c171fc0df5	feature	CDS	435	1619	.	+	0	ID=78ef59ad-94b9-4944-9a3e-e1c171fc0df5_2;gene_id=78ef59ad-94b9-4944-9a3e-e1c171fc0df5_2;gene_number=2;name=78ef59ad-94b9-4944-9a3e-e1c171fc0df5_2
78ef59ad-94b9-4944-9a3e-e1c171fc0df5	feature	exon	435	1619	.	+	.	gene_id=78ef59ad-94b9-4944-9a3e-e1c171fc0df5_2;gene_number=2

Closing Remarks

Our original goal was to facilitate easy, fast, and cheap extraction of gene calls for a specific sequence from a library-level dataset. BGZip, part of the Samtools/HTSLib package, provides blazing fast indexed access into arbitrarily large gzip-compatible compressed files, with a very small space penalty for its index file. In addition, BGZip’s threaded compression and decompression, along with some other optimizations, allow us to retrieve specific sequence locations from a compressed text location file as quickly as if the file were not compressed. The combination of these two features gives us the ability to retrieve information for one or more sequences from our dataset in a matter of seconds.

Taken together, this allows us to keep our gene calls available on disk in a common bioinformatics format and to work with those files using standard *nix tools. Within our pipelines, we can go from a list of sequence IDs to the gene calls for those individual sequences in seconds. The approach generalizes to other file types as well – any file format in which the data for an individual item is written contiguously in the file and in which a single item can be read individually – such as FASTAs, JSONL, and others – is amenable to this approach.
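For instance, the same offset/length bookkeeping works on a JSONL file – a toy sketch with made-up records (remember that tail -c is 1-based, so we seek to offset+1):

```shell
# Two 17-byte records (16 characters plus newline); the second record
# therefore starts at uncompressed byte offset 17:
printf '%s\n' '{"id":"a","v":1}' '{"id":"b","v":2}' > demo.jsonl

# Pull record "b" out by offset and length alone:
tail -c +18 demo.jsonl | head -c 17
# → {"id":"b","v":2}
```

Against a BGZipped copy of the same file, `bgzip -b 17 -s 17` would perform the equivalent read without decompressing the rest.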

In our next post, we will explore even faster ways to retrieve data from very large files!

(Feature photo by Joshua Sortino on Unsplash)

Biobased Alternatives to Synthetic Polymers with Bioweg

Developing cost-effective biobased materials as clean alternatives to synthetic polymers

Today, we’re announcing a new collaboration with Bioweg, a producer of highly functional and customizable biobased materials. Our partnership aims to optimize the production of bacterial cellulose and to produce novel variants of cellulose with improved performance to serve a variety of end markets.

Bioweg’s products are made of biodegradable bacterial cellulose and have already been tested and implemented by companies as an effective substitute for widely used synthetic polymers such as acrylates, polyethylene, and polystyrene. Synthetic polymers often appear as microbeads (i.e., micropowders) and texturants (i.e., Rheology modifiers) in products throughout cosmetics, homecare, personal care, agricultural coatings, and other industries, which contribute to microplastic pollution in waters worldwide. It is estimated that an average person could be ingesting about 5 grams of plastic each week (PDF) through the consumption of common foods and beverages. These microplastics are non-biodegradable and may carry toxic chemicals. Regulatory agencies and communities around the world are starting to regulate microplastics contamination. Just last year, the European Chemical Agency announced phasing out microbeads in ‘rinse-off’ and ‘leave-on’ cosmetics.

Leveraging our strain engineering and screening capabilities to deliver biobased solutions at scale

“Consumers and companies are united in their commitment to finding better performing and more sustainable alternatives for everyday products to break the chain of microplastic pollution. Our solutions are not just tackling a major environmental, sustainability and health problem, but also present a robust market opportunity to replace plastic polymers in care, coatings, chemicals, and other industries,” said Prateek Mahalwar, CEO at Bioweg. “We believe Ginkgo’s strain engineering and screening capabilities can enable us to deliver our biobased solutions at scale and competitive pricing.”

Bioweg is addressing a significant need in the marketplace to develop and produce a new generation of clean alternatives to synthetic polymers. We are committed to supporting the shift to sustainable and biobased high-performance alternatives and are thrilled to be working with Bioweg to address the pressing issues of microplastics contamination and promote responsible consumption.

Find the full press release here along with all of the latest news from the Ginkgo team.

What will you grow with Ginkgo?

Engineering Bacteria for Cancer Patient Treatment with Prokarium

Partnership with Prokarium to discover multiple targets for RNA therapeutics and immuno-oncology

Today, we’re announcing a new partnership with Prokarium – a biopharmaceutical company pioneering the oncology field of microbial immunotherapy – to develop a bactofection platform to deliver RNA-based therapeutics.

Most gene therapies today use viral delivery systems, which may have limited utility due to toxicity and targeting issues. Bactofection, the process of transferring genetic material into a mammalian cell via a bacterium, is an alternative gene delivery system that could deliver therapeutic agents to a patient. In bactofection, the naturally occurring tumor-colonizing characteristics of bacterial species, such as Salmonella, can be modified via genetic manipulation and harnessed to be a targeted delivery vehicle for various therapeutic payloads.

Developing bactofection platform to treat cancer patients

In collaboration with Prokarium, Ginkgo will set out to engineer a Salmonella-based bactofection platform for the delivery of RNA payloads to treat cancer patients, building on existing capabilities in RNA therapeutics, viral-based gene therapy and bacterial therapeutics.

“By leveraging Ginkgo’s microbial and mammalian foundry capabilities, we are building a highly versatile and innovative bactofection platform to support delivery of novel modalities from the ground up,” said Kristen Albright, CEO at Prokarium. “Through this partnership, we are working to unlock a new generation of immuno-oncology therapeutics.”

Advancements in immunotherapy such as checkpoint inhibitors and CAR-T have transformed the treatment of certain cancers, with tremendous progress across novel modalities continuing to build. With Prokarium, we will work to develop a bactofection platform that leverages the convergence of these advancements in immuno-oncology, gene therapy, RNA therapeutics and bacterial therapeutics. We are  excited to bolster our work with leading companies like Prokarium, which can utilize our services to innovate and expand what is possible with novel therapeutics.

Find the full press release here along with all of the latest news from the Ginkgo team.

What will you grow with Ginkgo?

Joining BDO Zone Strategic Alliance

As a partner in the technology group, Ginkgo will help de-risk and accelerate project development in BDO zones

Ginkgo is pleased to announce that we’ve joined the Bioeconomy Development Opportunity (BDO) Zone Strategic Alliance as a partner in the Technology Group.

BDO Zone Strategic Alliance Partners include some of the leading companies in the bioenergy industry that help de-risk biobased project development in BDO Zones.

Our vision of a thriving bioeconomy is aligned with the BDO Zone Initiative’s own goal of driving, accelerating and catalyzing biobased investment and commercial project development in communities across the US. As the Initiative aims to create 1,000 BDO Zones, we are thrilled to contribute efforts towards de-risking bio-based investments. By joining the Initiative, we also have the unique opportunity to better understand and address local customer needs for synthetic biology applications, while ultimately driving more value for biorefineries via the application of our synbio platform.

“The BDO Zone Initiative welcomes Ginkgo Bioworks as the newest addition to the growing Strategic Alliance,” said Jordan Solomon, President of Ecostrat. “Ginkgo’s robust technology, capabilities and connections in the biomanufacturing value chain will be fundamental to accelerating the bio-based economy in areas such as connecting feedstock supply, financial investment, bio-technology, process design, and construction to grow bio-manufacturing capacity.”

American biomass provides an economic development engine with the potential to create 160,000 jobs and nearly $15 billion of economic benefit. In Canada, the national economic impact potential is estimated at 16,060 direct, indirect, and induced jobs and over $1.48 billion annually of economic benefit. The BDO Zone Initiative can help North America realize this potential by supporting new market development for bioenergy, advanced biofuels, biobased heat and power, bio-materials, and clean hydrogen.

Find all the latest news from the Ginkgo team here.

What will you grow with Ginkgo?

The Enzyme Intelligence Virtual Event

Enzymes are having a moment

These days, there’s a lot going on in the enzyme space. Machine learning (ML) is changing the way we think about the previously hard problems of modeling protein structures and predicting protein functions. Enzyme-driven industries like pharmaceutical manufacturing, diagnostics, and industrial green chemistry are looking forward to new and better enzymes made possible by ML.

At Ginkgo, we love technologies that make biology easier to engineer. We’ve built an ML-guided enzyme engineering stack that draws on our Foundry’s established strengths in assembling smart DNA libraries, engineering organisms, and screening in (ultra) high throughput. The result is the new Ginkgo Enzyme Services offering that enables our customers to discover new enzymes and optimize existing ones. We’re super excited to share the results being delivered.

So let’s take a moment to talk about enzymes

On December 15th, Ginkgo hosted the Enzyme Intelligence Virtual Event. Dr. Emily Wrenbeck, Head of Protein Engineering, presented to an audience of more than 450 about how Ginkgo’s Foundry combines ML with massive experimental datasets to enable enzyme engineering.

The presentation was structured around 4 case studies:

  • Enzyme Discovery. An ML-guided search of a large metagenomic library led Ginkgo to an unexpected corner of sequence space, producing dozens of candidate enzymes and opening up IP options for the customer.
  • Better Activity. A customer’s project-critical enzyme had a reputation for being recalcitrant to engineering. Iterative protein design cycles produced a 10-fold improvement in activity over wild type as measured by kcat/KM.
  • Better Specificity. A naturally sourced enzyme made 4 undesired byproducts, with the desired product only 10% of the total. Ginkgo tested over 10,000 variants with ultra high throughput GC-MS to bring the product specificity to over 75%
  • Better Biocatalysts. A customer’s enzyme suffered from a low S:H (synthesis:hydrolysis) ratio. In a single round of engineering, Ginkgo was able to improve both enzyme speed and specificity under industrially relevant process conditions.

A recurring theme of the event was the value added by integrating ML tools with the broader capabilities of Ginkgo’s Foundry. An engineering cycle that brings together DNA assembly, assay development, high-throughput screening, computational tools, and human experts is required to consistently deliver improved enzyme performance.

Far from replacing data, ML makes data more valuable. The massive biological datasets Ginkgo can generate were key to the success of these enzyme engineering programs.

It’s time for end-to-end Enzyme Services

Following the technical presentation, Ginkgo’s Business Development team took the stage to answer questions about Ginkgo’s latest offering. Building on powerful ML-guided enzyme engineering capabilities, Ginkgo Enzyme Services offers end-to-end solutions for R&D leaders looking for better enzymes.

  • Novel Enzyme Discovery. Ginkgo’s large in-house metagenomic libraries are combined with ultra high throughput screening to identify new enzymes with a desired activity. These enzymes are often unrelated to previously described candidates, opening up new IP options for customers.
  • Function & Stability Optimization. ML-guided protein design is combined with a range of sequence-based, structure-based, and classical enzymology tools to improve a candidate enzyme’s activity, stability, specificity or expression level.
  • Robust Host Strains. Ginkgo customers have access to proprietary bacterial and fungal host strains optimized for protein production, including methanol-free Pichia pastoris and low viscosity Aspergillus niger.
  • Fermentation Process Development. When it comes to commercial-scale enzyme production, Ginkgo offers automated DOE (Design of Experiments) in Ambr® 250 bioreactors to optimize the production process.
  • Manufacturing Scale-Up & Tech Transfer. Ginkgo can perform pilot-scale fermentation up to 3,000 L. Customers are supported in the development of downstream processing and purification methods. Ginkgo can transfer processes to a manufacturing partner, including for GMP production.

Make the most of this moment for enzymes

A key theme of the Enzyme Services portion of the event was the flexibility of Ginkgo’s offering. As a horizontal platform, Ginkgo supports customers through the entire enzyme R&D process, from discovery to development and deployment. This versatility is essential to serving the diverse needs of customers across a range of enzyme-powered industries, including animal feed, biocatalysis, diagnostics, green chemistry, metabolic engineering, mRNA manufacturing, waste remediation, and more!

A recording of the Enzyme Intelligence Virtual Event is available here. Ready to engage Ginkgo’s Foundry for your project? Have questions? Contact [email protected]


Launching Ginkgo Enzyme Services

Enzyme Services to Enable Applications across Pharmaceuticals and Diagnostics, Food and Agriculture, and Beyond

Today we are very pleased to announce the launch of Ginkgo Enzyme Services. Powered by ultra high throughput screening, machine learning-guided protein design, and optimized proprietary bacterial and fungal host strains, Ginkgo Enzyme Services solves challenges for R&D teams developing enzymes, from the discovery of novel enzyme activity through optimization of enzyme function and large-scale manufacturing.

A virtual event on Dec. 15 will give an overview of Ginkgo’s Enzyme Intelligence approach to machine learning-guided enzyme engineering

Enzymes are valuable biocatalysts used across a wide range of industries including in the production of medicines, food, materials, and beyond. Our end-to-end Enzyme Services support R&D leaders looking to identify new enzyme activity to replace existing chemical synthesis steps, improve enzyme specificity, activity, and stability in industrially relevant conditions, and optimize the manufacturing of enzymes for reduced cost of goods and environmental impact.

We’ve supported enzyme R&D programs across a wide range of industries, including biopharma manufacturing and discovery. Notable enzyme services projects include breakthrough work with Aldevron to improve the manufacturing efficiency of vaccinia capping enzyme, a critical reagent used in the production of mRNA vaccines, and a recently announced partnership with Merck to develop biocatalysts for active pharmaceutical ingredient (API) manufacturing.

Our suite of services covers the full end-to-end process for enzyme R&D, providing synergies between enzyme sequence, host strain, and manufacturing processes that can enable commercial success.

Ginkgo Enzyme Services includes:

  • Novel Enzyme Discovery
  • Enzyme Function & Stability Optimization
  • Access to Optimized Host Strains for Robust Expression
  • Optimized Fermentation Process Development
  • Manufacturing Scale-Up, Process Development & Tech Transfer

“Most R&D teams working on developing enzymes expect to need to stitch together a bunch of different services and tools, both in-house and external to make their enzymes work,” said Jake Janey, PhD, a pioneer in the field of biocatalysis. “The ability for Ginkgo to guide the process all the way, providing many intermediate touchpoints with data and prototypes for development and decision making, is a great value-add.”

At Ginkgo, we are constantly working to improve our platform to provide best-in-class services to enable our customers to meet their R&D challenges head on. As a horizontal platform, we have the flexibility, breadth, and scale to serve customers across biocatalysis, diagnostics, and beyond with the full spectrum of tools they need to discover, develop, and deploy enzymes for their processes and products.

With world class expression hosts, automation, computational design, and fermentation capabilities, we are excited to partner with companies across all industries to bring their enzyme-dependent products from conception to market more swiftly and reliably than ever before.

Virtual event details:

Join us for Enzyme Intelligence, a virtual event featuring Ginkgo’s head of protein engineering, Emily Wrenbeck, on December 15, 2022. Sign up here.

Learn more about Ginkgo Enzyme Services at our webpage, or write to us at [email protected].

Find the full press release here along with all of the latest news from the Ginkgo team.


Celebrating Cronos Group’s First-of-a-Kind CBC Product

Cronos Unveils its New CBC Product, the Spinach FEELZ™ Day Trip Gummies with THC+CBC, Developed on the Ginkgo Platform

We’re so excited to congratulate our customer Cronos – an innovative global cannabinoid company – for launching a CBC-focused product, the Spinach FEELZ™ THC+CBC Day Trip Mango Lime gummies, utilizing our platform for organism design and development.

The Spinach FEELZ™ Day Trip gummies are the first CBC gummy product in Canada and the first of its kind to feature a 3:1 ratio of CBC to THC. The product is currently available in Alberta and British Columbia and will be rolled out to additional provinces over the coming weeks:

  • SPINACH FEELZ™ THC+CBC DAY TRIP GUMMIES: Grab your bag, your friends and get going! A new day’s adventure awaits with Spinach FEELZ™ Day Trip gummies. From sun-up to sun-down, feel at ease and in tune with all the scents, sights, and sounds this glorious world has to offer. These one-of-a-kind THC+CBC gummies are packed with delicious mango-lime flavors and are sure to make for good times with friends. Five sour-then-sweet gummies with 10mg of THC and 30mg of CBC per pack.


Optimizing Production of Biobased Speciality Chemicals with Lygos

Strain development program aims to accelerate efforts to create sustainable chemicals

Today we’re pleased to announce our multi-product research and development collaboration with Lygos. This partnership is designed to optimize and scale production of sustainable specialty ingredients that can replace toxic petrochemistry, reclaim local manufacturing, and advance industrial bio-innovation.

Together, we plan to advance two research and development programs over the course of approximately two years. Lygos’ organic acid targets are used to produce biodegradable formulations and polymer-based products used in consumer, agricultural and industrial markets. We at Ginkgo will leverage our extensive expertise in strain development and metabolic engineering as we work to rapidly design and optimize microorganisms that can convert low-cost sugar to high-value chemicals. This in turn can provide a more sustainable alternative to the traditional industrial chemicals that are made from petroleum-derived feedstock.

Creating sustainable biomanufacturing solutions

Many specialty organic acids rely on environmentally damaging and costly production processes. We’re excited to be working with Lygos to replace some of those harmful production methods with sustainable, environmentally friendly biomanufacturing processes. Ginkgo is committed to creating solutions that are better for the planet, and this partnership with Lygos will help advance our initiatives around climate sustainability while supporting domestic manufacturing and technology development.

“Lygos and Ginkgo have been at the forefront of this field, enabling customers to create sustainable products that can help solve the world’s environmental challenges and improve everyday products,” said Eric Steen, PhD, CEO at Lygos. “The new partnership will enable us to augment our development timelines, allocate more research and development, and accelerate our product commercialization programs.”

Find the full press release here along with all of the latest news from the Ginkgo team.
