Background
A core challenge in metagenomics is that many standard bioinformatics tools are not built to handle datasets the size of our metagenomic libraries. At the same time, the continued rapid evolution of the field and the development of new methods require that we maintain compatibility with the standard bioinformatic working environment – in other words, if we want to keep up with the state of the field, we must be able to provide command-line access to standard file types (FASTA, GFF, GBK, etc.) generated from our data. We also need to be able to subset our data quickly and easily. For example, while we may need or want to operate on the whole library when performing homology searches, once we’ve completed that search, we’ll often want to work with just a subset of the search results for further analysis. In short, we need our entire library accessible in a commonly accepted format, the ability to rapidly subset our data for additional downstream processing, and this full set of capabilities available within something that looks like a standard Linux command-line environment. Finally, experience has shown that the closer we hew to standard Linux-based tools, the easier it is for our scientists and bioinformaticians to work with our data, the lower the support burden, and the faster we can iterate on methods.
Plainly stated, we need a fast and easy way to store and extract subsets of data from extremely large text files. Let’s investigate some options.
Tools
We’ll be relying on the time command to give us the time a program takes to run, top for memory, and pv to show us the speed at which a file is being read. time and top are standard Linux tools, and pv is available via the pv package in most distributions.
Setup
The specific problem we’re tackling today is providing fast access and retrieval of gene annotations from a GFF file covering an entire metagenomic library. The GFF format was chosen because it’s simple, supports multiple sequences in one file, allows us to store and retrieve just the gene annotations, is widely supported by standard tooling, and can be converted into other formats as needed.
The CA29 Library
We’ll be working with one library for the duration of this post – the CA29 soil bacterial library. We’ve chosen this one because it’s one of our largest libraries – if we’ve got a solution that works for CA29, we’ve got a solution that works for everything. If you’d like more information about our libraries, our recent paper Precision Discovery of Novel Inhibitors of Cancer Target HsMetAP1 from Vast Metagenomic Diversity includes details on their construction and use.
Some quick stats on the library, for reference:
- Sequences: ~150M
- Annotations: ~1.6B
- Uncompressed Size: 235Gb
- Compressed Size: 24Gb
Note that I may use either GiB or Gb below, depending on what the tool I’m using puts out. In both cases, the term refers to 2^30, or 1024^3, bytes.
Goals
Our goal here is to be able to easily retrieve the gene calls for an individual sequence from our combined library. We’d like to do this quickly enough for it to be practical as part of other pipelines, which we’ll define as “less than a minute to retrieve the GFF for a sequence,” and we’d also like to keep the computational resources down to an arbitrarily-chosen limit of 2 CPUs and 4Gb of memory, as this will fit inside almost any process we run and stands a very low probability of bankrupting the company.
Approaches
Tabix
The standard tool for retrieval of subsets of data from GFF files is tabix, which is part of the htslib package. Tabix works by creating an index file that maps sequence IDs to their positions in the compressed GFF file, and then using this index to provide rapid lookup and retrieval by sequence ID or sequence ID and location. Tabix, however, suffers from two major limitations for our use case. First, as of the latest version, Tabix has a bug which causes it to crash if the total length of all of the sequence IDs in the GFF file it is indexing exceeds 2^31 bytes. We use UUIDs for sequence names, so this imposes a limit of around 50 million sequences, which is on the low side for one of our libraries. Second, Tabix’s memory use is extremely high for large datasets – when searching a 12Gb compressed GFF file, Tabix requires 32Gb of memory.
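For reference, here’s what the standard Tabix workflow looks like when it does fit – a sketch using a hypothetical small.gff and a made-up sequence ID, since CA29 trips both limitations above. Tabix requires the GFF to be coordinate-sorted and BGZip-compressed before indexing:

# A sketch of standard tabix usage on a hypothetical small.gff - not run
# against CA29, for the reasons above:
$ (grep '^#' small.gff; grep -v '^#' small.gff | sort -k1,1 -k4,4n) | bgzip > small.gff.gz
$ tabix -p gff small.gff.gz          # writes the index, small.gff.gz.tbi
$ tabix small.gff.gz seq42:1-5000    # retrieve annotations by sequence ID and range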
All’s not lost, though. Under the hood, Tabix uses another utility, BGZip, to facilitate its quick retrieval. BGZip turns out to be a real gem, both in its use and construction.
BGZip
BGZip makes use of a quirk in the widely-used and supported gzip file format – that you can concatenate multiple gzipped files together and the result is still a valid gzip archive. BGZip takes this to an extreme – when compressing a file, it splits the file into small chunks, gzips them independently, and then generates an index file mapping locations from the uncompressed file into the compressed file.
[              long_file.gff              ]
                    ↓
[ part 1 ][ part 2 ][ part 3 ][ part 4 ]
                    ↓
          Compressed and Indexed
                    ↓
[a][b][c][d]  +  { part 1: a; part 2: b; ... }
This gives BGZip two neat tricks. One, which we’ll get back to later, is that it can compress and decompress multiple chunks at a time, since each chunk is fixed size and independent of its surroundings. The second is that it can retrieve subsections of the compressed file almost instantly:
# Relative size of the gzipped and uncompressed files:
$ ls -alh
total 259G
drwxrwxr-x 2 ubuntu ubuntu 4.0K Oct  4 21:31 .
drwxr-xr-x 9 ubuntu root   4.0K Oct  4 21:25 ..
-rw-rw-r-- 1 ubuntu ubuntu 235G Oct  4 21:35 ca29.gff
-rwxr-xr-x 1 ubuntu ubuntu  24G Oct  4 21:29 ca29.gff.gz
-rwxr-xr-x 1 ubuntu ubuntu  59M Oct  4 21:29 ca29.gff.gz.gzi
# ('.gzi' is bgzip's index - note it's only 60M for our 24Gb file)

# Retrieving a specific section of the uncompressed file, using `tail` and `head`:
$ time tail -c +251262670684 ca29.gff | head -c 60
##sequence-region fffff5f2-0bcb-4560-8399-c86c632969b8 1 125

real    0m0.001s
user    0m0.002s
sys     0m0.000s

# Retrieving a specific section of the compressed file using bgzip - note
# the start and size are the same as the uncompressed reads above:
$ time bgzip -b 251262670684 -s 60 ca29.gff.gz
#sequence-region fffff5f2-0bcb-4560-8399-c86c632969b8 1 1257

real    0m0.054s
user    0m0.025s
sys     0m0.029s

# Retrieval time does not scale with the size of the chunk retrieved:
$ time bgzip -b 251262670683 -s 62000000 ca29.gff.gz > /dev/null

real    0m0.054s
user    0m0.019s
sys     0m0.035s
So we’ve got the first part of our solution – nearly instantaneous retrieval of a subset of an excessively large gzipped file based on start location and size. By adding a sequence location index next to our GFF file with the sequence ID, start location, and size of the associated entries, we can get almost instant sequence retrieval:
# Our location index file:
$ head ca29.gff.loc.tsv
seq_id	offset	length
00000020-3ae5-4305-8de8-1bfceaa81f1c	16	1276
00000035-1d98-4718-960c-e1c0f9fda8c9	1292	937
00000057-0ac5-467f-a37f-3e64cf9928e8	2229	937
0000005c-da49-428c-95e5-342515731bc0	3166	937
00000076-c423-45c5-86b1-52fed94afbbf	4103	937
0000008a-7735-43f7-95fa-64ab84372ca1	5040	1617
000000f9-d5ca-4723-876d-b9942873d8cb	6657	602
0000015b-02d1-4c22-8b2f-8e723961f5fb	7259	939
00000197-45d6-4d40-8ab6-102b3e0000e8	8198	937

# Grab a random sequence from the end:
$ tail -n50 ca29.gff.loc.tsv | head -n1
fffffb2f-878d-4b23-a82f-573fbc1fad5f	251262762041	935

# Retrieve the gene calls from the compressed GFF:
$ time bgzip -b 251262762041 -s 935 ca29.gff.gz
##sequence-region fffffb2f-878d-4b23-a82f-573fbc1fad5f 1 1093
fffffb2f-878d-4b23-a82f-573fbc1fad5f	annotation	remark	1	1093	.	.	.	molecule_type=DNA;topology=linear
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	region	1	1093	.	.	.	ID=fffffb2f-878d-4b23-a82f-573fbc1fad5f
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	CDS	1	87	.	-	0	ID=fffffb2f-878d-4b23-a82f-573fbc1fad5f_1;gene_id=fffffb2f-878d-4b23-a82f-573fbc1fad5f_1;gene_number=1;name=fffffb2f-878d-4b23-a82f-573fbc1fad5f_1
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	exon	1	87	.	-	.	gene_id=fffffb2f-878d-4b23-a82f-573fbc1fad5f_1;gene_number=1
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	CDS	212	1093	.	-	0	ID=fffffb2f-878d-4b23-a82f-573fbc1fad5f_2;gene_id=fffffb2f-878d-4b23-a82f-573fbc1fad5f_2;gene_number=2;name=fffffb2f-878d-4b23-a82f-573fbc1fad5f_2
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	exon	212	1093	.	-	.	gene_id=fffffb2f-878d-4b23-a82f-573fbc1fad5f_2;gene_number=2

real    0m0.059s
user    0m0.029s
sys     0m0.025s
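As an aside, here’s roughly how an index like this can be built in the first place – a minimal awk sketch, not necessarily our production tooling, assuming each sequence’s records are contiguous and introduced by a ##sequence-region line:

# A minimal sketch of building the location index. Offsets here are 0-based
# byte positions, which is what bgzip -b expects:
$ awk 'BEGIN { OFS = "\t"; print "seq_id", "offset", "length" }
    /^##sequence-region/ {
        if (id != "") print id, start, offset - start   # flush the previous entry
        id = $2; start = offset
    }
    { offset += length($0) + 1 }                        # +1 for the newline
    END { if (id != "") print id, start, offset - start }' \
    ca29.gff > ca29.gff.loc.tsv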
Using grep, we can retrieve the locations for an individual sequence ID:
$ grep fffffb2f-878d-4b23-a82f-573fbc1fad5f ca29.gff.loc.tsv
fffffb2f-878d-4b23-a82f-573fbc1fad5f	251262762041	935

$ bgzip -b 251262762041 -s 935 ca29.gff.gz
##sequence-region fffffb2f-878d-4b23-a82f-573fbc1fad5f 1 1093
fffffb2f-878d-4b23-a82f-573fbc1fad5f	annotation	remark	1	1093	.	.	.	molecule_type=DNA;topology=linear
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	region	1	1093	.	.	.	ID=fffffb2f-878d-4b23-a82f-573fbc1fad5f
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	CDS	1	87	.	-	0	ID=fffffb2f-878d-4b23-a82f-573fbc1fad5f_1;gene_id=fffffb2f-878d-4b23-a82f-573fbc1fad5f_1;gene_number=1;name=fffffb2f-878d-4b23-a82f-573fbc1fad5f_1
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	exon	1	87	.	-	.	gene_id=fffffb2f-878d-4b23-a82f-573fbc1fad5f_1;gene_number=1
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	CDS	212	1093	.	-	0	ID=fffffb2f-878d-4b23-a82f-573fbc1fad5f_2;gene_id=fffffb2f-878d-4b23-a82f-573fbc1fad5f_2;gene_number=2;name=fffffb2f-878d-4b23-a82f-573fbc1fad5f_2
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	exon	212	1093	.	-	.	gene_id=fffffb2f-878d-4b23-a82f-573fbc1fad5f_2;gene_number=2
We can even do this as a one-liner, if we’re clever:
# Isn't bash grand?
$ bgzip $( grep fffffb2f-878d-4b23-a82f-573fbc1fad5f ca29.gff.loc.tsv | awk '{ print "-b", $2, "-s", $3 }' ) ca29.gff.gz
##sequence-region fffffb2f-878d-4b23-a82f-573fbc1fad5f 1 1093
fffffb2f-878d-4b23-a82f-573fbc1fad5f	annotation	remark	1	1093	.	.	.	molecule_type=DNA;topology=linear
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	region	1	1093	.	.	.	ID=fffffb2f-878d-4b23-a82f-573fbc1fad5f
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	CDS	1	87	.	-	0	ID=fffffb2f-878d-4b23-a82f-573fbc1fad5f_1;gene_id=fffffb2f-878d-4b23-a82f-573fbc1fad5f_1;gene_number=1;name=fffffb2f-878d-4b23-a82f-573fbc1fad5f_1
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	exon	1	87	.	-	.	gene_id=fffffb2f-878d-4b23-a82f-573fbc1fad5f_1;gene_number=1
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	CDS	212	1093	.	-	0	ID=fffffb2f-878d-4b23-a82f-573fbc1fad5f_2;gene_id=fffffb2f-878d-4b23-a82f-573fbc1fad5f_2;gene_number=2;name=fffffb2f-878d-4b23-a82f-573fbc1fad5f_2
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	exon	212	1093	.	-	.	gene_id=fffffb2f-878d-4b23-a82f-573fbc1fad5f_2;gene_number=2
So, knowing nothing except the sequence ID we want to find and the library it’s in, we can quickly and efficiently grab all the gene calls for that sequence ID as a GFF file.
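If we’re doing this a lot, the one-liner is worth wrapping up – a hypothetical convenience function, nothing we actually ship:

# A hypothetical helper wrapping the one-liner above:
extract_gff() {
    local seq_id=$1 lib=$2
    bgzip $( grep -F "$seq_id" "$lib.loc.tsv" \
             | awk '{ print "-b", $2, "-s", $3 }' ) "$lib.gz"
}

# Usage:
$ extract_gff fffffb2f-878d-4b23-a82f-573fbc1fad5f ca29.gff > one_sequence.gff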
But there’s a catch – our location file is big! It contains one line for every sequence in the GFF file:
$ ls -alh ca29.gff.loc.tsv
-rw-rw-r-- 1 ubuntu ubuntu 7.8G Oct  4 22:02 ca29.gff.loc.tsv
So let’s dig into some options for dealing with this file.
Compressing the Index File
We now have an index file formatted as follows:
seq_id	offset	length
00000020-3ae5-4305-8de8-1bfceaa81f1c	16	1276
00000035-1d98-4718-960c-e1c0f9fda8c9	1292	937
Note this file is extremely dense – nearly every byte is part of a random UUID or a unique offset, with very little redundancy. Consequently, the file doesn’t compress particularly well:
# Uncompressed:
$ du -hs ca29.gff.loc.tsv
7.8G	ca29.gff.loc.tsv

# Compressed:
$ du -hs ca29.gff.loc.tsv.gz
3.7G	ca29.gff.loc.tsv.gz
There’s just not a lot of slack to take out here. We could do something similar to BGZip and use a binary format, as opposed to just a standard ASCII file, and we might get some savings there, but then we’d need to build and distribute a toolchain for dealing with our proprietary index format, so that’s off the table for now.
Note also that the uncompressed index is comparable to the compressed GFF itself – 8Gb vs 24Gb, or about a third the size. We’d prefer to keep the compressed version. However, we quickly hit a problem: finding individual sequences from the gzipped file is a whole lot slower than the raw text format:
# Uncompressed:
$ time cat ca29.gff.loc.tsv | grep fffffb2f-878d-4b23-a82f-573fbc1fad5f
fffffb2f-878d-4b23-a82f-573fbc1fad5f	251262762041	935

real    0m4.004s
user    0m2.328s
sys     0m4.776s

# Compressed:
$ time zcat ca29.gff.loc.tsv.gz | grep fffffb2f-878d-4b23-a82f-573fbc1fad5f
fffffb2f-878d-4b23-a82f-573fbc1fad5f	251262762041	935

real    1m0.754s
user    1m0.947s
sys     0m5.050s
Yowza! A minute to look up a single sequence in our compressed location file? What’s going on?
Let’s pull out pv and see where our bottleneck is. pv is a lovely little tool that works roughly like cat, but will show how fast it’s passing data along. Linux pipes affect our speed, so we’ll throw cat in between pv and grep just to make things fair:
# Uncompressed:
$ time pv ca29.gff.loc.tsv | cat - | grep fffffb2f-878d-4b23-a82f-573fbc1fad5f
7.73GiB 0:00:03 [1.95GiB/s] [==========================================>] 100%
fffffb2f-878d-4b23-a82f-573fbc1fad5f	251262762041	935

real    0m3.745s
user    0m2.486s
sys     0m2.580s

# Compressed
$ time pv ca29.gff.loc.tsv.gz | zcat - | grep fffffb2f-878d-4b23-a82f-573fbc1fad5f
3.62GiB 0:01:01 [60.3MiB/s] [==========================================>] 100%
fffffb2f-878d-4b23-a82f-573fbc1fad5f	251262762041	935

real    1m1.473s
user    1m1.319s
sys     0m5.835s
Decompressing our compressed archive slows us down from nearly 2GiB/s to a piddling 60MiB/s. Our gzipped location file is half the size, but reads 30x slower. That’s not great. The difference narrows a bit if we search for multiple sequence IDs:
# 10 sequence IDs to search for
$ cat seq_ids.txt
fef21bda-479c-445b-affd-d04761795174
fdd86a15-1848-4a2b-be05-f9d62a0ee538
fcdd2245-17b1-4a7a-9f67-1a34802aefd3
ffc97fde-cf41-4440-b50b-86abad46f415
ff9fa0a4-7042-45bc-bf70-c502a1e49522
feed9c3a-9831-4c42-81a9-d072f728c9f2
fce7cd9b-8cf9-41f7-a4a2-04a7814e41b6
fd7fae42-f746-4934-b71e-d738d8da31bc
fd55e51f-6773-4eb4-871d-4174e99547a9
fd581f3d-6ce0-4895-8b38-083c091a4729

# Uncompressed:
$ time pv ca29.gff.loc.tsv | cat - | grep --file seq_ids.txt
fcdd2245-17b1-4a7a-9f67-1a34802aefd3	248192112609	943
fce7cd9b-8cf9-41f7-a4a2-04a7814e41b6	248232660866	597
fd55e51f-6773-4eb4-871d-4174e99547a9	248655071744	597
fd581f3d-6ce0-4895-8b38-083c091a4729	248663484363	1276
fd7fae42-f746-4934-b71e-d738d8da31bc	248815519201	2291
fdd86a15-1848-4a2b-be05-f9d62a0ee538	249153912987	937
feed9c3a-9831-4c42-81a9-d072f728c9f2	250214540699	597
fef21bda-479c-445b-affd-d04761795174	250231619577	599
ff9fa0a4-7042-45bc-bf70-c502a1e49522	250893965638	941
ffc97fde-cf41-4440-b50b-86abad46f415	251054447078	937
7.73GiB 0:00:10 [ 781MiB/s] [==========================================>] 100%

real    0m10.132s
user    0m8.352s
sys     0m6.810s

# Compressed:
$ time pv ca29.gff.loc.tsv.gz | zcat - | grep --file seq_ids.txt > /dev/null
3.62GiB 0:01:02 [58.8MiB/s] [==========================================>] 100%

real    1m2.997s
user    1m8.761s
sys     0m5.691s
Not a big difference for our compressed reads – that’s obviously bottlenecked on decompression – but our uncompressed reads drop to around 800MiB/s, and it takes about 10s to get our results. This is probably as fast as we’re going to get with the index file approach, so let’s keep this as a benchmark.
First, though, let’s try a very different approach.
Using SQLite3
What if we went with a database instead? SQLite works fine as a command-line utility, and it should certainly be faster than grepping through a text file – and indeed it is:
# The SQLite DB:
$ sqlite3 ca29.gff.sqlite3
SQLite version 3.38.2 2022-03-26 13:51:10
Enter ".help" for usage hints.
sqlite> .schema
CREATE TABLE location (
    contig_id TEXT PRIMARY KEY,
    start INTEGER NOT NULL,
    length INTEGER NOT NULL
);
sqlite> select * from location limit 5;
seq_id|offset|length
00000020-3ae5-4305-8de8-1bfceaa81f1c|16|1276
00000035-1d98-4718-960c-e1c0f9fda8c9|1292|937
00000057-0ac5-467f-a37f-3e64cf9928e8|2229|937
0000005c-da49-428c-95e5-342515731bc0|3166|937
sqlite>

# Single sequence retrieval is almost instantaneous
$ time sqlite3 -tabs ca29.gff.sqlite3 'select * from location where contig_id = "fef21bda-479c-445b-affd-d04761795174"'
fef21bda-479c-445b-affd-d04761795174	250231619577	599

real    0m0.002s
user    0m0.002s
sys     0m0.000s

# Multiple sequence retrieval takes about the same time
$ time sqlite3 -tabs ca29.gff.sqlite3 "select * from location where contig_id in ('fef21bda-479c-445b-affd-d04761795174','fdd86a15-1848-4a2b-be05-f9d62a0ee538','fcdd2245-17b1-4a7a-9f67-1a34802aefd3','ffc97fde-cf41-4440-b50b-86abad46f415','ff9fa0a4-7042-45bc-bf70-c502a1e49522','feed9c3a-9831-4c42-81a9-d072f728c9f2','fce7cd9b-8cf9-41f7-a4a2-04a7814e41b6','fd7fae42-f746-4934-b71e-d738d8da31bc','fd55e51f-6773-4eb4-871d-4174e99547a9','fd581f3d-6ce0-4895-8b38-083c091a4729')"
fcdd2245-17b1-4a7a-9f67-1a34802aefd3	248192112609	943
fce7cd9b-8cf9-41f7-a4a2-04a7814e41b6	248232660866	597
fd55e51f-6773-4eb4-871d-4174e99547a9	248655071744	597
fd581f3d-6ce0-4895-8b38-083c091a4729	248663484363	1276
fd7fae42-f746-4934-b71e-d738d8da31bc	248815519201	2291
fdd86a15-1848-4a2b-be05-f9d62a0ee538	249153912987	937
feed9c3a-9831-4c42-81a9-d072f728c9f2	250214540699	597
fef21bda-479c-445b-affd-d04761795174	250231619577	599
ff9fa0a4-7042-45bc-bf70-c502a1e49522	250893965638	941
ffc97fde-cf41-4440-b50b-86abad46f415	251054447078	937

real    0m0.002s
user    0m0.002s
sys     0m0.000s
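(As an aside, a database like this can be loaded straight from the location TSV – a sketch, assuming SQLite 3.32 or later for .import’s --skip flag:)

# A sketch of building the database from the TSV:
$ sqlite3 ca29.gff.sqlite3 <<'EOF'
CREATE TABLE location (
    contig_id TEXT PRIMARY KEY,
    start INTEGER NOT NULL,
    length INTEGER NOT NULL
);
.mode tabs
.import --skip 1 ca29.gff.loc.tsv location
EOF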
Great! The SQLite3 database lookup is almost immediate! So what’s the problem? Well:
$ du -hs ca29.gff.sqlite3
16G	ca29.gff.sqlite3
The problem is the database is on the order of two-thirds the size of the compressed file it’s searching. This might be a fine trade-off in some cases – after all, it provides nearly instantaneous lookups, which could well be worth the storage space:
$ time bgzip $( sqlite3 -tabs ca29.gff.sqlite3 'select * from location where contig_id = "fffffb2f-878d-4b23-a82f-573fbc1fad5f"' | awk '{ print "-b", $2, "-s", $3 }' ) ca29.gff.gz
##sequence-region fffffb2f-878d-4b23-a82f-573fbc1fad5f 1 1093
fffffb2f-878d-4b23-a82f-573fbc1fad5f	annotation	remark	1	1093	.	.	.	molecule_type=DNA;topology=linear
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	region	1	1093	.	.	.	ID=fffffb2f-878d-4b23-a82f-573fbc1fad5f
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	CDS	1	87	.	-	0	ID=fffffb2f-878d-4b23-a82f-573fbc1fad5f_1;gene_id=fffffb2f-878d-4b23-a82f-573fbc1fad5f_1;gene_number=1;name=fffffb2f-878d-4b23-a82f-573fbc1fad5f_1
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	exon	1	87	.	-	.	gene_id=fffffb2f-878d-4b23-a82f-573fbc1fad5f_1;gene_number=1
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	CDS	212	1093	.	-	0	ID=fffffb2f-878d-4b23-a82f-573fbc1fad5f_2;gene_id=fffffb2f-878d-4b23-a82f-573fbc1fad5f_2;gene_number=2;name=fffffb2f-878d-4b23-a82f-573fbc1fad5f_2
fffffb2f-878d-4b23-a82f-573fbc1fad5f	feature	exon	212	1093	.	-	.	gene_id=fffffb2f-878d-4b23-a82f-573fbc1fad5f_2;gene_number=2

real    0m0.058s
user    0m0.043s
sys     0m0.017s
But that’s an awfully big trade, especially since our previous best – just grep-ing the raw text file for a single result – was around 4 seconds. We’re not using this for an inline service, so the difference between a few seconds and “immediate” really isn’t much for us. Let’s see if we can get any better from our index file.
Back to the Filesystem
So to recap, we’ve got an 8Gb uncompressed TSV file we can compress down to 4Gb, but then we see search times on the order of a minute or so for a set of sequence IDs.
Once again, taking a look at our timings, it’s clear that decompression is our bottleneck. Taking grep out of the picture entirely drives the point home:
# No pipes at all, just dump the file as fast as we can
$ time pv ca29.gff.loc.tsv > /dev/null
7.73GiB 0:00:01 [7.61GiB/s] [==========================================>] 100%

real    0m1.017s
user    0m0.016s
sys     0m1.000s

# Pipe to cat - notice we cut our speed roughly in half right
# off the bat if we want to consume our data:
$ time pv ca29.gff.loc.tsv | cat - > /dev/null
7.73GiB 0:00:01 [4.38GiB/s] [==========================================>] 100%

real    0m1.765s
user    0m0.030s
sys     0m2.838s

# Decompressing the gzipped file is what's killing us, though:
$ time pv ca29.gff.loc.tsv.gz | zcat - > /dev/null
3.62GiB 0:00:59 [62.4MiB/s] [==========================================>] 100%

real    0m59.334s
user    0m57.868s
sys     0m1.973s
Let’s add grep back in, just so we’ve got a picture of where we’re trying to get:
# Grep for our set of 10 sequence IDs
$ time pv ca29.gff.loc.tsv | grep --file seq_ids.txt > /dev/null
7.73GiB 0:00:09 [ 832MiB/s] [==========================================>] 100%

real    0m9.511s
user    0m8.164s
sys     0m2.644s
So grep itself is limited to about 800Mb/s for multiple matches.
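One cheap trick worth noting: our patterns are literal UUIDs, not regular expressions, so grep’s -F (fixed-strings) mode applies. GNU grep generally detects literal patterns on its own, so the gain may be modest, but it costs nothing to try:

# -F skips the regex engine entirely. GNU grep often detects literal
# patterns anyway, so don't expect miracles:
$ pv ca29.gff.loc.tsv | grep -F --file seq_ids.txt > /dev/null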
A Quick Diversion into Alternate Greps
There are a couple alternatives to grep that promise faster searches – let’s take a quick look at ag and ripgrep to see what they offer:
# RipGrep:
$ time pv ca29.gff.loc.tsv | rg --file ./seq_ids.txt > /dev/null
7.73GiB 0:00:18 [ 422MiB/s] [==========================================>] 100%

real    0m18.735s
user    0m15.821s
sys     0m10.066s

# AG, the "silver searcher" - AG does not support a patterns file:
$ time pv ca29.gff.loc.tsv | ag -F 'fef21bda-479c-445b-affd-d04761795174'
fef21bda-479c-445b-affd-d04761795174	250231619577	599
7.73GiB 0:00:20 [ 380MiB/s] [==========================================>] 100%

real    0m20.781s
user    0m15.630s
sys     0m19.217s
Both tools are half the speed of grep, and ag doesn’t support supplying multiple queries in a file. Generally, both of these tools are built to rapidly search a large tree of files, whereas we’re asking them to search one big file – as always, us humble bioinformaticians are left out in the cold. We’re not going to get saved by a different search utility – it looks like we’re stuck with grep.
Back to BGZip
So our bottleneck is decompression – remember earlier when I mentioned BGZip’s two big tricks? The multi-threaded decompression didn’t matter much for us before, since we were seeking directly to just the bytes we needed, but now we’re searching the whole file – let’s see if BGZip’s got something up its sleeve for us. As a reminder, here’s standard zcat:
# Standard zcat:
$ time pv ca29.gff.loc.gz | zcat - > /dev/null
3.61GiB 0:01:03 [58.2MiB/s] [==========================================>] 100%

real    1m3.615s
user    1m2.497s
sys     0m1.620s
And here’s bgzip with no threading, after we recompress the locations file using bgzip:
$ time pv ca29.gff.loc.tsv.gz | bgzip -c -d - > /dev/null
3.62GiB 0:00:24 [ 153MiB/s] [==========================================>] 100%

real    0m24.102s
user    0m22.340s
sys     0m1.819s
Already we’re roughly two and a half times as fast – because BGZip can interleave reading and decompressing and make some better assumptions about the source data, we’re not bottlenecked quite as badly on decompression. But let’s see what happens when we actually use the parallel decompression:
# Two threads
$ time pv ca29.gff.loc.tsv.gz | bgzip --threads 2 -c -d > /dev/null
3.62GiB 0:00:11 [ 320MiB/s] [==========================================>] 100%

real    0m11.573s
user    0m23.952s
sys     0m2.653s

# Four threads:
$ time pv ca29.gff.loc.tsv.gz | bgzip --threads 4 -c -d > /dev/null
3.62GiB 0:00:05 [ 640MiB/s] [==========================================>] 100%

real    0m5.788s
user    0m23.714s
sys     0m2.449s

# Eight threads:
$ time pv ca29.gff.loc.tsv.gz | bgzip --threads 8 -c -d > /dev/null
3.62GiB 0:00:02 [1.23GiB/s] [==========================================>] 100%

real    0m2.953s
user    0m24.297s
sys     0m2.621s

# 16 threads:
$ time pv ca29.gff.loc.tsv.gz | bgzip --threads 16 -c -d > /dev/null
3.62GiB 0:00:02 [1.49GiB/s] [==========================================>] 100%

real    0m2.439s
user    0m29.208s
sys     0m2.514s
These are enormous speedups in decompression, especially compared to the naive zcat approach – BGZip is doing some really fantastic work with threading here. You can see an almost linear increase in read speed right up through 8 threads, although we start to drop off above that – coordination takes its toll, eventually.
Let’s see what this looks like combined with grep. We’ll use 4 threads, and we’ll take pv out of the picture:
$ time bgzip --threads 4 -c -d ca29.gff.loc.tsv.gz | grep --file seq_ids.txt > /dev/null

real    0m10.091s
user    0m33.097s
sys     0m5.670s
10 seconds – that’s roughly what we were getting with the uncompressed search. BGZip’s not as fast as just reading an uncompressed file, but it’s fast enough to keep up with grep – decompression is no longer our bottleneck, and we can keep our compressed index file with no speed sacrifice.
A Couple More Grep Tricks
Two more notes on grep to close out our testing.
First, we’ve been searching for multiple results in our index file, since that’s a relatively normal use case for us, but that actually imposes a pretty steep penalty on grep:
$ time pv ca29.gff.loc.tsv | grep "fef21bda-479c-445b-affd-d04761795174" > /dev/null
7.73GiB 0:00:03 [1.96GiB/s] [==========================================>] 100%

real    0m3.944s
user    0m2.644s
sys     0m2.616s
Fortunately, the real-world effects here are pretty limited – BGZip is still fast enough that, roughly, this doesn’t matter:
$ time bgzip --threads 8 -c -d ca29.gff.loc.tsv.gz | grep --file just_one_seq_id.txt > /dev/null

real    0m4.138s
user    0m29.531s
sys     0m5.396s
One change we can make, though, that does make a difference is telling grep when to stop. We’re looking for exactly one line per sequence, so we shouldn’t ever get more lines than we have in our input. The -m flag tells grep how many results to expect:
# Grab a sequence ID towards the front of our file:
$ head -n100 ca29.gff.loc.tsv | tail -n1
00000c54-06f6-438b-9474-c56d4ff130b2	196289	932

# Grep without a hint of our count:
$ time cat ca29.gff.loc.tsv | grep "00000c54-06f6-438b-9474-c56d4ff130b2" > /dev/null

real    0m3.134s
user    0m0.026s
sys     0m4.138s

# Tell grep how many results to expect:
$ time cat ca29.gff.loc.tsv | grep -m1 "00000c54-06f6-438b-9474-c56d4ff130b2" > /dev/null

real    0m0.002s
user    0m0.000s
sys     0m0.002s
We’ve been playing a bit of a trick with our timings so far, because we’ve been using the most pathological case we can – 10 sequence IDs all towards the end of the file, which means grep needs to search the entire file to find our results. In more realistic use cases, we’d have a more evenly distributed set of sequence IDs, and adding the -m flag to our searches would make a much bigger difference. The takeaway here is that the 10 seconds we saw above could be considered our worst-case scenario – at worst, it will take 10 seconds to retrieve a set of results from our file. We can live with that.
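Putting the pieces together, a realistic multi-ID lookup looks something like this sketch – -m10 is safe here because each of our 10 IDs matches exactly one line:

# grep's -m counts total matching lines, so -m10 assumes exactly one
# line per sequence ID:
$ bgzip --threads 4 -c -d ca29.gff.loc.tsv.gz | grep -F -m10 --file seq_ids.txt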
Conclusion
Final Layout
Our final layout for our library includes four files:
- ca29.gff.gz (24Gb) – The full BGZip-compressed GFF file
- ca29.gff.gz.gzi (59Mb) – BGZip’s index file
- ca29.gff.loc.gz (3.7Gb) – The compressed sequence location TSV file
- ca29.gff.loc.gz.gzi (2.0Mb) – BGZip’s index file
The total space for all four files is around 28Gb, or around 12% the size of our original uncompressed GFF library. Retrieving a single sequence can be performed in one line of bash with only the sequence ID and the library in a matter of seconds:
$ time bgzip $( bgzip -c -d --threads 4 ./ca29.gff.loc.tsv.gz | grep -m1 -F "78ef59ad-94b9-4944-9a3e-e1c171fc0df5" | awk '{ print "-b", $2, "-s", $3 }' ) ca29.gff.gz > 78ef59ad-94b9-4944-9a3e-e1c171fc0df5.gff

real    0m2.803s
user    0m12.415s
sys     0m2.399s

$ cat 78ef59ad-94b9-4944-9a3e-e1c171fc0df5.gff
##sequence-region 78ef59ad-94b9-4944-9a3e-e1c171fc0df5 1 1620
78ef59ad-94b9-4944-9a3e-e1c171fc0df5	annotation	remark	1	1620	.	.	.	molecule_type=DNA;topology=linear
78ef59ad-94b9-4944-9a3e-e1c171fc0df5	feature	region	1	1620	.	.	.	ID=78ef59ad-94b9-4944-9a3e-e1c171fc0df5
78ef59ad-94b9-4944-9a3e-e1c171fc0df5	feature	CDS	2	424	.	-	0	ID=78ef59ad-94b9-4944-9a3e-e1c171fc0df5_1;gene_id=78ef59ad-94b9-4944-9a3e-e1c171fc0df5_1;gene_number=1;name=78ef59ad-94b9-4944-9a3e-e1c171fc0df5_1
78ef59ad-94b9-4944-9a3e-e1c171fc0df5	feature	exon	2	424	.	-	.	gene_id=78ef59ad-94b9-4944-9a3e-e1c171fc0df5_1;gene_number=1
78ef59ad-94b9-4944-9a3e-e1c171fc0df5	feature	CDS	435	1619	.	+	0	ID=78ef59ad-94b9-4944-9a3e-e1c171fc0df5_2;gene_id=78ef59ad-94b9-4944-9a3e-e1c171fc0df5_2;gene_number=2;name=78ef59ad-94b9-4944-9a3e-e1c171fc0df5_2
78ef59ad-94b9-4944-9a3e-e1c171fc0df5	feature	exon	435	1619	.	+	.	gene_id=78ef59ad-94b9-4944-9a3e-e1c171fc0df5_2;gene_number=2
Closing Remarks
Our original goal was to facilitate easy, fast, and cheap extraction of gene calls for a specific sequence from a library-level dataset. BGZip, part of the Samtools/HTSLib package, provides blazing fast indexed access into arbitrarily large gzip-compatible compressed files, with a very small space penalty for its index file. In addition, BGZip’s threaded compression and decompression, along with some other optimizations, allow us to retrieve specific sequence locations from a compressed text location file as quickly as if the file were not compressed. The combination of these two features gives us the ability to retrieve information for one or more sequences from our dataset in a matter of seconds.
Taken together, this allows us to keep our gene calls available on disk in a common bioinformatics format and to work with those files using standard *nix tools. Within our pipelines, we can go from a list of sequence IDs to the gene calls for those individual sequences in seconds. The approach generalizes to other file types as well – any file format in which the data for an individual item is written contiguously in the file and in which a single item can be read individually – such as FASTAs, JSONL, and others – is amenable to this approach.
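For instance, here’s the same trick sketched against a hypothetical library.fasta, assuming IDs are the first word of each > header:

# Hypothetical example: build the same two files for a FASTA library,
# flushing an entry at each '>' header instead of each '##sequence-region':
$ awk 'BEGIN { OFS = "\t"; print "seq_id", "offset", "length" }
    /^>/ {
        if (id != "") print id, start, offset - start
        id = substr($1, 2); start = offset
    }
    { offset += length($0) + 1 }
    END { if (id != "") print id, start, offset - start }' \
    library.fasta > library.fasta.loc.tsv
$ bgzip --index library.fasta   # compress and write the .gzi index in one step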
In our next post, we will explore even faster ways to retrieve data from very large files!
(Feature photo by Joshua Sortino on Unsplash)