“What’s in the water?”
Metagenomics can help answer this question. Metagenomics is the study of the genetic makeup of communities of organisms. The communities can be derived from various environmental sources, including water samples, soil samples, internal organs or microbiomes, bioreactors, etc. – basically any source where multiple species of organisms reside.
Metagenomics has become an important application of Next Generation Sequencing (NGS). A primary goal of NGS-based metagenomics studies is to identify the species composition of environmental samples, in other words, to answer the question “what species are in the samples?” Recently, our NGS Core took on an exciting metagenomics project that eventually required the services of the world’s largest public supercomputer, which we’ll describe here.
Everglades National Park is a vast ecosystem of marshlands, estuarine swamps, and freshwater sloughs covering 1.5M acres in southern Florida. A variety of endangered and threatened species inhabit the Park, including the American crocodile, the American alligator, sea turtles, and numerous birds, plants, insects, and mammals. A similar number of invasive species are encroaching on the Park, including several python species, feral pigs, and a variety of exotic plant, fish, and invertebrate species.
Recently, the USDA National Wildlife Research Center (NWRC) in Fort Collins, CO, conducted a metagenomics study to identify endangered and invasive species in Everglades watersheds. Dr. Toni Piaggio at NWRC was the principal investigator for this research. The main goal of the study was to see if Next Generation Sequencing technology could be used to identify which species were present in water samples at various locations in the Park. The hypothesis was that DNA shed by animals and other organisms moving through the water could be picked up by ultra-sensitive NextGen DNA sequencers. We tested this hypothesis in the NextGen Sequencing Core at Colorado State University.
The NextGen Core manages several DNA sequencers including the Life Technologies Ion Proton. The Proton generates about 120M “reads”, or sequenced DNA fragments, per sample, which is just sufficient to perform metagenomic data analysis. The Proton uses an intriguing semiconductor sequencing technology to read DNA fragments.
(Semiconductor Sequencing Chip)
By leveraging the same CMOS integrated circuit design used in cell phones and other devices, the Proton can sequence DNA samples quickly, usually in under 20 hours per run. It generates so-called “long reads” of around 200 base pairs, which is important for subsequent data analysis. For metagenomics research, longer reads generally yield better results, so we like to see read lengths above 100 base pairs if possible.
(Metagenomics Data Analysis Pipeline, adapted from Huson et al.)
Over a period of several years we’ve developed a metagenomics pipeline for the analysis of NextGen sequencing data, shown above. We tweak the pipeline for specific projects depending on the goals of the research. For the Everglades project, we ran water samples on the Proton sequencer generating on average 100M reads per sample. Initially, we used mpiBLAST and the NCBI nucleotide database (nt) on the CSU Cray XT6m supercomputer for running DNA sequence alignments. mpiBLAST is an MPI-enabled parallelized version of the traditional BLAST algorithm for sequence alignments. The nt database includes DNA sequences from a broad and diverse collection of species including eukaryotes and prokaryotes. The BLAST alignment results are fed to MEGAN, a software tool for species identification and taxonomic classification.
It became apparent to us that mpiBLAST and the Cray XT6m were insufficient to handle the very large number of sequence alignments required by Proton sample sizes and the size of the nt database. The full nt database contains around 15M DNA sequences; the metagenomic sample datasets each contain about 100M reads. An all-against-all database query would therefore require 15M × 100M = 1.5e15 (1.5 quadrillion) sequence alignments. Using 500 CPU cores on the XT6m for week-long runs, we could only query about 5% of the nt database. Clearly, this was well short of our goal of full queries of the nt database.
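The arithmetic behind these figures can be checked with a short Python sketch. The four input constants are the approximate numbers quoted above; the projected run time and core-hours are our own extrapolation and assume that throughput stays constant across the whole database, which is a simplification. BLAST also relies on heuristics rather than literally aligning every pair, so the product is best read as the size of the search space.

```python
# Back-of-the-envelope sizing of an all-against-all BLAST query.
# Inputs are the approximate figures quoted in the text; derived values
# are rough extrapolations that assume constant throughput.

DB_RECORDS = 15_000_000          # sequences in the nt database
READS_PER_SAMPLE = 100_000_000   # Proton reads in one sample
FRACTION_PER_WEEK = 0.05         # share of nt covered by one week-long run
CORES_USED = 500                 # XT6m cores used per run


def pairwise_alignments(db_records: int, reads: int) -> int:
    """Read-versus-record comparisons in an exhaustive query."""
    return db_records * reads


def weeks_for_full_query(fraction_per_week: float) -> float:
    """Weeks needed to cover the whole database at the observed rate."""
    return 1.0 / fraction_per_week


total = pairwise_alignments(DB_RECORDS, READS_PER_SAMPLE)
weeks = weeks_for_full_query(FRACTION_PER_WEEK)
core_hours = weeks * 7 * 24 * CORES_USED

print(f"Search space: {total:.1e} read/record pairs")
print(f"Projected time for one sample on {CORES_USED} cores: {weeks:.0f} weeks")
print(f"Projected cost: {core_hours:,.0f} core-hours")
```

Under these assumptions a single sample would tie up the 500 cores for months, which is why a larger machine and a better-scaling tool were needed.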
We also discovered that mpiBLAST did not scale well beyond a few hundred CPU cores. We expected strong scaling, but instead found that mpiBLAST’s speedup leveled off as we added cores, which limited the number of sequence alignments achievable on the XT6m.
To improve database query and scaling performance, we received a Director’s Discretion grant on the Oak Ridge Titan Cray XK7 Supercomputer. The Titan includes about 300K CPU cores, considerably more than the XT6m’s roughly 2K CPU cores. We wanted to explore the capabilities of the Titan and its application to DNA sequence alignment algorithms.
Furthermore, our colleagues at Oak Ridge recently developed a new BLAST tool, HSP-BLAST (Highly Scalable Parallel BLAST), with vastly improved performance characteristics over mpiBLAST. HSP-BLAST showed nearly linear scaling with the number of CPU cores, which simply means that for a given problem the speedup grows in proportion to the core count, i.e. doubling the number of CPU cores roughly doubles the speedup of the code. This is particularly important when scaling up to, say, 100K cores on very large datasets. So we switched over to HSP-BLAST for all of our sequence alignment runs and large-scale database queries.
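To make the scaling terminology concrete, here is a minimal Python sketch that computes speedup and parallel efficiency from wall-clock times at different core counts. The timings are hypothetical numbers chosen for illustration, not measurements from Titan or the XT6m, and the function name scaling_report is our own.

```python
from typing import Dict, List, Tuple


def scaling_report(timings: Dict[int, float]) -> List[Tuple[int, float, float]]:
    """Return (cores, speedup, efficiency) relative to the smallest run.

    timings maps a core count to the wall-clock time of the same job.
    Efficiency of 1.0 means perfectly linear scaling.
    """
    base_cores = min(timings)
    base_time = timings[base_cores]
    report = []
    for cores in sorted(timings):
        speedup = base_time / timings[cores]
        ideal = cores / base_cores  # speedup expected under linear scaling
        report.append((cores, speedup, speedup / ideal))
    return report


# Hypothetical wall-clock hours for one fixed-size query (illustration only).
HYPOTHETICAL_HOURS = {1_000: 64.0, 2_000: 32.0, 4_000: 16.5, 8_000: 8.8}

for cores, speedup, efficiency in scaling_report(HYPOTHETICAL_HOURS):
    print(f"{cores:>6} cores  speedup {speedup:5.2f}x  efficiency {efficiency:6.1%}")
```

A tool that scales near-linearly keeps efficiency close to 100% as cores are added; a tool whose speedup levels off shows efficiency falling toward zero.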
The combination of Titan hardware and HSP-BLAST software proved to be considerably faster than we anticipated. Consequently, for the Titan BLAST runs we concatenated several databases into a single monolithic database, which included the NCBI Nucleotide DB (GenBank, RefSeq), NCBI Environmental Nucleotide DB, Silva DB, Greengenes DB, and RDP DB. The combined database had approximately 45M records. The NCBI Nucleotide database includes DNA sequences from a broad and diverse group of species including eukaryotes and prokaryotes. As the name implies, the NCBI Environmental Nucleotide database consists mostly of DNA sequences from environmental samples. The Silva, Greengenes, and RDP databases include primarily ribosomal sequences that are useful in identifying microbial species.
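As a rough illustration of what concatenating sequence databases can look like, the Python sketch below merges several FASTA streams into one and prefixes each header with a source label so that records from different databases stay distinguishable. This is an assumption-laden sketch, not the procedure we actually ran: the function name merge_fasta, the label format, and the toy records are all hypothetical.

```python
import io
from typing import Iterable, TextIO, Tuple


def merge_fasta(sources: Iterable[Tuple[str, TextIO]], out: TextIO) -> int:
    """Concatenate FASTA streams into out and return the record count.

    Each header line (starting with '>') is rewritten as '>label|original'
    so the source database of every record remains visible.
    """
    count = 0
    for label, handle in sources:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines between records
            if line.startswith(">"):
                out.write(f">{label}|{line[1:]}\n")
                count += 1
            else:
                out.write(line + "\n")
    return count


# Toy in-memory inputs standing in for real database files (hypothetical).
nt = io.StringIO(">seq1 example record\nACGTACGT\n")
silva = io.StringIO(">seq1 another record\nGGCCTTAA\n\n>seq2\nTTTT\n")
merged = io.StringIO()
n_records = merge_fasta([("nt", nt), ("silva", silva)], merged)
print(n_records, "records written")
print(merged.getvalue(), end="")
```

Prefixing headers is one simple way to avoid identifier collisions between databases; deduplicating sequences shared by several sources would be a separate step not shown here.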
The largest BLAST query ran on 100K CPU cores, about one third of the Titan, and took about 9 hours. However, most BLAST queries used on the order of a few thousand to a few tens of thousands of CPU cores; these jobs typically ran for a few to several hours. We saw near-linear scaling of HSP-BLAST in all runs, a very encouraging sign for this sequence alignment tool.
For the data analysis step, we imported Titan BLAST sequence alignment results into MEGAN. MEGAN assigns DNA reads to nodes in the NCBI taxonomy, which currently contains around 1M species. With MEGAN you can roll up to the kingdom level of the taxonomy, or drill down to the species level.
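MEGAN places each read using a lowest common ancestor (LCA) approach: a read whose BLAST hits span several taxa is assigned to the deepest taxonomy node shared by all of them. The Python sketch below shows the core idea on a toy taxonomy. The parent table is heavily abbreviated and hypothetical, and the sketch leaves out the score and support thresholds that MEGAN applies before assignment.

```python
from typing import Dict, List, Optional, Sequence

# Toy, heavily abbreviated taxonomy (child -> parent); illustration only.
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "Eukaryota": "root",
    "Chordata": "Eukaryota",
    "Crocodylia": "Chordata",
    "Alligator mississippiensis": "Crocodylia",
    "Crocodylus acutus": "Crocodylia",
    "Python bivittatus": "Chordata",
}


def lineage(taxon: str) -> List[str]:
    """Path from the root down to the given taxon."""
    path = []
    node: Optional[str] = taxon
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path[::-1]


def lowest_common_ancestor(hit_taxa: Sequence[str]) -> Optional[str]:
    """Deepest node shared by all hit taxa; None if the read has no hits."""
    lineages = [lineage(t) for t in hit_taxa]
    lca = None
    for level in zip(*lineages):  # walk down from the root, level by level
        if all(name == level[0] for name in level):
            lca = level[0]
        else:
            break
    return lca


# A read hitting both crocodilians is placed at their shared order.
print(lowest_common_ancestor(["Alligator mississippiensis", "Crocodylus acutus"]))
# A read with no hits at all stays unassigned.
print(lowest_common_ancestor([]))
```

This is why reads with ambiguous hits end up higher in the tree, while reads that match a single species can be assigned at the species level.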
The figure below shows the results of Titan BLAST metagenomic water samples at the species level for four sample sites in Everglades National Park.
(MEGAN Phylogenetic Classification)
Several thousand species are represented by the left-hand vertical column, including archaea, bacteria, viruses, and eukaryotes such as plants and animals. The species list included hits to endangered, threatened, and invasive species. Species names are shown along the right-hand side. Vertical colored bars show the number of DNA reads assigned to each species in each of the four samples. For some species there are clear differences between the sample sites, whereas for others the read counts are fairly uniform.
Interestingly, near the bottom of the report there is a category of “Not assigned” DNA reads. Despite the very large database query, a substantial number of reads were not assigned to any species. This could be because the species they came from are not represented in the NCBI Taxonomy database. As a further check, we’re running de novo assembly on the “Not assigned” reads to see if we can construct (partial) genes, genomes, or DNA sequences that match any known species.
The Titan Supercomputer is truly an extraordinary machine. It allowed us to run large-scale BLAST queries of metagenomic NextGen sequencing data against 45M-record taxonomic databases. Petaflop-scale runs with multi-quadrillion sequence alignments were the norm. The Titan runs identified several thousand species in metagenomic water samples from the Florida Everglades, and yielded some intriguing “Not assigned” DNA reads for further analysis.