This Blog is Moving!

For those of you subscribed to and following this blog: it is moving to a new website, effective immediately.  The new website is:

For continuity, all of my previous posts have been migrated to the new site, and I will publish all new posts there as well.  I hope you join me there!

View Richard Casey's profile on LinkedIn


Genome Assembly without the RAM



(De novo assembly of Next Generation Sequencing (NGS) DNA reads)

Assembling genomes with little memory.

De novo assembly is the construction of genomes without reference to existing known genomes, and a wide variety of de novo assembly tools is available for the task.  It’s well-known that many of these tools require substantial amounts of memory for assembling “large” genomes.  For these assemblers, a “large genome” usually means something greater than about 1 Gb (gigabase, one billion nucleotides), although some genomes exceed this length by a considerable amount.

From experience we know that some assemblers require several hundred GB (gigabytes) up to 1 TB (terabyte) of RAM per compute node, in a cluster computing environment, to assemble large genomes.  This can be very expensive.  Therefore, we’re evaluating a relatively new de novo assembler, Minia, as an alternative to some of the traditional assemblers.  Apparently, Minia can assemble human-scale genomes with about 5 GB of RAM.  Let’s see what our evaluation discovered.


Minia source code is available here, and a useful set of instructions for running Minia is available online.  Papers by Chikhi and Rizk and by Salikhov et al. describe the Minia algorithms.

Sample Data

We’ll assemble the genome of James D. Watson.  Watson’s genomic DNA sequences can be obtained from the NCBI Sequence Read Archive (SRA) via DNAnexus Sequence Read Archive+ and the NCBI ftp server, as follows:

  • SRA Project ID: SRP000095
  • NCBI download:  (login as Guest)
  • Read info: 74,198,831 DNA reads
  • Project title: “The complete genome of an individual by massively parallel DNA sequencing.”  See Wheeler et al.

(The Wheeler et al. paper claims ca. 106.5M reads were generated from their NextGen Sequencing experiments.  However, we only count ca. 74M reads in the NCBI repository.)
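As a sketch, the reads can be pulled with the NCBI sra-toolkit; the run accession below is a placeholder for the actual SRR accessions listed under project SRP000095:

```shell
# Sketch: fetch the Watson reads from SRA with the sra-toolkit.
# RUN_ID is a placeholder; substitute the real SRR run accessions
# listed under project SRP000095.
RUN_ID=SRRXXXXXXX

prefetch "$RUN_ID"            # download the .sra archive
fastq-dump --gzip "$RUN_ID"   # convert to FASTQ for the assembler
```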

System Configuration

We ran Minia on the Colorado State University Cray supercomputer with the following system configuration and Minia run parameters:

  • Cray XT6m
  • (2) 12-core AMD Opteron 6100 CPUs per compute node
  • 32 GB RAM per compute node
  • 1 CPU-thread
  • k-mer size = 31
  • min abundance = 3

We ran Minia single-threaded on one CPU core on one compute node.  The goal of our benchmark runs was to test RAM utilization, not wall-clock performance or scalability per se, hence the choice of single-core runs.  The chosen k-mer size was based on recommendations in Minia documentation for a 3 Gb target genome.  “min abundance” removes erroneous, low abundance k-mers; the documentation suggests a value of 3.
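The run parameters above translate to a command line roughly like the following (Minia 1.x-style positional arguments; file names and the genome-size estimate are placeholders, so check your version’s documentation):

```shell
# Sketch of a Minia invocation matching the run parameters above:
#   reads.fastq  - input NGS reads (placeholder file name)
#   31           - k-mer size
#   3            - min abundance (filters erroneous, low-abundance k-mers)
#   3100000000   - estimated target genome size (ca. 3.1 Gb for human)
#   watson       - output prefix
minia reads.fastq 31 3 3100000000 watson
```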

Side note.  Perusing the source code, we see some OpenMP pragmas

#if OMP
#pragma omp parallel
   if(use_compressed_reads && ! separate_count)

but it’s not clear if parallelization is still in development or production ready.  Therefore, we thought it prudent to run on a single CPU core, instead of multi-core, for the RAM tests.


The results of the Minia de novo assembly run were:

  • Total wall clock time: 40.9 hrs.
  • Max. memory: 2.1 GB

Watson’s genome assembly required 40.9 wall clock hours of run time.  Minia reported several key phases during the run including “counting k-mers” (4.1 hrs.), “writing k-mers” (2.3 hrs.), “Debloom step” (8.3 hrs.), “assembly” (24.0 hrs.), and some miscellaneous steps.  This is a fairly long run but keep in mind that we’re only using a single CPU core for assembly.

The reported RAM usage was 2.1 GB.  Minia documentation estimates ca. 1 – 2 GB RAM consumption per 1 Gb of target genome size.  Human genome is ca. 3.1 Gb, so we would expect about 3 – 6 GB of RAM utilization for Watson’s assembly.  The reported metric of 2.1 GB was somewhat below the predicted value.

Assembly Metrics

We’ll use Quast to examine the Watson genome assembly metrics.
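As a sketch, a minimal Quast invocation for this kind of check might look like the following (file names are placeholders):

```shell
# Sketch: run Quast on the Minia contigs to compute N50, contig
# counts, total length, etc. (file name is a placeholder).
quast.py watson.contigs.fa -o quast_report --min-contig 100
```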

Minia Assembly Metrics

  • N50: 816 bp
  • Longest contig: 9,223 bp
  • Total # contigs: 8,213,731
  • Total # contigs >= 1,000 bp: 218,513
  • Total length: 2,088,316,147 bp
  • Total length >= 1,000 bp: 289,355,173 bp


Comparing these results with Table 2 of Chikhi and Rizk (C&R), we see some interesting trends: a) N50 is similar to the Minia, ABySS, and SOAPdenovo assemblies of another human genome, b) the longest contig is decent at 9K bp but somewhat shorter than those from Minia, C&B (the Conway and Bromage assembler, now Gossamer), and ABySS, c) the total number of contigs was somewhat greater than in the C&R paper, and d) total length was about the same as in the C&R paper.  Overall, the Minia Watson assembly metrics were largely in line with the assemblers covered in the C&R paper.


If your computational systems have limited amounts of RAM, you may want to consider using Minia for de novo assembly.  We assembled a human genome scale NGS dataset (3 Gb) with ca. 2 GB of RAM, about one to two orders of magnitude less than most assemblers require.

We encourage the Minia developers to include a scaffolder tool and an assembly metrics tool in the Minia package.  Hunt et al. offer a comparative analysis of candidate scaffolding tools.  The Assemblathon proceedings (Earl et al., Bradnam et al.) offer some recommendations on assembly metrics.  These two additions would be highly beneficial for end users.

Based on the Minia source code, there appears to be an effort to parallelize the application.  The documentation does not mention parallelization so we have to assume it’s not ready for production release.  A faster parallelized version of Minia would be useful for assembling large genomes.

Given the present state of development for Minia, we would recommend that it be used with some caution, and that any results be independently validated.  However, the very low RAM requirements make Minia an intriguing alternative to other assemblers.







“Accelerating GPU software development”

Nvidia recently announced the availability of the CUDA 7 Release Candidate (RC).  The production version should follow soon.  CUDA 7 continues the long evolution of the CUDA language and products for Nvidia Tesla GPUs.  CUDA 7 is available in the current version of the CUDA Toolkit.  You’ll have to be a CUDA Registered Developer to get the RC version, but of course the production version will be available to the general public.


CUDA 7 includes many new capabilities that should improve the development and performance of GPU-enabled applications.

  • C++11 Support.  C++11 was released a few years ago (2011).  It includes a long list of new language extensions, Standard Template Library (STL) enhancements, and language usability improvements.
  • Thrust v.1.8.  Substantial performance improvements in various Thrust algorithms, such as sort, merge, scan and others.  Speedups ranging from 15% – 300%.
  • cuSOLVER.  A new numerical library for sparse and dense linear solvers.
  • cuFFT.  Up to 3X performance improvements for Fast Fourier Transform routines.
  • Runtime Compilation Library.  Permits highly-tuned, data-dependent runtime compilation of code.
  • GPU Core Dumps.  Improves code debugging by providing GPU core dumps along with CPU core dumps.
  • Memcheck Tools.  Minimize software defects by tracking and reporting uninitialized global memory.
  • IBM Power8.   Supports the new IBM Power8 CPU architecture.

We’ll be applying CUDA 7 in our ongoing development efforts to GPU-enable Next Generation Sequencing data analysis pipelines.  Stay tuned for updates on this endeavor.



Human Genome on K80 GPU




(Human Genome derived from UCSC Genome Browser)

Next Generation Sequencing (NGS) technology is advancing at a rapid pace.  New DNA sequencers from Illumina and ThermoFisher/LifeTechnologies can generate tens of millions to several billion reads (DNA/cDNA fragments) in a few hours or days.  With barcoding and multiplexing, several samples can be run on a single chip or lane. This combination of high read count and multiple samples per run presents a real challenge for the bioinformatics data analysis community.

NGS data analysis often includes a time-consuming sequence alignment step.  Typically, the DNA reads coming off the sequencers are aligned to a so-called reference genome.  This step attempts to map sample reads to the reference genome, and the results are used for SNP analysis and other types of reports.  The current human reference genome contains about 3.1 billion nucleotide bases.  This is a fairly large genome (although there are considerably larger genomes in other species).  Sequence alignment of NGS sample reads against the reference human genome can take a few to several hours on a reasonably high-end server or cluster.  Although this is not excessive for a single sample, with the sheer number of samples handled by current sequencers, and with multiple sequencers in an NGS Core or lab, the accumulated time spent in the sequence alignment step can be problematic.

To help alleviate this alignment issue, we have been evaluating parallel sequence alignment algorithms and software alignment tools.  As mentioned in an earlier blogpost, we’re testing the NVBIO suite on Nvidia GPUs.  nvBowtie is a GPU-enabled sequence alignment tool in this software suite.  bowtie2 is a popular CPU-only counterpart to nvBowtie.  To compare the performance of these two applications, we performed a DNA sequence alignment benchmark test with human genome datasets.  The results are presented here.

Sample Data

We obtained a sample test case from the NCBI Sequence Read Archive (SRA).  The SRA contains raw sequencing data from a variety of NGS vendors, DNA sequencing instruments, and research centers.  I recommend using the DNAnexus Sequence Read Archive+ query interface to search SRA; it’s considerably more user-friendly than the direct SRA interface.

For the benchmark tests, we used SRA data from the following project:

  • SRA run ID: SRR1058066
  • NGS platform: Illumina HiSeq 2500
  • Read info: 44,433,542 paired-end reads
  • Species: Homo sapiens
  • Project title: “Targeted next-generation sequencing of head and neck squamous cell carcinoma identifies novel genetic alterations in HPV+ and HPV- tumors.”  For more information about the study see Lechner et al.

We used human genome build hg38 as the reference genome.  Therefore, we ran sequence alignments of SRR1058066 against hg38.
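The CPU side of such an alignment can be sketched with standard bowtie2 commands; the file names below are placeholders:

```shell
# Sketch: build the hg38 index once, then align the paired-end
# SRR1058066 reads with bowtie2 (file names are placeholders).
bowtie2-build hg38.fa hg38

bowtie2 -p 12 -x hg38 \
        -1 SRR1058066_1.fastq -2 SRR1058066_2.fastq \
        -S SRR1058066_hg38.sam
```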

Nvidia Tesla K80 GPU

(Tesla K80 GPU)

The Tesla K80 is Nvidia’s newest GPU in the Tesla series. The K80 specifications are:

  • 4,992 GPU cores
  • 24 GB GDDR5 RAM
  • 480 GB/sec. memory bandwidth
  • 300 W power consumption (important spec!)

We ran the benchmark tests on a single K80 GPU in an Intel processor cluster.

Sequence Alignment Tools

For the benchmark comparisons we used the nvBowtie v.0.9 (from the NVBIO suite) and bowtie2 v.2.2.4 sequence alignment tools.  nvBowtie is designed for highly parallel GPU-only sequence alignments, whereas bowtie2 is designed for moderately parallel CPU-only alignments.  In a sense this is a comparison of fine-grained vs. coarse-grained parallelism.

The numerical simulations needed for this work were performed on Microway’s Tesla GPU Test Drive accelerated compute cluster.  We thank Microway for providing access to a GPU-enabled cluster to run the benchmark codes.

CPU-only Test

For the CPU-only tests, we ran bowtie2 with the following system and run configurations:

  • Cray XT6m
  • (2) 12-core AMD Opteron 6100 CPUs per compute node
  • 32 GB RAM per compute node
  • 12 CPU-threads
  • Ran bowtie2 on a single compute node

The results of the CPU-only sequence alignment runs were:

  • 206 min. or 3.4 hrs.

GPU-only Test

For the GPU-only tests, we ran nvBowtie with the following system and run configurations:

  • Intel cluster
  • (2) 12-core Xeon E5-2680v3 CPU’s per compute node
  • 128 GB host CPU RAM per compute node
  • (1) Nvidia Tesla K80 GPU per compute node
  • Ran nvBowtie on a single compute node

The results of the GPU-only sequence alignment runs were:

  • 16 min. or 0.25 hrs.


Comparing the CPU-only bowtie2 run versus the GPU-only nvBowtie run, the recorded speedup was:

  •  206 min. / 16 min. = 12.8X

The chart below shows the speedup of nvBowtie vs. bowtie2.

(Speedup of human genome sequence alignment with CPU-only bowtie2 vs. GPU-only nvBowtie)

The 12.8X speedup of nvBowtie on a K80 GPU is encouraging.  For human genome sequence alignments, this reduced the wall-clock run time from several hours to a few minutes.

The older AMD Opteron series processors used in these tests have been superseded by newer AMD FX series and Intel 5th generation processors, among others.  The speedups seen here would undoubtedly be reduced somewhat when compared to newer CPU processor models.  However, nvBowtie is currently at version 0.9 (not even version one yet).  We would expect continued algorithm development and optimization in nvBowtie to result in performance improvements for the code, thus maintaining speedups somewhere in the 8X – 10X range.  In any event, shaving hours off the sequence alignment runtimes is significant.

NGS Pipeline Updates 

In our NextGen Sequencing Core we manage numerous bioinformatics pipelines for the analysis of NGS datasets, including:

  • DNA-Seq
  • RNA-Seq
  • ChIP-Seq
  • Methyl-Seq
  • Metagenomics
  • Metatranscriptomics

In each pipeline the main computational bottleneck is the DNA (cDNA) sequence alignment step.  This step can take a few to several hours on our high-end servers and clusters.  In a busy NGS Core with a large number of samples to process, this can be a significant impediment to data analysis.  Given the encouraging results from the nvBowtie benchmark runs described here, we are migrating our sequence alignment tools to NVBIO.  We’ll run A/B tests with bowtie2 and nvBowtie to compare performance and numerical results between the two tools, and we’ll see if this improves the turnaround time for bioinformatics data analysis.



Titan-Scale Metagenomics




(Oak Ridge Titan Supercomputer)

“What’s in the water?”

Metagenomics can help answer this question.  Metagenomics is the study of the genetic makeup of communities of organisms.  The communities can be derived from various environmental sources, including water samples, soil samples, internal organs or microbiomes, bioreactors, etc. – basically any source where multiple species of organisms reside.

Metagenomics has become an important research branch in Next Generation Sequencing (NGS).  In NGS, a primary goal of metagenomics studies is to identify the species composition of environmental samples, in other words, to answer the question “what species are in the samples”.  Recently, our NGS Core was engaged in an exciting metagenomics project that eventually engaged the services of the world’s largest public supercomputer, which we’ll describe here.


The Florida Everglades National Park is essentially a slow shallow river flowing southward from Lake Okeechobee to the Florida Bay.


(Florida Everglades)

It’s a vast ecosystem of marshlands, estuarine swamps, and freshwater sloughs covering 1.5M acres in southern Florida.  A variety of endangered and threatened species inhabit the park, including the American crocodile, American alligator, sea turtles, and numerous birds, plants, insects, and mammals.  A number of invasive species are also encroaching on the Park, including several python species, feral pigs, and a variety of exotic plant, fish, and invertebrate species.

Recently, the USDA National Wildlife Research Center (NWRC) in Fort Collins, CO, conducted a metagenomics study to identify endangered and invasive species in Everglades watersheds.  Dr. Toni Piaggio at NWRC was the principal investigator for this research.  The main goal of the study was to see whether Next Generation Sequencing technology could be used to identify what species were present in water samples at various locations in the Park.  Presumably, DNA shed by individuals coursing through the water may be picked up by ultra-sensitive NextGen DNA sequencers.  We tested this hypothesis in the NextGen Sequencing Core at Colorado State University.

NextGen Sequencers


(Ion Proton DNA Sequencer)

The NextGen Core manages several DNA sequencers including the Life Technologies Ion Proton.  The Proton generates about 120M “reads”, or DNA fragments, per sample, which is just sufficient to perform metagenomic data analysis.  The Proton uses an intriguing semiconductor sequencing technology to generate DNA fragments.


(Semiconductor Sequencing Chip)

By leveraging the same CMOS integrated circuit design used in cell phones and other devices, the Proton can sequence DNA samples quickly, usually under 20 hours per run.  It generates so-called “long reads” around 200 base pairs, which is important for subsequent data analysis.  For metagenomics research, longer reads generally yield better results so we like to see read lengths > 100 base pairs if possible.

Metagenomics Pipeline

(Metagenomics Data Analysis Pipeline, adapted from Huson et al.)

Over a period of several years we’ve developed a metagenomics pipeline for the analysis of NextGen sequencing data, shown above.  We tweak the pipeline for specific projects depending on the goals of the research.  For the Everglades project, we ran water samples on the Proton sequencer generating on average 100M reads per sample.  Initially, we used mpiBLAST and the NCBI nucleotide database (nt) on the CSU Cray XT6m supercomputer for running DNA sequence alignments.  mpiBLAST is an MPI-enabled parallelized version of the traditional BLAST algorithm for sequence alignments.  The nt database includes DNA sequences from a broad and diverse collection of species including eukaryotes and prokaryotes.  The BLAST alignment results are fed to MEGAN, a software tool for species identification and taxonomic classification.
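A sketch of this alignment stage, assuming the legacy blastall-style flags that mpiBLAST mirrors (database paths, file names, and rank counts are illustrative):

```shell
# Sketch: fragment the nt database once across the workers
# (-p F = nucleotide database).
mpiformatdb -N 498 -i nt -p F

# Query sample reads against nt; mpiBLAST reserves two extra MPI
# ranks for its scheduler and writer processes, hence 498 + 2 = 500.
mpirun -np 500 mpiblast -p blastn -d nt \
       -i sample_reads.fasta -o sample_hits.txt
```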

(CSU Cray Supercomputer)

Computational Requirements

It became apparent to us that mpiBLAST and the Cray XT6m were insufficient to handle the very large number of sequence alignments required by Proton sample sizes and the size of the nt database.  The full nt database contains around 15M DNA fragments; the metagenomic sample datasets each contain about 100M reads.  An all-against-all database query would therefore require 15M X 100M = 1.5e15 (1.5 quadrillion) sequence alignments.  Utilizing 500 CPU cores on the XT6m for week-long runs, we could only query about 5% of the nt database.  Clearly, this was well short of our goal of full queries of the nt database.

We also discovered that mpiBLAST did not scale well beyond a few hundred CPU cores.  We expected strong scaling but instead found mpiBLAST’s speedup flattening out, thus limiting the number of sequence alignments achievable on the XT6m.

To improve database query and scaling performance, we received a Director’s Discretion grant on the Oak Ridge Titan Cray XK7 Supercomputer.  The Titan includes about 300K CPU cores, considerably larger than the XT6m, which has about 2K CPU cores.  We wanted to explore the capabilities of the Titan and its application to DNA sequence alignment algorithms.


(Titan Supercomputer)

Furthermore, our colleagues at Oak Ridge recently developed a new BLAST tool, HSP-BLAST (Highly Scalable Parallel BLAST), with vastly improved performance characteristics over mpiBLAST.  HSP-BLAST showed nearly linear scaling, which simply means that speedup increases linearly with the number of CPU cores, i.e., doubling the number of CPU cores doubles the speedup of the code.  This is particularly important when scaling up to, say, 100K cores on very large datasets.  So we switched to HSP-BLAST for all of our sequence alignment runs and large-scale database queries.

The combination of Titan hardware and HSP-BLAST software proved to be considerably faster than we anticipated.  Consequently, for the Titan BLAST runs we concatenated several databases into a single monolithic database, which included the NCBI Nucleotide DB (GenBank, RefSeq), NCBI Environmental Nucleotide DB, Silva DB, Greengenes DB, and RDP DB. The combined database had approximately 45M records. The NCBI Nucleotide database includes DNA sequences from a broad and diverse group of species including eukaryotes and prokaryotes. As the name implies, the NCBI Environmental Nucleotide database consists mostly of DNA sequences from environmental samples. The Silva, Greengenes, and RDP databases include primarily ribosomal sequences that are useful in identifying microbial species.

The largest BLAST query ran on 100K CPU cores, about 1/3 of the Titan, for about nine hours.  However, most BLAST queries used a few thousand to a few tens of thousands of CPU cores; typically these jobs ran for a few to several hours.  We saw near-linear scaling of HSP-BLAST in all runs, a very encouraging sign for this sequence alignment tool.

Species Identification

For the data analysis step, we imported the Titan BLAST sequence alignment results into MEGAN.  MEGAN assigns DNA reads to the NCBI taxonomy, which currently contains around 1M species.  With MEGAN you can drill up (or roll up) to the Kingdom phylogenetic level, or drill down to the species level.

The figure below shows the species-level results of the Titan BLAST runs on metagenomic water samples from four sample sites in Everglades National Park.


(MEGAN Phylogenetic Classification)

Several thousand species are represented by the left-hand vertical column, including archaea, prokaryotes, eukaryotes, microbial, viral, plant and animal species. The species list included hits to endangered, threatened and invasive species.  Species names are shown along the right-hand side.  Vertical colored bars show the number of DNA reads assigned to each species in each of the four samples.  For some species there are clear differences between the sample sites, whereas for others it’s fairly uniform.

Interestingly, near the bottom of the report there is a category of “Not assigned” DNA reads.  Despite the very large database query, a substantial number of reads were not assigned to any species.  This may be because those species are not represented in the NCBI Taxonomy database.  As a further check, we’re running de novo assembly on the unassigned reads to see if we can construct (partial) genes, genomes, or DNA sequences that match any known species.


The Titan Supercomputer is truly an extraordinary machine.  It allowed us to run large-scale BLAST queries of metagenomic NextGen sequencing data against 45M-record taxonomic databases.  Petaflop-scale runs with multi-quadrillion sequence alignments were the norm.  The Titan runs identified several thousand species in metagenomic water samples from the Florida Everglades, and yielded some intriguing “Unassigned” DNA reads for further analysis.



Next Generation Sequencing Core



(Cancer genomics CIRCOS diagram)

Promoting the advancement of genomics research.  In this blogpost I’ll briefly describe the Infectious Disease Research Center (IDRC) Next Generation Sequencing (NGS) Core at Colorado State University (CSU).  We formed the Core about 5 years ago to provide genomics services for University and external researchers.  The Core has evolved along with NGS itself, a field of research less than 10 years old.  And there are some exciting recent developments that substantially enhance the Core’s capabilities.


Currently, the Core manages four NextGen sequencers.  Some basic specs for each are listed here.  These are just averages because the specs can vary considerably depending on what type of sequencing run is chosen and the quality characteristics of an individual run.  (The term “reads” means “DNA fragments”, where the fragments range in length from a few tens to a few hundred nucleotide bases.  Gb = “gigabases”. M = “million reads”.  bp = “nucleotide base pairs”)


 Illumina NextSeq 500

Output: 120 Gb
Number of reads: 400 M
Ave. read length: 2 X 150 bp
Ave. run time: 10 – 30 hrs.



Illumina MiSeq

Output: 15 Gb
Number of reads: 25 M
Ave. read length: 2 X 150 bp
Ave. run time: 5 – 50 hrs.



 Ion Proton

Output: 10 Gb
Number of reads: 100 M
Ave. read length: 150 bp
Ave. run time: 15 – 20 hrs.



Ion Torrent (PGM)

Output: 1 Gb
Number of reads: 12 M
Ave. read length: 350 bp
Ave. run time: 2 – 7 hrs.

Laboratory Services

The Core lab provides standard NGS services including:

  • Sample quality control (Bioanalyzer)
  • Library prep (single-end, paired-end, mate-pair)
  • Template prep
  • Barcoding / multiplexing
  • qPCR target enrichment
  • DNA & RNA (cDNA) sequencing
  • Ribodepletion

The lab also runs custom NGS protocols where feasible.

Bioinformatics Services

The Core provides a set of bioinformatics pipelines for researchers who want to run their own NGS data analysis.  Pipelines are available for DNA-Seq, RNA-Seq, ChIP-Seq, Methyl-Seq, metagenomics, metatranscriptomics, and de novo assembly.  Today, there are no real “standard” pipelines for bioinformatics research.  Instead, there is a more-or-less community consensus about the key features and methods that should be present in any particular pipeline.  Our pipelines are built around these consensus methods, along with various changes suggested by the local research community.

There are several efforts underway to standardize NextGen Sequencing protocols, such as the ENCODE Consortium, FDA, HDx Reference Standards, Association of Biomolecular Resource Facilities, and others.  We’re still a long way from universally accepted, standardized protocols for NGS, but these efforts are a step in the right direction.  Where it makes sense, we incorporate some of the recommendations of the standards bodies into our own protocols.

For many researchers NGS is a new and possibly unfamiliar field.  The Core offers training services in bioinformatics methods, pipelines, and software applications.  A new generation of college students now has access to state-of-the-art NGS facilities, training, and educational material.

Here is a simplified flowchart of a typical DNA-Seq pipeline.


Similar, and more extensive, flowcharts are available for this and other pipelines.  Researchers can use the pipelines “as-is” or tweak them for their own particular needs.
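As a rough illustration of the steps such a flowchart covers, the alignment and variant-calling core of a DNA-Seq pipeline might look like this (tool choices and file names are examples, not the Core’s exact configuration):

```shell
# Illustrative DNA-Seq core: align reads, sort/index, call variants.
# Tool choices and file names are examples only.
bwa index ref.fa
bwa mem -t 12 ref.fa sample_1.fastq sample_2.fastq | \
    samtools sort -o sample.sorted.bam -
samtools index sample.sorted.bam

# Call SNPs and small indels against the reference
bcftools mpileup -f ref.fa sample.sorted.bam | \
    bcftools call -mv -o sample.variants.vcf
```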

During the 5-year period the Core has been in existence, we’ve offered services at the request of the research community.  Primarily this includes:

  • DNA-Seq
    • Whole genome sequencing
    • Targeted resequencing
    • SNPs (single nucleotide polymorphisms)
    • Structural rearrangements (insertions, deletions, inversions, translocations, copy number variation, etc.)
  • RNA-Seq
    • Whole transcriptome
    • Differential gene expression
    • Alternative splicing
    • smRNA, miRNA
  • Metagenomics
    • Genomics of species communities
    • Environmental samples
    • Taxonomic analysis
    • Functional analysis
    • Biochemical pathway analysis
  • Metatranscriptomics
    • Transcriptomics of species communities
    • Environmental samples
    • Functional analysis
    • Biochemical pathway analysis
  • De novo assembly
    • Genome assembly without reference genome
    • Transcriptome assembly without reference transcriptome

These services have been applied to a wide variety of species – archaea, prokaryotes, eukaryotes, plant and animal.  In the past few years the number of draft or fully assembled reference genomes has increased significantly for many species. We’ve seen a corresponding rise in requests for sequencing non-model organisms for which reference genomes did not exist even a few years ago.  This trend should continue in the future.

Computational Resources

One of the salient features of NextGen Sequencing is the massive amount of data generated by the sequencers.  This is no trivial issue, and one that will be exacerbated with new sequencers coming to market.  The Core provides data management services such as file storage, file archives, and IT support.

The Core has access to a local Cray XE6 supercomputer with significant bioinformatics capabilities.  The Cray is used primarily for metagenomic, metatranscriptomic, and de novo assembly work.  Sequence alignments, de novo assembly, and large-scale database queries are often the most computationally intensive steps in bioinformatics pipelines.  The Cray excels at running highly parallelized alignment and assembly algorithms, thus saving researchers a considerable amount of time in getting results.


The Core also manages numerous clusters, high-end servers, file servers, database servers, and archive servers to handle data management activities.  And the Core is exploiting various remote Cloud Computing services.  Google Cloud has proven to be a particularly cost effective solution for data storage and file archives.

Complementing the hardware is an extensive set of bioinformatics data analysis software, too numerous to list here.  A real challenge in the bioinformatics field is the sheer number and variety of software applications and databases that one needs to consider, so we’re continuously evaluating and adding new software and database resources to the Core.

Shameless Plug

If you’re interested in NextGen Sequencing and want more information about the CSU NGS Core, contact us at 970-491-8829 or



Computational Ribodepletion




Yeast 80S rRNA.

Ribosomal RNA (rRNA) is an unloved molecule, at least in the Next Generation Sequencing (NGS) crowd.  Before sequencing and after sequencing we make heroic efforts to remove as much rRNA from samples as possible.  For NGS studies, rRNA is often viewed as a contaminant, and its elimination from NGS samples and data analysis is a primary concern for researchers.

Prior to sequencing, we use a variety of molecular ribodepletion methods to remove rRNA from samples. Typically, these methods remove 75% – 95% of rRNA from the initial samples.  We remove rRNA so as to expose the remaining RNA species – mRNA, non-coding RNA’s, small RNA’s, and so forth – for library prep, PCR, and sequencing itself.

After sequencing, there is almost always some remaining rRNA to deal with in the resulting datasets. If there is no further post-processing to remove rRNA, then rRNA reads (cDNA fragments) will be included in any downstream sequence alignment or de novo assembly steps, something we would like to avoid.

In our bioinformatics and data analysis pipelines, we now include a computational ribodepletion step to filter out any remaining, post-sequencing ribosomal RNA.  We’ve been evaluating and testing a relatively new tool, SortMeRNA, for this ribodepletion step.  A brief description of our experience with it follows.


Check the SortMeRNA home page for binary, source code, and GitHub versions of the code.  Documentation, a paper, and some supplementary material are available. SortMeRNA can process single-end reads and paired-end/mate-pair reads.

The following rRNA reference databases are included in the current release of SortMeRNA:

  • Rfam 5S
  • Rfam 5.8S
  • Silva 16S archaeal, bacterial
  • Silva 18S eukaryote
  • Silva 23S archaeal, bacterial
  • Silva 28S eukaryote

Rfam and Silva are well-known, established resources for ribosomal sequence data. You can choose any or all of these databases in your filtering step, or add additional custom rRNA databases of your own choosing.  In the v.2.0 release, there are 105,560 rRNA sequences represented in the Rfam/Silva databases.

In SortMeRNA you can adjust the Blast E-value to set the stringency of sequence alignments.  We found the default value of E = 1 to be far too lax for any meaningful results.  Thus, we typically use E = 1e-20 instead, which is an appropriate value for the relatively small databases included with the application.  (See Blast by Ian Korf et al., 2003, for an extensive discussion on the choice of Expect-values.)
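For intuition on why the default cutoff is too lax: in rough Karlin–Altschul form, E ≈ m·n·2^(−S), so the expected number of chance hits grows linearly with the search space. A sketch of that scaling (the read length m, bit score S, and the omitted constant K are all assumed illustrative values, not SortMeRNA internals):

```shell
# E ~ m*n*2^(-S): same alignment quality (bit score S), growing database size n.
# exp(-S*log(2)) is used instead of 2^(-S) for awk portability.
awk 'BEGIN {
  S = 40; m = 100                          # assumed bit score and read length
  for (n = 1e6; n <= 1e10; n *= 100)
    printf "db size %.0e -> E = %.2e\n", n, m * n * exp(-S * log(2))
}'
```

With a small n, even modest alignments clear E = 1, so the default cutoff admits weak matches against these compact rRNA databases; tightening to 1e-20 keeps only strong ones.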


We typically run SortMeRNA on a local Cray supercomputer as a part of larger bioinformatics pipelines.  In case you’re interested, a Cray PBS batch script looks something like this:

#PBS -N sortmerna
#PBS -l mppwidth=8
#PBS -j oe
#PBS -q medium


export EVALUE=1e-20
export DBPATH=/apps/sortmerna-2.0/rRNA_databases

# run sortmerna
time aprun -n1 sortmerna -e $EVALUE \
--ref $DBPATH/rfam-5.8s-database-id98.fasta,$DBPATH/rfam-5.8s-database-id98.idx \
--reads   reads.fasta \
--aligned matched-rfam-5.8s \
--other   rejected-rfam-5.8s \
--blast 1 --fastx --log &

time aprun -n1 sortmerna -e $EVALUE \
--ref $DBPATH/rfam-5s-database-id98.fasta,$DBPATH/rfam-5s-database-id98.idx \
--reads   reads.fasta \
--aligned matched-rfam-5s \
--other   rejected-rfam-5s \
--blast 1 --fastx --log &

time aprun -n1 sortmerna -e $EVALUE \
--ref $DBPATH/silva-arc-16s-id95.fasta,$DBPATH/silva-arc-16s-id95.idx \
--reads   reads.fasta \
--aligned matched-silva-arc-16s \
--other   rejected-silva-arc-16s \
--blast 1 --fastx --log &

time aprun -n1 sortmerna -e $EVALUE \
--ref $DBPATH/silva-arc-23s-id98.fasta,$DBPATH/silva-arc-23s-id98.idx \
--reads   reads.fasta \
--aligned matched-silva-arc-23s \
--other   rejected-silva-arc-23s \
--blast 1 --fastx --log &

time aprun -n1 sortmerna -e $EVALUE \
--ref $DBPATH/silva-bac-16s-id90.fasta,$DBPATH/silva-bac-16s-id90.idx \
--reads   reads.fasta \
--aligned matched-silva-bac-16s \
--other   rejected-silva-bac-16s \
--blast 1 --fastx --log &

time aprun -n1 sortmerna -e $EVALUE \
--ref $DBPATH/silva-bac-23s-id98.fasta,$DBPATH/silva-bac-23s-id98.idx \
--reads   reads.fasta \
--aligned matched-silva-bac-23s \
--other   rejected-silva-bac-23s \
--blast 1 --fastx --log &

time aprun -n1 sortmerna -e $EVALUE \
--ref $DBPATH/silva-euk-18s-id95.fasta,$DBPATH/silva-euk-18s-id95.idx \
--reads   reads.fasta \
--aligned matched-silva-euk-18s \
--other   rejected-silva-euk-18s \
--blast 1 --fastx --log &

time aprun -n1 sortmerna -e $EVALUE \
--ref $DBPATH/silva-euk-28s-id98.fasta,$DBPATH/silva-euk-28s-id98.idx \
--reads   reads.fasta \
--aligned matched-silva-euk-28s \
--other   rejected-silva-euk-28s \
--blast 1 --fastx --log &


The script simply launches and runs eight SortMeRNA jobs, one for each rRNA database, in parallel on eight separate threads/cores.  This type of job level parallelism is often called “ensemble computing”.  It completes the rRNA filtering step about 8X faster than an equivalent set of serial jobs.  A similar job script should work on most clusters, saving a considerable amount of run time.
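Stripped of the Cray specifics, the ensemble pattern is just shell background jobs plus `wait`; a minimal sketch, where `process_db` is a hypothetical stand-in for the real sortmerna invocation:

```shell
# One background job per rRNA database; wait blocks until all of them finish.
process_db() {
  echo "filtering against $1"    # placeholder for: sortmerna --ref "$1" ...
}

for db in rfam-5.8s rfam-5s silva-arc-16s silva-arc-23s \
          silva-bac-16s silva-bac-23s silva-euk-18s silva-euk-28s; do
  process_db "$db" &             # launch in the background, one job per core
done
wait                             # rejoin before any downstream pipeline step
echo "all databases filtered"
```

The `wait` is what makes this safe inside a larger pipeline: no downstream step starts until every filtering job has completed.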

Metatranscriptome Pipeline Update

In a previous blogpost I described a pipeline for the analysis of metatranscriptome data.  This earlier version of the pipeline did not include a post-sequencing ribodepletion step.  We have now added this step to the pipeline, and in fact will be including it in several other bioinformatics workflows.

Here is an updated version of the metatranscriptomics pipeline including computational ribodepletion.  Computationally demanding steps are highlighted in pink.


The main steps in the pipeline are:

  1. Install all software components
  2. Gather FASTQ files of cDNA reads from NextGen sequencers
  3. If the sequencer run used paired-end or mate-pair protocol, combine left and right reads in separate files; otherwise continue for single-end reads
  4. Computational ribodepletion using SortMeRNA
  5. Assemble reads into transcripts using Trinity RNA-Seq application
  6. Check statistical metrics from assembly step
  7. Estimate transcript abundance with RSEM application
  8. Identify significantly differentially expressed genes with edgeR
  9. Perform TMM (trimmed mean of M-values) normalization
  10. Perform hierarchical clustering of transcripts

The new post-sequencing ribodepletion step is shown as a computationally intensive step, along with the Trinity de novo assembly and Bowtie sequence alignment steps. We were seeing seven-hour run times for ribodepletion on several projects.  The ribodepletion step will become another bottleneck in the pipeline with the massive datasets coming from new NextGen sequencers.


SortMeRNA generates two filetypes – matched and rejected – in FASTA and BLAST formats.  Matched records are sample reads that show significant alignments (low E-values) to rRNA database entries.  Rejected records are sample reads that do not align to rRNA sequences.  You can use matched reads to filter out rRNA sequences from post-sequencing datasets.  After filtering, the remaining reads should be clear of rRNA sequences, which improves downstream data analysis.
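Because matched and rejected reads land in separate FASTA files, the per-database rRNA percentage is just a ratio of header counts. A toy sketch with fabricated six-read stand-in files (real SortMeRNA output names follow the --aligned/--other arguments):

```shell
# Fabricated stand-ins for SortMeRNA's matched/rejected FASTA output files
printf '>r1\nACGT\n>r2\nGGCC\n' > matched.fasta
printf '>r3\nTTAA\n>r4\nCCGG\n>r5\nAATT\n>r6\nGCGC\n' > rejected.fasta

matched=$(grep -c '^>' matched.fasta)     # reads aligned to rRNA
rejected=$(grep -c '^>' rejected.fasta)   # reads kept for downstream analysis
awk -v m="$matched" -v r="$rejected" \
    'BEGIN { printf "rRNA reads: %d/%d (%.2f%%)\n", m, m + r, 100 * m / (m + r) }'
```

Here the toy files give 2 of 6 reads (33.33%) flagged as rRNA; the rejected file is what proceeds to assembly and alignment.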

In one of our metatranscriptomics projects, we included the SortMeRNA computational ribodepletion step.  The results for four samples are shown here.  The first column represents each of the eight rRNA databases; the second column shows the percentage match of sample reads to ribosomal RNA sequences.

Sample 1

rfam-5.8s:     0.05%
rfam-5s:       0.01%
silva-arc-16s: 1.56%
silva-arc-23s: 6.39%
silva-bac-16s: 1.09%
silva-bac-23s: 2.70%
silva-euk-18s: 0.02%
silva-euk-28s: 0.02%

Sample 2

rfam-5.8s:     0.07%
rfam-5s:       0.52%
silva-arc-16s: 2.14%
silva-arc-23s: 7.16%
silva-bac-16s: 2.68%
silva-euk-18s: 0.43%
silva-euk-28s: 0.16%

Sample 3

rfam-5.8s:      0.29%
rfam-5s:        1.84%
silva-arc-16s:  0.01%
silva-arc-23s: 13.47%
silva-bac-23s: 40.37%
silva-euk-18s:  0.02%
silva-euk-28s:  0.07%

Sample 4

rfam-5.8s:      0.38%
rfam-5s:        7.46%
silva-arc-16s:  0.05%
silva-arc-23s:  4.56%
silva-bac-23s: 21.47%
silva-euk-18s:  0.51%
silva-euk-28s:  0.07%

Some notable observations:

  • Samples 1 and 2 show modest but still significant levels of rRNA remaining in each sample after sequencing.  Samples 3 and 4 show substantial levels of rRNA still left after sequencing.  Apparently pre-sequencing ribodepletion methods were more effective for Samples 1 and 2 but less so for Samples 3 and 4.
  • Apparently, if SortMeRNA finds no sequence matches to rRNA, i.e. 0.00%, it does not enter a record in its log files.  Sample 1 shows eight matches, one to each database;  Samples 2, 3, 4 show seven matches, with one missing entry in each sample.  We would prefer to see all eight entries with a 0.00% tag if there are no matches.
  • Regardless of the sample, it’s clear that post-sequencing computational ribodepletion is an important step in the bioinformatics data analysis pipeline.  We see a significant number of hits to rRNA sequences, which we now filter out before proceeding with data analysis.
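The missing-entry quirk is easy to work around when summarizing: iterate over all eight database names and default to 0.00 when a database is absent from the parsed log. A sketch against a fabricated two-line summary file (the actual SortMeRNA log format differs and would need its own parsing step):

```shell
# Fabricated per-database summary, as might be parsed out of a SortMeRNA log
cat > sample.txt <<'EOF'
rfam-5.8s 0.07
rfam-5s 0.52
EOF

for db in rfam-5.8s rfam-5s silva-arc-16s silva-arc-23s \
          silva-bac-16s silva-bac-23s silva-euk-18s silva-euk-28s; do
  pct=$(awk -v d="$db" '$1 == d { print $2 }' sample.txt)
  printf '%s: %s%%\n' "$db" "${pct:-0.00}"   # default to 0.00% when absent
done
```

This guarantees all eight databases appear in every per-sample report, which makes the samples directly comparable.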

Based on our experiences so far with SortMeRNA, we believe it is an effective quality control tool for NGS research.  We’ll be testing it in other projects and bioinformatics pipelines.



Metatranscriptomics Pipeline




(Figure from Marchetti et al. 2012)

Work at the bleeding edge of Next Generation Sequencing.  Metatranscriptomics is the study of RNA expression and regulation in communities of organisms.  It extends RNA-Seq from individual species to communities of species.  As you might imagine this significantly compounds the difficulty of data analysis; handling one species is challenging enough, let alone tens to potentially hundreds (thousands?) of species.  Metatranscriptomics for NextGen sequencing has been developing steadily over the past few years but there are relatively few studies and papers available in this area.  It’s not a stretch to say that with metatranscriptomics we’re working at the leading edge of NextGen sequencing technology.  It’s exciting and challenging at the same time.

Recently we were asked to run NextGen sequencing on samples of bacterial/viral populations and to develop metatranscriptomics pipelines for data analysis of the results.  We ran the samples on a Life Technologies Ion Proton sequencer, which yielded about 30M cDNA reads per sample for transcript analysis. This is a brief discussion of our experience with the data analysis part.

Haas Distro

We developed a metatranscriptomics pipeline derived in part from a paper by Haas et al. The required software applications are listed in the paper.  To get the pipeline to work, you’ll want to use the versions listed here.  We discovered irreconcilable incompatibilities when we tried mixing versions other than those shown below.

After installing the R statistical package, you’ll need to install several Bioconductor packages (plus gplots from CRAN) within an R session, using something like the following commands:

% R
> source("")
> biocLite()
> biocLite('edgeR')
> biocLite('ctc')
> biocLite('Biobase')
> biocLite('ape')
> install.packages('gplots')


With everything installed, you’ll need some fairly hefty computational resources to handle the lengthy sequence alignment and de novo assembly steps.  We used Cray XT6m and Cray XE6 supercomputers for this, but any Linux cluster with an adequate number of CPU cores and RAM should suffice. (The compute resources required by the pipeline could increase substantially in the near future, with new NextGen sequencers pumping out 10^9 reads.)

Metatranscriptome Pipeline

Here is a pipeline flowchart.  Computationally intensive steps are shown in pink.



The main steps in the pipeline are:

  1. Install all software components
  2. Gather FASTQ files of cDNA reads from NextGen sequencers
  3. If the sequencer run used paired-end or mate-pair protocol, combine left and right reads in separate files; otherwise continue for single-end reads
  4. Assemble reads into transcripts using Trinity RNA-Seq application
  5. Check statistical metrics from assembly step
  6. Estimate transcript abundance with RSEM application
  7. Identify significantly differentially expressed genes with edgeR
  8. Perform TMM (trimmed mean of M-values) normalization
  9. Perform hierarchical clustering of transcripts

Paired-end or mate-pair reads are preferable for the Trinity de novo assembly step, but single-end reads will also yield reasonably good transcripts.  The statistical analysis step will give some guidance on how good the assembly step was.  RSEM yields estimates of transcript abundance and gene expression levels, which are then used in edgeR to determine significantly differentially expressed genes.  The normalization step is important to adjust expression levels for differences in sequencing depth (number of reads) across samples.  Hierarchical clustering of transcripts is an optional step if you want to group sets of transcripts with similar expression levels.
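To see what the depth-normalization step buys you, consider counts-per-million (CPM), the simplest depth adjustment; TMM goes further by trimming extreme fold changes before computing scaling factors, but the motivation is the same. A toy sketch in which sample B was simply sequenced 3X deeper than sample A:

```shell
# Toy counts table: gene, counts in sample A, counts in sample B (3X deeper)
cat > counts.tsv <<'EOF'
g1 100 300
g2 900 2700
EOF

A=$(awk '{ s += $2 } END { print s }' counts.tsv)   # library size of A: 1000
B=$(awk '{ s += $3 } END { print s }' counts.tsv)   # library size of B: 3000
# CPM = 1e6 * count / library size; after scaling by depth, the samples agree
awk -v A="$A" -v B="$B" \
    '{ printf "%s %.0f %.0f\n", $1, 1e6 * $2 / A, 1e6 * $3 / B }' counts.tsv
```

Both genes come out at identical CPM across the two samples, so the apparent 3X "up-regulation" in the raw counts disappears once sequencing depth is accounted for.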

The two computationally intensive steps involve Trinity and Bowtie.  There are parallelized versions of both applications.  We encourage continued efforts to parallelize and optimize these two applications.   These steps will become real bottlenecks in the pipeline with the massive datasets now coming from NextGen sequencers.


The metatranscriptomics pipeline yields several standard RNA-Seq plots. (Figures below are from Haas et al. 2013).


MA-plots show RNA transcript fold change as a function of transcript abundance.  Significantly differentially expressed genes (p < 0.05) are shown in red.  Up-regulated genes appear at y > 0; down-regulated genes at y < 0.  High-abundance transcripts appear toward the right of the x-axis; low-abundance transcripts toward the left.  Housekeeping genes are mostly shown in black.  In whole-transcriptome research on single species, MA-plots show transcripts just for that individual species.  For metatranscriptome studies, however, MA-plots show transcripts from all species constituting the original samples.  Clearly, in this experiment, there were many significantly up-regulated and down-regulated genes.
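The two MA-plot axes are cheap to compute per transcript: M = log2(b/a), the fold change on the y-axis, and A = (1/2)·log2(a·b), the mean log abundance on the x-axis. A sketch with assumed normalized counts of 50 and 200 in the two conditions:

```shell
# M (fold change) and A (abundance) for one transcript with counts a and b
awk 'BEGIN {
  a = 50; b = 200                      # assumed normalized counts
  M = log(b / a) / log(2)              # y-axis: log2 fold change (4-fold up)
  A = 0.5 * log(a * b) / log(2)        # x-axis: mean log2 abundance
  printf "M=%.2f A=%.2f\n", M, A
}'
```

This transcript would sit at A = 6.64, M = 2.00 on the plot: moderately abundant and 4-fold up-regulated, colored red if its p-value cleared the cutoff.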


Volcano plots show statistical significance levels (false discovery rates (FDR); p-values) as a function of fold change.  Significantly differentially expressed genes with FDR < 0.1% are shown in red.  For metatranscriptomic studies, volcano plots show highly expressed genes from multiple species in the original samples.


Several issues became apparent while developing the pipeline.

  • What species are in these samples? Whenever you’re handling communities of species, which is the case for any of the “meta-” studies – metagenomics, metatranscriptomics, metaproteomics – the species composition of the samples is an open question. NextGen researchers are often surprised at the extreme diversity of species found in their samples. Typically, researchers are interested in a handful of target species and consider everything else to be “contaminating” species or of secondary interest. However, the metatranscriptomics pipeline is non-discriminatory and will process all the RNA (cDNA) from all species in the samples. This is an important distinction from more traditional RNA-Seq research.
  • Alternative splicing.  In the study mentioned here we believe the majority of species were of prokaryotic bacterial and viral origin.  Alternative splicing and isoform detection was thus not much of an issue.  However, in other studies, there may be a substantial number of eukaryotic species present in the samples.  Alternative splicing then becomes a real issue, especially for the de novo assembly step.  Trinity apparently detects alternatively spliced isoforms for single species samples, but how well does it perform with multi-species samples, especially when there are closely related species in the samples?  This is probably an open question that needs more research and validation.
  • Ribosomal RNA.  rRNA often constitutes >90% of sample RNA.  We use various pre-sequencing ribodepletion methods to eliminate as much rRNA as possible.  Nevertheless, there is almost always some remaining post-sequencing rRNA.  In data analysis these are usually easy to identify as they show huge RPKM values, and BLAST’ing those reads against rRNA databases reveal their identity.  In the pipeline described here we had minimal post-sequencing rRNA removal, so rRNA assembly was included in the analysis.  In a (near) future blog I’ll describe some computational methods under development for post-sequencing rRNA extraction.  We’ll update the pipeline with these new methods.



NVBIO – GPU Accelerated Bioinformatics Library




Bioinformatics applications have a need for speed.  For example, Next Generation Sequencing (NGS) platforms from Illumina and ThermoFisher/LifeTechnologies can generate massive datasets, requiring substantial computing horsepower for data analysis.  CPU-based systems are often incapable of handling these datasets.  Newer, highly-parallel GPU-based systems can provide the necessary compute power for NGS data analysis.  However, traditional CPU-based bioinformatics algorithms must be ported to and optimized for execution on GPU’s.  Enter nvBio – a new GPU-accelerated bioinformatics library developed by Nvidia for NGS and related Life Sciences applications.


nvBio is available as open-source software on GitHub.  You’ll need an Nvidia GPU card at Compute Capability 3.5 or greater, which at the time of this writing includes the Kepler and Maxwell architectures.  At the 2014 GPU Technology Conference (GTC), the nvBio software engineers presented an informative seminar and slideset about the library.  If you want to be more engaged with library developments, you can join the Google group nvbio-users.  Documentation is available for NVBIO.  NVBIO is distributed under GPLv2 licensing, so you’ll be able to copy, modify, and distribute it at will.

There are several requirements and dependencies necessary to install the library:

  • CMake. You’ll need the CMake Cross Platform Makefile Generator to generate Makefiles and build executables. Using CMake is beyond the scope of this article but the CMake documentation should help you get started.
  • Thrust. You’ll need the Thrust library, which is presently not included in the NVBIO requirements list, possibly an oversight by the developers.  If you install a recent version of the CUDA Toolkit, it includes Thrust. CMake should be able to find the required Thrust include files assuming Thrust standalone libraries or CUDA Toolkit are properly installed.
  • CUB libraries
  • zlib libraries.  Usually included in Linux OS’s but you may want to install the most current version.
  • CRC libraries

Hopefully, as NVBIO matures, it will include a more comprehensive package manager to better integrate all the dependencies.


NVBIO is designed as a GPU-enabled C++ framework including numerous template, utility, and primitive data structures and algorithms.

  • Algorithms: Burrows-Wheeler Transform, FM-index, Smith-Waterman, Needleman-Wunsch, Gotoh, banded/full dynamic programming, branch and bound search, etc.
  • Text indices & text search: FM-index, suffix tree, radix tree, sorted dictionary, exact string search, approximate string search, backtracking, etc.
  • I/O: Sequence I/O, alignment I/O.  Handled by the host CPU.

Based on the framework code, NVBIO includes several GPU-enabled applications:

Multi-GPU capability was recently added to nvBowtie but expect single-GPU capability only for the rest of NVBIO.  The library includes CPU-only as well as GPU implementations, thus giving you the option of running the library and applications on CPU’s only or on GPU’s.  This could be useful for testing code and making performance comparisons.


A variety of NVBIO performance benchmarks are available.  In the following examples NVBIO routines were run on Intel i7 3.2 GHz CPUs versus a single Nvidia K20 or K40 GPU.  (Note that the new K80 GPU was recently announced, about 2X faster than the K40.)


In this benchmark, BWT-SW and Bowtie2 were compared to nvBWT on human reference genome build hg19.  nvBWT was about 8X and 21X faster than BWT-SW and Bowtie2, respectively.  The metric is time in “hh:mm:ss” format.


In this string search benchmark, both exact string match and approximate string match were run against hg19 with reads derived from the Sequence Read Archive. GPU-enabled exact and approximate matches were about 6X and 4X faster than CPU-only code, respectively. The metric is million queries per second.


In this benchmark, GPU-enabled Smith-Waterman code was compared to CPU-only code.  GPU-enabled Edit Distance code was available but there were no comparable CPU reference metrics; the GPU results are shown anyway to get a sense of performance.  K20 and K40 GPU models were tested, including the K40 in GPU-Boost mode, basically a form of overclocking.  For Smith-Waterman the GPU-enabled code was about 3.5X, 4X, and 5X faster than CPU-only code.  For Edit Distance the GPU-enabled code ran in the range 113 – 162 GCUPS. The metric is billion cell updates per second.

Considering that NVBIO is a new release product at version 0.9, these are impressive performance metrics.  With continued software development, optimization, and multi-GPU capability we can expect the speedup to improve in future releases.

Sample Pipeline

To demonstrate how to use NVBIO, the developers offer a sample pipeline for the nvBowtie short-read aligner.  The pipeline data flow consists of several steps:

  • Map.  DNA or cDNA reads are split into short “seeds” and mapped against a reference genome using FM-index.  Mapped seeds are stored as Suffix Array coordinates.  Mapping options include both exact string matches and approximate matches.
  • Locate.  Suffix Array coordinates are converted to linear genomic coordinates.
  • Select.  Select mapped seeds for seed extension and new scoring rounds.
  • Score.  Score mapped extended seeds with respect to the reference genome.
  • Reduce.  Find best alignments of mapped extended seeds.
  • Traceback.  Perform Smith-Waterman traceback to generate optimal aligned sequences.

In a manner similar to the CPU-only version of Bowtie2, you can run the pipeline by first generating a BWT index of the reference genome and then execute the short-read alignment code.  Suppose you have human reference genome build hg19 in FASTA format and a FASTQ reads dataset.  Then run:

> nvBWT hg19.fasta hg19-index
> nvBowtie --file-ref hg19 reads.fastq reads.bam

to generate a BAM output file for further analysis.

We’ll keep track of NVBIO as it develops and provide update blogs as appropriate.


CUDA Unified Memory



CUDA Unified Memory

CUDA is the language of Nvidia GPUs.  To extract maximum performance from GPUs, you’ll want to develop applications in CUDA.

CUDA Toolkit is the primary IDE (integrated development environment) for developing CUDA-enabled applications.  The main roles of the Toolkit IDE are to simplify the software development process, maximize software developer productivity, and provide features that enhance GPU performance.  The Toolkit has been steadily evolving in tandem with GPU hardware and currently sits at Version 6.5.

One of the most important features of CUDA 6.5 is Unified Memory (UM).  (UM was actually first introduced in CUDA v.6.0).  CPU host memory and GPU device memory are physically separate entities, connected by a relatively slow PCIe bus.  Prior to v.6.0, data elements shared in both CPU and GPU memory required two copies – one copy in CPU memory and one copy in GPU memory.  Developers had to allocate memory on the CPU, allocate memory on the GPU, and then copy data from CPU to GPU and from GPU to CPU.  This dual data management scheme added complexity to programs, opportunities for the introduction of software bugs, and an excessive focus of time and energy on data management tasks.

UM corrects this.  UM creates a memory pool that is shared between CPU and GPU, with a single memory address space and single pointers accessible to both host and device code.  The CUDA driver and runtime libraries automatically handle data transfers between host and device memory, thus relieving developers from the burden of explicitly managing those data transfers.  UM improves performance by automatically providing data locality on the CPU or GPU, wherever it might be required by the application algorithm.  UM also guarantees global coherency of data on host and device, thus reducing the introduction of software bugs.

Let’s explore some sample code that illustrates these concepts.  We won’t concern ourselves with the function of this algorithm; instead, we’ll just focus on the syntax. (Credit to Nvidia for this C/CUDA template example).

Without Unified Memory

#include <string.h>
#include <stdio.h>

struct DataElement {
  char *name;
  int value;
};

__global__ void Kernel(DataElement *elem) {
  printf("On device: name=%s, value=%d\n", elem->name, elem->value);
  elem->name[0] = 'd';
}

void launch(DataElement *elem) {
  DataElement *d_elem;
  char *d_name;

  int namelen = strlen(elem->name) + 1;

  // Allocate memory on GPU
  cudaMalloc(&d_elem, sizeof(DataElement));
  cudaMalloc(&d_name, namelen);

  // Copy data from CPU to GPU
  cudaMemcpy(d_elem, elem, sizeof(DataElement), cudaMemcpyHostToDevice);
  cudaMemcpy(d_name, elem->name, namelen, cudaMemcpyHostToDevice);
  cudaMemcpy(&(d_elem->name), &d_name, sizeof(char*), cudaMemcpyHostToDevice);

  // Launch kernel
  Kernel<<< 1, 1 >>>(d_elem);

  // Copy data from GPU to CPU
  cudaMemcpy(&(elem->value), &(d_elem->value), sizeof(int), cudaMemcpyDeviceToHost);
  cudaMemcpy(elem->name, d_name, namelen, cudaMemcpyDeviceToHost);

  cudaFree(d_name);
  cudaFree(d_elem);
}

int main(void) {
  DataElement *e;

  // Allocate memory on CPU
  e = (DataElement*)malloc(sizeof(DataElement));
  e->value = 10;

  // Allocate memory on CPU
  e->name = (char*)malloc(sizeof(char) * (strlen("hello") + 1));
  strcpy(e->name, "hello");

  launch(e);

  printf("On host: name=%s, value=%d\n", e->name, e->value);

  free(e->name);
  free(e);

  return 0;
}

Note these key points:

  • The malloc() calls in main(): allocate memory on CPU
  • The cudaMalloc() calls in launch(): allocate memory on GPU
  • The host-to-device cudaMemcpy() calls: copy data from CPU to GPU
  • Kernel<<< 1, 1 >>>: run kernel
  • The device-to-host cudaMemcpy() calls: copy data from GPU to CPU

With Unified Memory 

#include <string.h>
#include <stdio.h>

struct DataElement {
  char *name;
  int value;
};

__global__ void Kernel(DataElement *elem) {
  printf("On device: name=%s, value=%d\n", elem->name, elem->value);
  elem->name[0] = 'd';
}

void launch(DataElement *elem) {
  // Launch kernel; synchronize before the host touches managed memory again
  Kernel<<< 1, 1 >>>(elem);
  cudaDeviceSynchronize();
}

int main(void) {
  DataElement *e;

  // Allocate unified memory, accessible from both CPU and GPU
  cudaMallocManaged((void**)&e, sizeof(DataElement));
  e->value = 10;

  // Allocate unified memory for the string
  cudaMallocManaged((void**)&(e->name), sizeof(char) * (strlen("hello") + 1));
  strcpy(e->name, "hello");

  launch(e);

  printf("On host: name=%s, value=%d\n", e->name, e->value);

  cudaFree(e->name);
  cudaFree(e);

  return 0;
}


Note these key points:

  • The cudaMallocManaged() calls in main(): allocate unified memory on CPU and GPU
  • Kernel<<< 1, 1 >>> in launch(): run kernel

With UM, memory is allocated on the CPU and GPU in a single address space and managed with a single pointer.  Note how the “malloc’s” and “cudaMalloc’s” are condensed into single calls to cudaMallocManaged().  Furthermore, explicit cudaMemcpy() data transfers between CPU and GPU are eliminated, as the CUDA runtime handles these transfers automatically in the background. Collectively these actions simplify code development, code maintenance, and data management.

As software project managers, we like UM for the productivity enhancements it provides for our software development teams.  It improves software quality, reduces coding time, effort, and cost, and enhances overall performance. As software engineers, we like UM because of the reduced coding effort and the fact that we can focus time and effort on writing CUDA kernel code, where all the parallel performance comes from, instead of spending time on memory management tasks.  Unified Memory is a major step forward in GPU programming.
