Each human being has over 6 billion base pairs (ie, 6 gigabases) of DNA in each diploid cell in their body. Three gigabases are inherited from our mother and 3 from our father. Contained within this DNA is the unique signature of each individual. The Human Genome Project was begun in 1990 in order to better understand the genetic basis of disease and to derive a draft sequence of the human genome with high accuracy (~99.99%).1 However, at that time, DNA sequencing was a time-consuming, labor-intensive, and expensive endeavor, and one of the project’s first priorities was to develop technologies that could completely decode or “sequence” the DNA of an individual in a more cost-effective manner.
In order to sequence a 6 gigabase genome, several convergent technologies needed to be developed. The first, and most critical, component was an automated DNA sequencer that utilized reusable capillaries for processing 96 samples per 3-hour run: the ABI3700 automated sequencer.2 This instrument was state-of-the-art for the time, and the sequencing effort relied on hundreds of these automated fluorescent DNA sequencers, which utilized Sanger sequencing chemistry coupled with high-resolution capillary electrophoresis. Each instrument was capable of generating 650,000 base pairs of DNA sequence per day. The second necessary component was a powerful computer capable of properly positioning each of the 768 reads relative to each other such that a total ordered sequence could be deduced.
The first draft sequences of a human genome were completed in 2001 at a total cost of just under $3 billion. The generation of a draft sequence for an individual provided information on the surprisingly small number of human genes (approximately 20,000), their location, and the curious fact that only 2% of our genome actually codes for protein.3,4 It also provided instantaneous access to the majority of genomic regions from all individuals, which further facilitated the rapid identification of disease-related genes.
However, these first machines were far too expensive and time consuming to routinely resequence entire genomes. They were perfectly suited for investigating individual genes and, thus, spawned the development of a number of molecular diagnostic tests that are routinely used in laboratory medicine today for clinical diagnosis in a variety of fields including molecular genetics, pharmacogenetics, hematopathology, anatomic pathology, and microbiology. In the area of molecular genetics, most sequencing-based tests are done on a single gene or just a few genes, requiring significant effort. These assays are still sufficiently complicated that they remain esoteric by nature and limited in scale.
If DNA sequencing is to transform clinical practice, dramatic improvements need to be made in order to bring down the cost of the tests, while not compromising the quality of the data. Recently, a new generation of DNA sequencers has been developed that utilize massively parallel DNA sequencing. In these methods, instead of performing individual sequencing reactions on polymerase chain reaction (PCR)-amplified or cloned fragments of DNA and analyzing them 96 at a time, many sequencing reactions—hundreds of thousands to millions of reactions—are all performed and analyzed simultaneously, at a cost per base far less than what was possible with the fluorescent automated sequencers used on the first human genome project. Today, just 1 of these machines running for under 2 weeks can completely sequence the genome of an individual at a cost of only $10,000. The ability to quickly generate multiple gigabases of DNA sequence at a low cost is expected to completely transform clinical practice and is certain to transform the way we perform many tests within laboratory medicine and pathology. In this article, we review the development of this new generation of massively parallel DNA sequencers and describe some future clinical applications for this technology. Ultimately, high-throughput technology like this will herald the era of individualized medicine.
After the completion of the first draft human DNA sequence using automated capillary electrophoresis, a number of different sequencing platforms began to emerge. These were dramatically different from each other in their sequencing chemistries, but all shared the strength of utilizing massively parallel DNA sequencing to dramatically increase the sequencing output of a single automated sequencing machine. These instruments are now termed Next Generation (or NextGen) sequencers.
The First Next Generation Instruments
The first of these Next Generation DNA sequencers was developed by a company called 454 (now part of Roche Applied Science). Their platform was based on the creation of a specialized picotitre plate made by slicing a fiber-optic cable and coating the bottom of it to produce a plate that contained over 1 million individual wells for discrete picoliter-scale sequencing reactions. In this novel method, individual DNA molecules are amplified on tiny beads in individual lipid droplets, all part of a larger aqueous-lipid emulsion (emulsion PCR). The beads with amplified DNA on them are isolated and deposited by centrifugation into the picotitre plates. Reagents then flow over the picotitre plate and new DNA is synthesized using the amplified DNA as a template. The type of sequencing reaction that is performed on the bead-bound PCR product is termed pyrosequencing. In this type of sequencing chemistry, a flash of light is generated when a base is added to the growing DNA strand by DNA polymerase. A sensitive camera records the flashes of light. Since the target DNA is confined to a specific well in the picotitre plate, and as the bases are added 1 at a time, it is straightforward to correlate the bases being added at a given time to a flash of light at a given place. This is then repeated over and over to generate a pattern of light emission in each well position. A computer then reads the camera images and converts the data to specific sequence information (Figure 1).
Figure 1. 454 sequencing technology. 454 Sequencing © 2010 Roche Diagnostics, North America. Used with permission.
The original 454 sequencer could generate up to 100,000 reads at 200 bases per read, for a total of 20 megabases per 12-hour run. The relatively long reads (as compared to other Next Generation methods) provided by this platform made sequence alignments straightforward and similar to traditional Sanger sequencing with capillary electrophoresis automated DNA sequencers. There have been dramatic improvements to the 454 sequencing platform since the launch of the instrument in 2007. These include increasing individual read lengths to 500 base pairs, and picotitre plates with a larger number of wells. The new instrument, the Genome Sequencer FLX (GS FLX) Titanium, can produce 500 million base pairs of DNA sequence in a single 8-hour run. This machine reduces the reagent cost of generating a single human genome sequence to $1 million.
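The throughput figures above can be checked with a few lines of arithmetic. The following is a minimal sketch, not a vendor calculation: the function names are hypothetical, the 3-gigabase haploid genome size comes from the introduction, and the requested depth of coverage is an illustrative assumption.

```python
def bases_per_run(reads_per_run, bases_per_read):
    """Total sequence output of one run, in bases."""
    return reads_per_run * bases_per_read


def runs_needed(genome_size, run_output, depth=1):
    """Whole runs needed to reach an average depth (ceiling division)."""
    total_bases = genome_size * depth
    return (total_bases + run_output - 1) // run_output


HAPLOID_GENOME = 3_000_000_000               # 3 gigabases, from the text
original_454 = bases_per_run(100_000, 200)   # original instrument, from the text
titanium = 500_000_000                       # GS FLX Titanium per run, from the text

for name, output in [("original 454", original_454),
                     ("GS FLX Titanium", titanium)]:
    # depth=1 is a single pass; real projects need several-fold more
    print(name, runs_needed(HAPLOID_GENOME, output), "runs for 1x coverage")
```

Under these assumptions, a single pass over the haploid genome takes 150 runs on the original instrument and 6 on the GS FLX Titanium; any useful depth of coverage multiplies those counts accordingly.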
One weakness of this platform is that when stretches of long homopolymers (a run of the same base in a segment of DNA) are encountered, multiple bases are incorporated all at once. For a small number of simultaneous incorporations, it’s easy to determine the true number of events that occurred. However, as the number of identical bases in a consecutive row increases, the ability to accurately discriminate the true number of bases at that location decreases. Also important to realize is that this sequencing methodology has a higher per read error rate than the current gold standard Sanger sequencing. The depth of coverage (the number of times each base is sequenced in a given experiment) helps mitigate the error rate, but it is still a large factor when using this technology for detection of rare mutations, such as those relevant for the early detection of cancer.
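The flow-by-flow readout of pyrosequencing, and the reason long homopolymers are hard to call, can be illustrated with a short sketch. This is a simplified model and not the instrument's actual software: the function name, the flow order, and the signal values are all illustrative assumptions, and each light signal is simply rounded to the nearest whole number of incorporated bases.

```python
def call_bases_from_flowgram(flow_order, signals):
    """Convert per-flow light intensities into a base sequence.

    Assumes each signal is roughly proportional to the number of
    identical bases incorporated during that nucleotide flow.
    """
    sequence = []
    for i, signal in enumerate(signals):
        base = flow_order[i % len(flow_order)]  # nucleotide flowed in this cycle
        count = int(signal + 0.5)               # round to nearest whole base count
        sequence.append(base * count)
    return "".join(sequence)


# Hypothetical flow order and noisy signals for one well
flow_order = "TACG"
signals = [1.02, 0.05, 0.97, 2.10, 0.03, 3.96, 0.08, 1.05]
called = call_bases_from_flowgram(flow_order, signals)
print(called)

# A small shift in a strong signal changes the homopolymer length call
print(len(call_bases_from_flowgram("A", [6.45])),
      len(call_bases_from_flowgram("A", [6.55])))
```

In this toy model a signal near 1 or 2 is unambiguous, but the same absolute noise on a signal near 6.5 flips the call between 6 and 7 bases, which is the homopolymer problem described above.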
Subsequently, the Genome Analyzer (GA) was developed by Solexa Inc, which was later purchased by Illumina Inc (Figure 2). The GA utilizes a completely different strategy for massively parallel DNA sequencing whereby individual DNA fragments with the appropriate linkers are first amplified on a slide matrix, similar to a microscope slide, using a solid-phase PCR process. The slide with these so-called clusters of amplified fragments is then placed into the Genome Analyzer and the 4 bases are added 1 at a time to the entire slide in repetitive cycles. The GA utilizes fluorescently-labeled terminators that allow a sensitive camera to detect, by their fluorescence, single-base incorporation events into growing DNA strands. The terminator molecules are designed such that 2 sequential bases cannot be added in the same reagent addition cycle. Thus, this chemistry solves the homopolymer problem encountered with the GS FLX sequencing platform.
Figure 2. Illumina sequencing technology robust reversible terminator chemistry foundation. Illumina, Inc, San Diego, CA. Used with permission.
The first generation GA machine could analyze a slide containing 40 to 50 million clusters to generate a total of 1 gigabase of DNA sequence in a 3 to 4 day run. The 2 major limitations of the original GA sequencer were that it was quite slow (capable of sequencing only 5-10 bases per day per template) and that the length of the sequence obtained was quite small—less than 35 base pairs. A third significant problem was that the error rate for this sequencing platform was even higher than that of the GS FLX sequencing platform, and the error rate increased towards the ends of the short DNA sequences read. The shorter read length on this platform further complicated the alignment issue, since it is more difficult to specifically map short sequences to the proper region of the reference genome. The first Illumina experiments required even more sophisticated and powerful computing resources to make sense of the huge number of short reads. Subsequent upgrades to the Illumina GA system came quickly, allowing the current version of the platform to produce high-quality reads upwards of 100 bases in length, enough to simplify the previous version’s alignment issues.
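Why shorter reads are harder to place on a reference can be shown with a toy example. The sketch below is illustrative only: the reference string and reads are invented, the function name is hypothetical, and real aligners use indexed, mismatch-tolerant searches rather than this exhaustive exact-match scan.

```python
def count_exact_matches(reference, read):
    """Count every position where the read matches the reference exactly."""
    return sum(
        1
        for start in range(len(reference) - len(read) + 1)
        if reference[start:start + len(read)] == read
    )


# Invented reference containing the same 7-base stretch in two places
reference = "ACGTGATTACAGGCTTGATTACATCCA"

short_read = "GATTACA"       # falls entirely inside the repeated stretch
longer_read = "GATTACAGGC"   # extends into unique flanking sequence

print(count_exact_matches(reference, short_read))   # ambiguous placement
print(count_exact_matches(reference, longer_read))  # unique placement
```

The short read matches two locations and cannot be mapped with confidence, while the longer read reaches into flanking sequence and matches only once.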
The newest addition to the Illumina next generation sequencing platforms is a completely new machine known as the HiSeq 2000. This instrument, which effectively obsoletes the Illumina Genome Analyzers, will be commercially available in late spring 2010 and will generate a complete genome sequence of 2 individuals on a single 10-day run at a reagent cost of $10,000 per individual. Additional improvements to sample preparation and throughput will allow the HiSeq 2000 to sequence 6 genomes per month.
The third sequencing platform to become commercially available was from Applied Biosystems/Life Technologies (ABI). Similar to the 454 and GS FLX platforms, individual DNA fragments have sequencing primers ligated onto them and then are amplified using emulsion PCR on tiny beads. Beads with amplified DNA on them are then purified and covalently linked to a slide surface. A smaller bead size enabled a much higher number of simultaneous DNA sequences to be analyzed than the previous 2 platforms. The sequencing chemistry used is very different from that used by either the GS FLX or the GA, both of which use a “sequencing by synthesis” procedure made possible by using the enzyme DNA polymerase. The ABI chemistry utilizes sequencing by ligation. In this method, fluorescently-labeled oligonucleotide probes are ligated to the primer, only if they are perfectly matched to the upstream sequence. This ligated piece of DNA now serves as a primer, and the next labeled probe is ligated to this if it matches the upstream sequence. A very novel feature of this chemistry is that the perfect annealing of the probe is controlled by only 2 bases of the labeled oligonucleotides. Thus, this method has significantly higher specificity and a higher accuracy than the sequencing by synthesis approach. This sequencing platform is called the SOLiD system, which stands for Sequencing by Oligonucleotide Ligation and Detection (Figure 3).
Figure 3. Overview of SOLiD sequencing chemistry. Life Technologies, Carlsbad, CA. Used with permission.
The output on the first version of the SOLiD platform was limited to approximately 1 gigabase, acquired over a 10-day run. This original SOLiD machine has been continually upgraded over the last 2 years with improvements being made by increasing the purity of the reagents (giving higher accuracy and lower background), increasing the number of bases read per run, and developing an overall easier workflow. Furthermore, dramatic improvements to the number of small beads that can be deposited and analyzed on a single slide, coupled with increased read length, have rapidly increased the output on the current SOLiD 3 Plus instrument to over 60 gigabases in a 10-day run.
The principal strength of the SOLiD platform is that greater sequence accuracy can be obtained. However, this platform has limitations when compared to the previously described 2 platforms. Compared to the GS FLX and GA, the SOLiD platform is the slowest and the length of the sequence obtained is the shortest. The original SOLiD machine could generate only 35 base pair reads, but the current versions can generate 50 base pair reads in about 5 days. A second limitation, which this platform shares with the Roche GS FLX system, is the difficult and time-consuming emulsion PCR step. Life Technologies is scheduled to release 3 small machines in spring of 2010 that are advertised to simplify and automate much of the emulsion PCR steps, which should enhance the workflow on this platform.
The next upgrade on the SOLiD platform will be the SOLiD 4hq that will be available by the end of 2010. This platform will provide a sequencing machine that is claimed to be capable of generating over 300 gigabases of DNA sequence in a 14-day run. This upgrade is not a new instrument with a different design, but an upgrade of the existing SOLiD machine. The reagent cost for sequencing an individual genome on this platform is currently being advertised at $3,000. This machine will be quite comparable to the Illumina HiSeq 2000, capable of generating the complete genome sequence of 6 individuals per month. Further improvements on increasing bead density on the slides for this platform and increasing read lengths even further should enable this platform to continue to dramatically increase sequence output in the next few years.
The Second Next Generation Instruments: Single Molecule Sequencers
The dramatic output of DNA sequence achieved by the first Next Generation DNA sequencers was obtained by utilizing massively parallel DNA sequencing and very small reaction volumes, which significantly reduce reagent costs. However, all of these platforms depend on PCR to create sufficient mass of the DNA fragments to be analyzed. Sequencing of these amplified individual templates causes 2 problems. First, PCR is relatively expensive and time-consuming. Second, the amplification process is not perfect and can in rare cases introduce erroneous bases into the amplified products. These errors are perpetuated in the DNA sequence obtained from these amplification products, which ultimately increases the error rates of these technologies. One solution to both problems is the development of sequencing platforms that can analyze individual single molecules of native DNA without the need for PCR amplification.
The first single molecule sequencing machine was developed by Helicos.5 This platform uses individual DNA fragments that have been tailed with poly dA (a polydeoxyribonucleotide made up of deoxyadenosine nucleotides) and then annealed to a slide containing a lawn of oligo dT (oligodeoxythymidine) primers. The slide is then sequentially exposed to each of the 4 bases and imaged to determine which DNA fragments have incorporated a specific nucleotide. This machine produces very short sequence reads, but on hundreds of millions of templates, yielding multiple gigabases of DNA sequence per run. This machine has been commercially available for the past year, but the high cost of the machine (over $1.3 million), very short read lengths, and a much higher error rate have severely limited its popularity to date.
Another promising single molecule sequencer that will soon be available is being produced by Pacific Biosciences.6 This sequencing platform tethers a single DNA polymerase at the bottom of an optical chamber known as a zero mode waveguide (ZMW). The ZMW is a structure that creates an illuminated observation volume that is small enough to observe (by laser-induced fluorescence) a single nucleotide of DNA being incorporated by DNA polymerase. The 4 nucleotides are fluorescently-labeled and as they are incorporated by the DNA polymerase, the cleavage of the fluorescent terminator by the polymerase can be detected prior to the next base being incorporated. Thus, this instrument can obtain sequence information at the processivity rate of DNA polymerase, which is several hundred bases per second. The first commercially available Pacific Biosciences machine will have 80,000 ZMWs and will be capable of generating 80 megabases (Mbs) of DNA sequence in just 10 minutes.
Three major areas requiring dramatic improvement are the total number of ZMWs per slide, the length of the DNA fragments that can be sequenced, and an improvement to the relatively high overall error rate. The very low cost of running this machine (1 flow cell for this machine only costs $100), coupled with its rapid sequence time could make this an ideal instrument in the clinical laboratories if the above challenges can be met.
In this era of very rapid improvements to sequencing technology, sequencing output per unit cost has continued to increase nearly 10-fold per year. The direct effect of all this development will be that the cost for a complete genome sequence will soon be less than $1,000, and likely will continue to decrease even further over time. The ability to generate a complete genome sequence for an individual at very low cost, considerably lower than it currently costs to generate the sequence of a single large gene, will have a dramatic effect on the types of laboratory testing being performed.
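The cost trajectory implied by this pace can be sketched in a few lines. This assumes the improvement holds at a constant 10-fold per year, which the text describes only as "nearly 10-fold"; the function name is hypothetical, and the $10,000 starting point is the reagent cost per genome quoted earlier for 2010.

```python
def years_until(start_cost, target_cost, fold_per_year=10):
    """Years of constant fold-improvement until cost is at or below target."""
    years = 0
    cost = float(start_cost)
    while cost > target_cost:
        cost /= fold_per_year   # one year of improvement
        years += 1
    return years


for target in (1_000, 100):
    print(f"${target:,} genome reached after", years_until(10_000, target), "year(s)")
```

At that assumed pace, the $1,000 genome is about one year away and the $100 genome about two, which is consistent with the projections made in this article.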
The Department of Laboratory Medicine and Pathology (DLMP) at Mayo Clinic is a national leader in the development, implementation, and support of clinical laboratory genetic services. As we look forward into the next decade of diagnostic testing, the technology discussed here will transform current practices.
Mayo Clinic DLMP has for some time utilized traditional fluorescent DNA sequencing, such as that used to sequence the first human genomes, as a vital component of the overall testing strategy. However, future clinical sequencing needs will demand the capabilities provided by these next generation sequencing technologies since future assays will not be limited to a small number of genes. Rather, they will extend to much larger genes and gene panels. For example, rather than interrogating a single hereditary colon cancer gene, panels of genes will be developed that allow both common and rare causes of disease to be simultaneously interrogated. This approach will be far more comprehensive than previously possible, and will be used to construct panels for a number of additional clinically relevant disorders ranging from hereditary colon cancer to hypertrophic cardiomyopathy and on to mental retardation, just to name a few.
In addition to these early clinical projects currently being explored, there is a vast sea of possibilities in the fields of high-resolution HLA genotyping, HIV-drug resistance, microbiome sequencing, and even complete human exome sequencing. Ongoing research will lead to even more possibilities regarding potential applications of this technology.
A target for Next Generation sequencing is the relatively small genome of the mitochondrion. The mitochondrion occupies a unique position in eukaryotic biology. First, it is the site of energy metabolism, without which aerobic metabolism and life as we know it would not be possible. Second, it is the sole subcellular organelle that is composed of proteins derived from 2 genomes, mitochondrial and nuclear. A group of hereditary disorders due to mutations in either the mitochondrial genome or nuclear mitochondrial genes has been well characterized. Mitochondrial disorders are a group of diverse genetic diseases caused by deficiencies in 1 or more proteins involved in cell respiration and energy production. These diseases affect multiple organ systems, are progressive, and can occur throughout one’s lifespan. These metabolic disorders may occur as commonly as 1 in 3,000 births, but are frequently difficult to diagnose because of the variability of their phenotypic presentations and the diversity of disease onset. A significant fraction of these disorders are caused by 1 of a few well-known mutations in the mitochondrial genome, but the underlying genetic basis of many of them is unknown. With current sequencing technology, analysis of the entire 16,569 base pair mitochondrial genome is difficult, time consuming, and expensive.
Mayo Clinic researchers have developed a Next Generation sequencing assay for the complete mitochondrial genome. The accuracy, precision, and analytical sensitivity of the sequencing method have been determined. Currently, efforts are focused on streamlining the workflow, decreasing the cost, and final validation. This Next Generation sequencing-based assay will be very useful to clinicians in characterizing this difficult-to-diagnose set of genetic disorders.
Another important application of Next Generation DNA sequencing will be to detect early mutations in cancer. In this method, a small number of sequences with mutations and alterations must be detectable within a much larger group of nonmutant sequences. By amplifying specific chromosomal regions that are known hot spots of mutation in cancer (for example, regions within p53 or mutation hot spots in KRAS) and then “deep sequencing” those regions on a Next Generation DNA sequencer, this technology could potentially detect these rare events. While this process has a number of important technical challenges, including a high error rate, it could provide a powerful means for the early detection of cancer.
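The interplay between sequencing depth and error rate in rare mutation detection can be made concrete with a simple binomial model. This is a minimal sketch under strong assumptions: sequencing errors are treated as independent and identically likely at every read, the depth and error rates are illustrative values rather than measured ones, and the function names are hypothetical.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))


def min_variant_reads(depth, error_rate, alpha=0.001):
    """Smallest number of variant reads unlikely (< alpha) to arise from error alone."""
    for k in range(depth + 1):
        if prob_at_least(k, depth, error_rate) < alpha:
            return k
    return None  # no threshold satisfies alpha at this depth


depth = 1000  # assumed number of reads covering the hot spot position
for error_rate in (0.01, 0.001):
    k = min_variant_reads(depth, error_rate)
    print(f"error rate {error_rate}: need at least {k} variant reads "
          f"({k / depth:.1%} of reads)")
```

In this model, lowering the per-read error rate lowers the number of mutant reads needed before a call can be trusted, which is why the high error rate noted above limits how rare a mutation can be detected at a given depth.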
Using this method, different biological fluids would then provide optimal source material for the screening of distinct cancers. For example, urine would provide a useful source for the detection of mutant DNA molecules coming from cancers of the bladder. Stool would be an ideal source to search for mutant DNA molecules arising from the tissues that are part of the aerodigestive tract (head and neck, lung, esophagus, stomach, pancreas, and colon). A great deal of research still needs to be performed to define the relevant genomic regions to be interrogated and to define the clinical sensitivity, specificity, and predictive value of this rare mutation detection.
Another important application, taking advantage of the “deep sequencing” capabilities of Next Generation DNA sequencing, is the ability to monitor patients for residual or recurring cancer after treatment. A defined panel of frequently mutated chromosomal regions similar to the one employed for the early detection of cancer could be used to examine patients for the presence of rare cancer cells or cancer cell recurrence after treatment. This monitoring could be extremely sensitive for detecting minimal residual disease.
Hereditary Colon Cancer Syndromes
The study of hereditary colon cancer is a cornerstone of the clinical molecular genetics laboratory at Mayo Clinic. We continue to be one of the leading molecular diagnostics facilities providing screening and diagnostic services for individuals suspected of having either hereditary nonpolyposis colorectal cancer (HNPCC), familial adenomatous polyposis (FAP), or MYH-associated polyposis (MAP)-related colon tumors. Currently, the 5 most frequently reported genes involved in tumor development include 3 mismatch repair genes involved in HNPCC-related tumors: hMLH1, hMSH2, and hMSH6, and 2 polyposis related genes, APC and MYH. Together these genes consist of 75 exons, covering approximately 24,000 bases of coding sequence, and approximately 280,000 bases of genomic DNA. Currently, each of these genes is interrogated individually using standard Sanger DNA sequencing. Applying the Next Generation methodologies discussed in this article would allow these genes to be analyzed en masse at a much more cost-effective price point. Furthermore, this panel of just 5 genes could easily be increased to include additional genes important for diagnosing other rare colon cancer syndromes—all while maintaining the cost-effectiveness of the current testing.
Characterization of Lymphoma Genetics
Mayo Clinic is a worldwide leader in translating novel discoveries in lymphoma biology to advances in the clinical care of lymphoma patients. A central genetic feature of many lymphomas is the presence of chromosomal translocations: interchange of genetic material between 2 remote areas of the genome that can form new fusion genes or dramatically influence gene expression. We have successfully utilized a cost-effective Next Generation sequencing approach to detect translocations in a highly lethal subset of lymphomas called peripheral T-cell lymphomas. This technique, called mate-pair library sequencing, already has allowed us to identify a novel, recurrent translocation in a subset of peripheral T-cell lymphomas with distinct clinical, pathologic, and molecular features. This approach can be translated to clinical practice in 3 distinct and complementary ways: 1) Translocations that correlate with distinct clinicopathologic features can serve as biomarkers for diagnostic and prognostic use; 2) Genes involved in novel translocations can be investigated in experimental models to identify molecular targets for new therapies; and 3) Mate-pair library sequencing itself may represent a cost-effective diagnostic platform to identify translocations and other genetic anomalies in clinical tissue specimens.
The area of cardiovascular genetics has been evolving in recent years with the identification of novel genes involved in cardiovascular-based genetic disease. For example, inherited connective tissue disorders with cardiovascular involvement, such as Marfan syndrome, can be caused by one of any number of genes. These disorders have overlapping symptoms that can make them difficult to accurately diagnose on the basis of clinical presentation alone. Thus, it is often necessary to perform genetic testing in order to provide accurate diagnoses and optimal genetic counseling to the family. Many of the genes involved are large (eg, the FBN1 gene involved in Marfan syndrome has 65 exons), and, thus, it can be expensive to sequence each of these genes to determine which one, if any, is involved in each patient or family. Multiple genes (many of them large) are similarly involved in other cardiovascular genetic disorders. For example, at least 15 different genes are known to be involved in familial hypertrophic cardiomyopathy, a disorder that occurs in 1 of 500 individuals. Over 25 genes have been shown to cause familial dilated cardiomyopathy. With the advent of Next Generation sequencing, the ability to sequence panels of multiple genes more quickly and cost-effectively, compared to traditional Sanger sequencing, will be of tremendous benefit to families with inherited cardiovascular-based syndromes.
Pharmacogenomics relies on the genetic make-up of an individual to predict drug response and efficacy, as well as potential adverse drug events (ADEs). Current clinical applications for pharmacogenomics include predicting response to and preventing ADEs from chemotherapeutics, psychotropic medications, coagulation-based drugs, and other medications. In the clinical setting, the most widely used technology for pharmacogenomics testing is targeted genotyping of specific variants found in one or more genes involved in drug response. The drawback to this technology is the inability to detect rarer variants that may affect drug response. One way to overcome this limitation is to sequence the genes; however, high cost and length of time to obtaining sequencing data from multiple genes generally precludes this type of technology in the clinical pharmacogenomics setting. By applying Next Generation sequencing to the clinical pharmacogenomics arena, rare variants can be identified more efficiently than with traditional Sanger sequencing. In addition, the high-throughput nature of Next Generation sequencing would lend itself well to the world of pharmacogenomics, where drugs are often metabolized by multiple enzymes, thereby requiring analyses of panels of genes that encode these enzymes. Thus, in the clinical setting, Next Generation sequencing could provide for more comprehensive analyses, while maintaining cost-effectiveness and time efficiency. In the future, examination of a person’s entire genome or exome via Next Generation analyses could assist health care providers in their efforts to make individualized treatment decisions. The integration of Next Generation data, information management systems, and pharmaceutical databases could provide prescribers with important tools for optimizing personalized medicine.
The development of these new technologies will have a significant impact on the types of clinical tests that Mayo Clinic Department of Laboratory Medicine and Pathology will provide in the near future. We are quickly approaching a new age when whole human genomes will be sequenced at a very low cost. Procedures that cost over $200 million in reagents in 1999 cost less than $10,000 in 2010. DNA sequencing capabilities continue to increase exponentially, while the cost per base plummets. The cost of generating a complete genome sequence will soon be less than $1,000 and, shortly after that, less than $100. Inevitably, at these price points, full genome sequencing and analysis will become an integral part of clinical practice. The key will be to understand how to translate that readily available information into informed clinical decisions for the best healthy living recommendations and treatments prior to, during, and following disease.
Authored by Dr. David I. Smith, Dr. Matthew Ferber, and Dr. W. Edward Highsmith, with contributions from Dr. Linnea Baudhuin and Dr. Andrew Feldman