The Next Generation Sequencing Revolution
Next Generation Sequencing Part 1
Published: December 2013
Next generation sequencing is rapidly evolving. Technological improvements in instrumentation and methodologies have decreased the cost of genome sequencing and offer additional opportunities for clinical utility.
Presenter: David I. Smith, PhD
- Experimental Pathology and Laboratory Medicine at Mayo Clinic in Rochester, Minnesota
Transcript
Welcome to Mayo Medical Laboratories’ Hot Topics. These presentations provide short discussions of current topics and may be helpful to you in your practice. Our speaker for this program is Dr. David I. Smith, Ph.D., from the Division of Experimental Pathology and Laboratory Medicine at Mayo Clinic. Dr. Smith reviews how technological improvements have increased the speed and capacity of DNA sequencers and discusses expectations for future improvements in genome sequencing.
Thank you, Heidi, for that introduction.
I have no disclosures to report relative to this work.
Today, I’m going to talk about advances that have occurred in DNA sequencing that have made it possible to sequence the human genome and are going to transform the way we use this technology in both research and clinical practice. The human genome is actually a composite of a number of different things. We each inherit 3 billion base pairs of DNA sequence from our mothers; we also inherit 3 billion base pairs of DNA sequence from our fathers. But in addition, our mitochondrial genomes are inherited solely from our mother; and surprisingly, we are not just a single organism. There are hundreds of thousands of bacterial and viral genomes that are a part of us too, and those genomes actually vastly outnumber the number of human genomes.
The way that DNA replicates is illustrated on this slide. A nucleotide is brought into a growing chain, which is a template and a primer, and that nucleotide is incorporated into the chain, and then the hydroxyl group, which is sitting at the end of that nucleotide, is the site for the addition of an additional base; and it is the sequential base addition, which is how DNA is replicated. It is this basic replication that is copied in the standard procedure for DNA sequencing known as dideoxy DNA sequencing.
In this protocol, both a template and an oligonucleotide primer are the starting material; and then we have 4 separate reactions. The first reaction has a low concentration of a dideoxy adenosine nucleotide. What this means is when that dideoxy nucleotide is incorporated into a growing chain, it terminates that chain at that position. However, since there is only a very low concentration, most of the molecules elongate further, but some molecules will stop at the next nucleotide. And by doing 4 separate reactions (1 with dideoxy A, 1 with G, 1 with C, and 1 with T), it is actually possible to interrogate the sequence of a small DNA fragment, and this basic technology is the technology that is utilized for sequencing something as complex as a human genome.
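As a rough illustration of the four-reaction logic just described, here is a minimal Python sketch. It is a toy model, not a laboratory protocol: the function name `sanger_read`, the 5% terminator fraction, and the number of molecules per reaction are all assumptions chosen for illustration, and the template is treated as if written so that the newly made strand reads left to right.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def sanger_read(template, molecules_per_reaction=2000, ddntp_fraction=0.05, seed=0):
    """Simulate four dideoxy reactions and read the new strand from fragment lengths.

    Assumption: `template` is written so that the synthesized strand is simply
    its base-by-base complement, read left to right.
    """
    rng = random.Random(seed)
    new_strand = "".join(COMPLEMENT[base] for base in template)
    terminated = {}  # fragment length -> base that terminated the chain
    for dd_base in "ACGT":  # one reaction per dideoxy nucleotide
        for _ in range(molecules_per_reaction):
            for length, base in enumerate(new_strand, start=1):
                # Only the matching base can act as a terminator in this reaction,
                # and only a small fraction of incorporations use the dideoxy form.
                if base == dd_base and rng.random() < ddntp_fraction:
                    terminated[length] = dd_base
                    break  # this molecule stops growing here
    # Ordering fragments by length mimics separation on a gel or capillary.
    return "".join(terminated[length] for length in sorted(terminated))

if __name__ == "__main__":
    print(sanger_read("TACGGATTCA"))
```

Because the terminator is rare, most chains extend past any given position, yet with enough molecules every position is represented by some terminated fragment, which is why ordering by length recovers the sequence.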
However, the simple technology had to be developed into an instrument, and this is the guts of that instrument. It is the ABI-3700 machine. This machine is an automated DNA sequencer, which uses dideoxy sequencing; but instead of running the sequences out on a gel, actually runs the sequences through very small capillaries. This device has 96 capillaries and, hence, is known as CE or capillary electrophoresis. DNA fragments run through the capillaries; and the nucleotides at the end, which are fluorescently labeled, are detected as colors. And then from that, one can infer the DNA sequence of the small fragment.
The company Celera Genomics, which was developed by Craig Venter, took this technology and utilized what is known as whole genome shotgun sequencing. What this means is that the DNA from an individual (that individual happened to be Craig Venter himself) is fragmented into small pieces, and each of those pieces is basically sequenced individually. Hence, it’s called shotgun sequencing because all the fragments are just shot out and sequenced. Craig Venter took 330 ABI 3700 machines, and that was actually the machinery that was used for sequencing. They produced small insert libraries, which were cloned into E coli. They have robotics for picking the millions of colonies, and then the DNA sequences that are obtained are actually assembled into an entire genome sequence using a supercomputer. This took 3 months of sequencing to generate over 20 gigabases of Craig Venter’s genome.
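The assembly step mentioned above can be sketched with a toy greedy overlap assembler. This is only an illustration of the idea of stitching reads together by their overlaps; it is not the assembler Celera used, and the function names, the minimum overlap of 3 bases, and the example reads are assumptions made up for this sketch.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing left to merge; remaining pieces stay separate
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs

if __name__ == "__main__":
    # Three overlapping reads from a made-up 15-base sequence, in shuffled order.
    print(greedy_assemble(["CCTTAGA", "ATGGCGT", "CGTACCT"]))
```

Real genomes contain repeats that defeat a simple greedy strategy, which is one reason a supercomputer and far more sophisticated software were needed.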
To sequence the Venter genome, the small fragments had to be cloned into plasmid vectors in E coli. If it did not clone in E coli, it would not be sequenced. In addition, the dideoxy sequencing is done in small 10 mcL reactions, but this still contains over 10^13 molecules, which need to be sequenced in each individual reaction. So they obtained up to 60 gigabases of sequence, and then supercomputers were used to assemble the draft sequence of the first individual, Craig Venter. The approximate reagent cost to do all of this is over $200 million, but the total cost to generate the technology and set up the factory was $3 billion.
Now, the reason that this is so expensive is: First, everything had to be cloned into plasmids so that one can obtain large numbers of molecules of each fragment which needed to be sequenced; and then secondly, as I mentioned before, each 10 mcL reaction requires considerable dideoxy nucleotides, which are the chain terminators, and other reagents. So this brings up the question: How can we reduce these costs to bring down the cost of genome sequencing?
The answer is first, to replace the cloning step. If instead of cloning into E coli, one could use polymerase chain reaction or PCR-based strategies to amplify individual fragments, then all of the pieces of DNA can be obtained; and there are 2 different strategies for this type of PCR amplification. The first one is known as Bridge Amplification, and the second one is known as Emulsion PCR.
The slide here shows how Bridge Amplification works. DNA fragments that are going to be sequenced are shown on the very first slide in the top left. These small fragments are tailed with oligonucleotide primers of 2 different flavors. We could call them A and B. These DNA fragments are then flowed across a surface, which is shown in the second slide; and on the surface of this slide are small oligonucleotide primers, which are actually complementary to both A and B. So an individual fragment binds to one of these either from the A end or the B end, and that fragment then can fold over; and the opposite side, which has a different primer, can find its complementary primer. This becomes the template for a replication; and when you start after the replication, you end up with 2 DNA fragments. When these are denatured, they too will bend over; and ultimately, what you generate is a small little forest of amplified fragments. And on the slide surface, there are first thousands and then hundreds of thousands and then millions of individual DNA fragments which have been amplified in this way.
The alternative to this procedure is known as Emulsion PCR. The DNA fragments are fragmented in a similar way. They are tailed with oligonucleotides (also can be A and B); but instead of amplifying on a surface, the actual amplification step occurs out on a tiny bead. That bead is actually in a tiny water droplet in a lipid matrix. These tiny little water droplets are individual amplification vessels; and inside of these vessels, that individual DNA fragment which is in there is amplified to tens of thousands to as many as 1 million copies. Each bead then has amplification of only a single DNA sequence. The beads themselves are so small that in a small microcentrifuge tube, which is shown on the left, there can be millions to hundreds of millions of these tiny beads. So with these 2 procedures, either with Bridge Amplification or Emulsion PCR, individual DNA fragments can be amplified to enough copies so that one can begin to do DNA sequencing. However, in contrast to the sequencing that’s done with capillary electrophoresis, there are only one-millionth to one hundred-millionth as many DNA molecules being analyzed, and this helps to reduce the cost of the sequencing.
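To give a feel for the copy numbers quoted above, here is a small worked calculation. It assumes idealized PCR in which every cycle exactly doubles the number of copies starting from one molecule; real reactions are less efficient, and the `efficiency` parameter and function name are illustrative assumptions, not values from the talk.

```python
import math

def cycles_needed(target_copies, efficiency=1.0):
    """Cycles required to reach `target_copies` from a single molecule.

    efficiency=1.0 means perfect doubling each cycle (an idealization).
    """
    growth_per_cycle = 1.0 + efficiency
    return math.ceil(math.log(target_copies) / math.log(growth_per_cycle))

if __name__ == "__main__":
    for copies in (30_000, 1_000_000):
        print(copies, "copies need about", cycles_needed(copies), "ideal cycles")
```

Under these assumptions, about 15 cycles yield tens of thousands of copies and about 20 cycles yield a million, which is why a single starting fragment per droplet or per cluster is enough.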
The second thing is strategies to reduce the reaction volume. Again, a 10-mcL reaction--10 mcL is only one-hundredth of a mL, but it still uses considerable amounts of reagents. So there are strategies to reduce the reaction volume – 2 solutions: Very small individual cells whose volume is considerably smaller, or to flow the reagents over millions of small clusters, such as the clusters that are generated with Bridge Amplification.
The third thing is to use the same strategy, which is how computers have become so fast, and that’s Massively Parallel Computing. By taking millions of processors and working together, one can get tremendous speeds out of computers. Similarly, instead of doing individual DNA sequencing, if one does Massively Parallel Sequencing, one can obtain large amounts of information. So hundreds of thousands to millions of simultaneous sequencing reactions are run, and this is how the output on these things can be considerably greater.
The 2 platforms that I’m going to describe: The first platform is the Illumina Genome Analyzer. This machine is shown on the slide right here, and this platform, as I was describing before, first uses Bridge Amplification, which was previously shown and is shown on the right of this slide. The actual amplifications occur on a flow cell, which is shown on the left of the slide. This flow cell has 8 lanes on it; and within each lane, there is the capability of hundreds of thousands of sequencing reactions.
The actual sequencing itself is done with polymerase-based sequencing using reversible terminators. What this means is that each of the 4 nucleotides (A, C, G, and T) are actually fluorescently labeled a different color, but the fluorescent label itself blocks that nucleotide so that no additional nucleotides can be added. Hence, at each step of the sequencing, 1 base is added to each of the different clusters shown on the left of the slide. After that is done, a camera takes a picture to see which color is with which cluster, and then a series of chemical steps are done to remove the fluorescent blocking group; and then that sequencing slide is ready for the interrogation of the next base. And by doing this sequentially, and this is shown on the right of the slide, you take pictures of the first base incorporation, the second base, the third base, etc, and you are actually assembling the DNA sequence by following a specific spot through iterations of nucleotide incorporations.
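The cycle-by-cycle imaging described above can be sketched as follows. This is a toy model rather than Illumina's base-calling software: the dye colors assigned to each base, the cluster coordinates, and the function names are hypothetical, chosen only to show how following one spot through successive images yields a read.

```python
# Hypothetical dye assignment; the real instrument's chemistry is not specified here.
DYE_COLOR = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
COLOR_TO_BASE = {color: base for base, color in DYE_COLOR.items()}

def image_cycles(clusters, n_cycles):
    """Return one 'image' per cycle: the color seen at each cluster position.

    `clusters` maps a cluster position to the bases incorporated there.
    Because the terminator blocks extension, exactly one base is added per cycle.
    """
    images = []
    for cycle in range(n_cycles):
        images.append({pos: DYE_COLOR[seq[cycle]] for pos, seq in clusters.items()})
    return images

def call_bases(images):
    """Follow each spot through the stack of images and build its read."""
    reads = {}
    for image in images:
        for pos, color in image.items():
            reads[pos] = reads.get(pos, "") + COLOR_TO_BASE[color]
    return reads

if __name__ == "__main__":
    clusters = {(10, 4): "ACGTA", (52, 9): "TTGCA"}  # made-up clusters
    print(call_bases(image_cycles(clusters, n_cycles=5)))
```

The read length equals the number of imaging cycles, which is why longer reads require more cycles of chemistry and imaging.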
The strengths and weaknesses of the Illumina Sequencing Platform: initially, the reads were extremely short in length, only 31 base pairs. This is in contrast to 1000 base pair reads, which are routinely obtained with capillary electrophoresis. But with improvements in technology in just 4 short years, the read lengths have increased, so now they are up to 100-151 base pairs in length. In addition, although there’s very high accuracy at the beginning of a read, there’s considerably lower accuracy towards the end, and this is a weakness. However, the strength of this platform is that there is dramatic room for improvement in sequence output, and a very important other strength is that the dye terminators only allow 1 base to be incorporated at a time.
This machine, the HiSeq 2000, was produced by the same company, Illumina, about 3 years after the Illumina Genome Analyzer came out, and this machine had even greater output.
The HiSeq 2000 also can run 8 lane flow cells, but they can run a smaller flow cell, which just contains 2 lanes; and the current output on this machine is a phenomenal 6 billion reads per run or up to 600 gigabases of DNA sequence output.
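The two output figures quoted above are consistent with each other if one assumes a read length of 100 bases, the lower end of the 100-151 base range mentioned earlier. A quick check, with a function name chosen only for illustration:

```python
def run_yield_gigabases(reads_per_run, read_length_bp):
    """Total bases produced in one run, expressed in gigabases."""
    return reads_per_run * read_length_bp / 1e9

if __name__ == "__main__":
    # 6 billion reads, assuming 100 bases per read.
    print(run_yield_gigabases(6_000_000_000, 100), "gigabases")
```

Output therefore scales with both the number of simultaneous reactions and the length of each read.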
An alternative strategy, and this is another platform, is a company that has developed something called the Ion Torrent, and this is produced by Life Technologies. In contrast to the other technology from Illumina,
this uses a completely different strategy both of amplification and of DNA sequencing. The actual sequencing is done on this machine on a chip, a chip which can actually be produced in very much the same way that computer chips are produced, and this gives you the capability for scalability, simplicity, and speed.
Now, with this technology, the actual chip itself is produced with semiconductor manufacturing on a wafer, which is shown on the top left slide. On the bottom right slide is what the actual chip looks like. What you have is millions of very, very small wells, and each well is actually wired up in such a way that it can detect changes in the chemistry inside the well; in this specific instance, they are each pH indicators. This becomes important because when a base is added to a growing chain in DNA sequencing, a hydrogen ion is released. Each time a single base is added, a single hydrogen ion is released. Well, if you’ve amplified DNA fragments onto the bead (amplification in this platform is with emulsion PCR), then you will have 30,000 DNA molecules on that bead. Each of those 30,000 molecules will release a hydrogen ion, and that well will detect the change because 30,000 hydrogen ions will change the pH in that well. So by flowing nucleotides over that, when the nucleotide is incorporated, one will see a peak of a pH change, and that will tell you that that nucleotide has been incorporated at that position; and across the entire chip, you are doing this simultaneously for millions of molecules.
Now, here is just showing the hydrogen being released when a single nucleotide is being put on. One of the real strengths of this platform is instead of using very expensive nucleotides that are fluorescently modified, this just uses naked nucleotides, which are much cheaper to obtain, and it’s much easier to work with.
The direct detection is seen by flowing first A, C, G, and T over the entire surface: those wells that contain a bead which is going to incorporate an A are going to release hydrogen ions, and those hydrogen ions are going to be detected; and over time, one can determine which bases were added when.
There are strengths and weaknesses to this platform. One of the strengths is it has longer initial read-lengths. The second big strength is cheap reagents so that you don’t need expensive dyes attached to your nucleotides. There has been dramatic improvement in this technology, and there is room for considerably more improvement. However, since the bases themselves are just naked nucleotides, they are not blocked, hence a major problem is homopolymers. This refers to a stretch of nucleotides of the same sequence; and if there are 10 ‘A’s in a row, 10 ‘T’s will automatically be added, and it is sometimes difficult to determine exactly how many bases have been incorporated. A second weakness and problem with this platform is that the Emulsion PCR, which is how fragments are amplified on it, is problematic as compared to Bridge Amplification.
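The flow-based detection and the homopolymer issue described above can be sketched with an idealized, noise-free model. The flow order A, C, G, T follows the description in the talk; the function names and the assumption that signal height is exactly proportional to the number of bases incorporated are simplifications for illustration only.

```python
def flowgram(new_strand, flow_order="ACGT", n_flows=8):
    """Idealized signal per nucleotide flow for the strand being synthesized.

    Because the nucleotides are unblocked, a whole run of identical bases is
    incorporated in a single flow, giving one larger signal.
    """
    signals = []
    position = 0
    for flow in range(n_flows):
        base = flow_order[flow % len(flow_order)]
        incorporated = 0
        while position < len(new_strand) and new_strand[position] == base:
            incorporated += 1
            position += 1
        signals.append((base, incorporated))
    return signals

def call_bases_from_flows(signals):
    """Turn (base, signal height) pairs back into a sequence."""
    return "".join(base * height for base, height in signals)

if __name__ == "__main__":
    flows = flowgram("AACGGGT")
    print(flows)
    print(call_bases_from_flows(flows))
```

In this ideal model a run of 10 identical bases gives a signal of exactly 10. On a real instrument the signal is noisy, and telling a height of 10 from 9 or 11 is much harder than telling 1 from 2, which is the homopolymer problem.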
The Ion Proton System is the evolution of the Ion Torrent System, and this platform is basically a benchtop system, which contains state-of-the-art electronics to support very high output. It is run by a Dual 8-core Intel Xeon Sandy Bridge processor for analyzing what’s going on. This has a tremendous amount of RAM (128 gigabytes); and in addition, this thing is entirely set up so that when you put a chip inside of this, you can determine the DNA sequence from an individual in less than 24 hours. So this machine, the Ion Proton, is a DNA genome sequencing machine.
Now, this slide here shows the evolution of instrument performance, specifically of the Illumina platform, but the advances that have occurred in the Ion Torrent platform have followed a similar trajectory. So, when this platform was first introduced in 2007, when one ran a single chip, you had to struggle to get 1 gigabase of DNA sequence output. On the left-hand side of the slide is the number of gigabases one can obtain from a run at different points in time, and the very first instrument, shown in purple, is the Genome Analyzer (GA); over time, this was improved such that by the end of its run, one could obtain up to 60 gigabases of DNA sequence. The HiSeq 2000, the replacement for this platform, was capable of 200 gigabases of DNA sequence when it was first released. Within a year, that was up to 400 gigabases. And this just shows that they have now obtained 1 terabase, which is 1000 gigabases of DNA sequence. This is quite significant because this is sufficient DNA sequence to actually obtain the full genome sequence of 10 individuals. Because of these dramatic increases in output, basically increasing 1000-fold in output at the same cost over 4 years, we have seen a dramatic decrease in the cost of sequencing a genome.
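The statement that 1 terabase is enough for 10 genomes can be turned into an average depth of coverage. The sketch below assumes a genome size of 3 gigabases, the figure used elsewhere in this talk; the function name is illustrative.

```python
def mean_coverage(run_output_gb, genomes_per_run, genome_size_gb=3.0):
    """Average number of times each base of each genome is read."""
    return run_output_gb / genomes_per_run / genome_size_gb

if __name__ == "__main__":
    print(round(mean_coverage(1000, 10), 1), "fold coverage per genome")
    print(1000 / 1, "fold increase in output, from 1 to 1000 gigabases per run")
```

Each genome therefore receives roughly 33-fold coverage. Reading every position many times is needed because individual short reads contain errors and land at random positions.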
So if you see in 2001, which is about the time when the human genome project was completed, just before that, the cost for doing a single genome was in excess of $100 million; and yet today, by the beginning of 2013, the cost is now less than $3000 per genome. This is quite significant. This is more than a 30,000-fold decrease in cost for sequencing a genome. The most exciting thing about this is the trajectory of this curve. While it may cost $3000 today, within 2, 3, 4 years, the cost of a genome may be $100 or less. Because of the decrease in the cost per genome, the number of genomes sequenced continues to climb. Today, there are in excess of 30,000 genomes that have already been sequenced, and this is just going to increase geometrically. And the amazing thing is because the cost is going down so much, this is beginning to migrate out of the research laboratory; and there are a number of important clinical applications of this powerful technology.
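Taking the dollar figures quoted above at face value, the size of the cost reduction can be checked directly. The function name is an illustrative assumption; the inputs are the $100 million, $3000, and $100 figures from the talk.

```python
def fold_decrease(old_cost, new_cost):
    """How many times cheaper the new cost is than the old cost."""
    return old_cost / new_cost

if __name__ == "__main__":
    print(round(fold_decrease(100_000_000, 3_000)), "fold, from $100 million to $3000")
    print(round(fold_decrease(100_000_000, 100)), "fold, if a genome reaches $100")
```

On these numbers the drop to $3000 is a little over 33,000-fold, and a full million-fold reduction corresponds to the projected $100 genome.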
So, I’ve been discussing next generation sequencing as a tool for whole genome sequencing, but it turns out there are many, many uses for next generation sequencing. In addition to whole genome sequencing, it can be used as a replacement for array comparative genomic hybridization and cytogenetics; and this is going to be a very important clinical application in the next few years. In addition, one can sequence the exome. This is the portion of the genome that actually codes for protein and is really a small percentage of the entire genome but is a hot spot for where many important things occur, and that can be done for considerably less. In addition to those types of sequencing, one can also sequence the transcriptome. The transcriptome refers to that portion of the genome that is actively transcribed in a specific cell, and this can tell you what’s going on in a cell, not just what’s coded in the cell, but what is actually being produced. In addition to all of these things, 1 additional thing that this sequencing can do is determine the methylation. Methylation is an epigenetic modification of the DNA sequence that helps to control gene activity.
For sequencing just a portion of the genome, traditionally, molecular genetics does a variety of tests by amplifying each individual exon as a separate PCR reaction, and then those are sequenced by capillary electrophoresis. Instead of doing that, one can use this technology of next gen sequencing to pick a panel of genes (10, 50, 100, or 200), and they can all be sequenced together. If you want to bring down more than just a small proportion, one can bring down the entire exome, which is the 200,000 exons, which is only 38.5 megabases of DNA sequence, instead of the 3 gigabases, which is the entire genome. However, to do any of these, whether a small portion of the genome or the entire exome, one must first capture the region of interest.
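The exome figures quoted above imply how small a target the exome is. A short calculation using only the numbers from the talk (38.5 megabases, 200,000 exons, a 3 gigabase genome), with variable names chosen for illustration:

```python
EXOME_BP = 38_500_000      # 38.5 megabases
GENOME_BP = 3_000_000_000  # 3 gigabases
EXON_COUNT = 200_000

exome_percent = 100 * EXOME_BP / GENOME_BP
mean_exon_length = EXOME_BP / EXON_COUNT

if __name__ == "__main__":
    print(f"Exome is about {exome_percent:.2f}% of the genome")
    print(f"Average exon length is about {mean_exon_length:.1f} bases")
```

The exome is thus a little over 1% of the genome, with an average exon under 200 bases, which is why capturing it first makes sequencing so much cheaper than a whole genome.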
There are a number of technologies to do that, but the SureSelect technology, which was developed by Agilent Technologies, is one very powerful way of doing that. The DNA sequence from the genome that you have is shown on the top left. Those fragments are cut into small pieces, and then they are tailed with oligonucleotide primers, which will help you at the end to PCR amplify what you have captured. In order to capture the specific sequences of interest, shown on the right, one uses a biotinylated RNA library, or BAITS. The BAITS are actually the DNA sequences that you’d like to capture. They are synthesized on a wafer and then removed from that wafer, and those fragments are then coated with biotin. Biotin is used because then they can be very easily pulled out of a mixture by using something that binds very strongly to biotin, which is streptavidin. The BAITS, which are biotinylated, are then mixed together with the whole genome DNA, and they are allowed to hybridize; what you generate is hybrids between the BAIT and its capture sequence, and those sequences which are not captured just re-anneal back to themselves. These are then added to magnetic beads that are coated with streptavidin; the streptavidin binds very strongly only to the biotinylated BAITS plus the DNA sequences that they’ve captured, and you simply use a magnet to pull the magnetic beads over to one side. These can then be washed. The DNA fragments which have been specifically bound are then eluted. The PCR primers at the ends can be used to amplify that up, and then that can be directly sequenced. And this is a fine-tuned way of pulling down just a specific portion of the genome, a bunch of genes, or as much as the entire exome.
The conclusions are: The Next Generation Sequencing Revolution is here. The cost for whole genome sequencing is approaching, and will soon be less than, $1000. Next Generation Sequencing can do a whole lot more than just whole genome sequencing, and this will transform clinical practice, and very quickly. In a subsequent lecture, I am going to be describing how this very powerful technology can be used for a number of clinical applications, both in the near future and in the long term. Thank you.
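The capture step described above can be sketched as a simple filter. This is a toy model, not the Agilent protocol: baits are written in the DNA alphabet for simplicity, hybridization is reduced to an exact reverse-complement match, and the function names and example sequences are made up for illustration.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    """Sequence of the strand that would hybridize to `seq`."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def capture(fragments, baits):
    """Split fragments into those pulled down by a bait and those washed away.

    A fragment counts as captured if it contains the exact reverse complement
    of at least one bait (a deliberate oversimplification of hybridization).
    """
    captured, washed_away = [], []
    for fragment in fragments:
        if any(reverse_complement(bait) in fragment for bait in baits):
            captured.append(fragment)
        else:
            washed_away.append(fragment)
    return captured, washed_away

if __name__ == "__main__":
    target = "GATTACA"                      # made-up region of interest
    baits = [reverse_complement(target)]    # bait is complementary to the target
    fragments = ["CCGATTACAGG", "TTTTCCCCGGGG", "AAGATTACATT"]
    print(capture(fragments, baits))
```

Only the fragments carrying the region of interest stay with the beads, so the material that goes on to sequencing is enriched for the chosen genes or exons.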