DNA Testing - The Next Generation
High Throughput, High Content Technologies
Published: May 2009
Technology advances are rapidly expanding our ability to detect genomic differences in normal and disease states. Dr. Smith reviews how these technological improvements have increased the speed and capacity of DNA sequencers, expectations for future improvements, and the impact that this will have on clinical practice.
Presenter: Dr. David I. Smith, PhD
- Division of Experimental Pathology and Laboratory Medicine at Mayo Clinic
Welcome to Mayo Medical Laboratories' Hot Topics. These presentations provide short discussion of current topics and may be helpful to you in your practice.
Our presenter for this program is Dr. David I. Smith, PhD, from the Division of Experimental Pathology and Laboratory Medicine at Mayo Clinic. Technology advances are rapidly expanding our ability to detect genomic differences in normal and disease states. Dr. Smith reviews how these technological improvements have increased the speed and capacity of DNA sequencers, expectations for future improvements, and the impact that this will have on clinical practice.
The Human Genome Project
The Human Genome Project was a plan to sequence the 3 GB human genome at 99.99% accuracy and, in addition, to sequence multiple model organisms and, importantly, to develop the technologies to do this.
ABI 3700 Machine
And the state-of-the-art machine that enabled the sequencing of the human genome was the ABI 3700, a machine which 10 years ago cost roughly $250,000. With this machine, 96 samples could be run simultaneously through very long capillaries with a 3-hour run time, and with hundreds of these machines running at once, the complete sequencing of the first reference genome was finished in three months' time.
That was done and developed by Craig Venter, and his genome was the one sequenced. His team utilized whole genome shotgun sequencing with 330 ABI 3700s: small insert libraries were cloned, robotics were used to pick the colonies, and after all the sequence was generated, a supercomputer was used for the assembly. Hence, in three months of sequencing, 20+ Gb of raw sequence from Craig Venter's genome was generated, producing the first reference human genome sequence. This was very impressive technology, but technology moves on.
Greatest Impact of the Human Genome Project
The very first impact was that high throughput methodologies came into place, and these technologies were rapidly developed. In 2001 the cost of sequencing a human genome was about $200 million. But how much does that cost today - because it always costs the most to do something the first time. This is where the technology of NextGen Sequencers comes in. One of the key features of the ABI 3700 was that multiple samples were processed simultaneously: instead of a single sample at a time, 96 samples were done together.
Well, one of the first machines to come out was developed by 454 Life Sciences, which was purchased by Roche, and it accelerated the process much more dramatically. This is a picture of that machine.
The way this works is that instead of working with a 96-well plate, a single fiber optic cable is sliced. The middle slide here shows what happens if you slice a fiber optic cable and put a little plastic sheet on the bottom of it: what one has generated is a plate that holds 1½ million tiny wells. Each of those wells is 32 microns in size, so an individual bead, which is exactly 28 microns in size, can be placed in each well, and if those beads have DNA on them, one can do sequencing on those beads. Instead of doing 96 sequences simultaneously, one can do over a million sequences simultaneously, generating an impressive amount of sequence data from that many distinct sequencing reactions.

The original fragments of DNA are denatured and made single stranded. An individual single-stranded piece of DNA is put onto an individual bead, and then, through PCR amplification, that initial copy is multiplied many times. Each bead then carries many copies of essentially the same sequence: bead one will give sequence one, bead two sequence two, and bead one million sequence one million.

The samples are placed into the apparatus and then flooded with the first base, A. Wherever A is the next base to be incorporated, pyrosequencing produces a light signal. A CCD camera at the very top takes a picture, and every single bead that incorporated an A gives a color. You then flood with a C, and every single bead that incorporates a C gives a color.
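The flow cycle described above can be sketched in a few lines. This is a deliberately simplified model (the function name and the one-base-per-flow bookkeeping are illustrative, not the vendor's actual signal processing): each flooded nucleotide extends every bead whose template matches, and homopolymer stretches light up within a single flow.

```python
def pyrosequence(template, flow_order="ACGT", n_flows=8):
    """Return the bases called for one bead over n_flows nucleotide floods."""
    read = []
    pos = 0  # next unread position in the template
    for i in range(n_flows):
        base = flow_order[i % len(flow_order)]  # flood one nucleotide at a time
        # Every consecutive matching base incorporates (and emits light) in one flow.
        while pos < len(template) and template[pos] == base:
            read.append(base)
            pos += 1
    return "".join(read)

print(pyrosequence("ACCT"))  # → "ACCT", recovered within two flow cycles
```

Real 454 runs infer homopolymer length from light intensity rather than tracking positions, but the flood-and-image loop is the same idea.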
One Fragment = One Bead = One Read
One continues this process so that, with one fragment/one bead/one read, you start with your sample, whatever it is. You then fragment that sample to produce small fragments of 300 to 800 base pairs. After ligation, so that there are adaptors on the ends, the fragments are distributed so that one fragment is on one bead, and the beads are clonally amplified. The beads are put into the sequencer, and one gets a digital output of real-time sequencing by synthesis.
So the complete workflow: in the original generation of the machine, you did 420,000 of these sequences at once. It's massive parallelization: the same approach that gives computers much greater speed through massively parallel processing is being applied to sequencing. You do the sample prep and you get the data generation.
Capability of This Machine
The capability of this machine when it first came out - it was called the Genome Sequencer 20 because it could do 20 million base pairs of sequencing - was 100 times that of the ABI 3700, its predecessor. But very quickly this machine was upgraded to be 5 times greater, so that it could do 100 to 200 million base pairs of sequencing in just 12 hours; with this type of machine, 10 machines could do the human genome in just 10 days, and it would only take 2 to 3 people to run them. The next upgrade, which has just become available, is called the GS FLX Titanium. This will do 500 Mb of sequencing in 12 hours.
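The "10 machines, 10 days" claim above can be checked with back-of-the-envelope arithmetic, under stated assumptions (a midpoint throughput of 150 Mb per 12-hour run, machines running around the clock; these inputs are my own, not from the talk):

```python
# Rough throughput check for 10 upgraded GS20-class machines over 10 days.
mb_per_run = 150      # midpoint of the 100-200 Mb per 12-hour run cited above
runs_per_day = 2      # two 12-hour runs per day
machines, days = 10, 10

total_mb = mb_per_run * runs_per_day * machines * days
print(total_mb / 1000, "Gb")  # 30.0 Gb, about 10x coverage of a 3 Gb genome
```

Roughly 10-fold coverage of a 3 Gb genome is consistent with the speaker's estimate for a draft-quality human sequence.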
What's Been Accomplished with the 454?
What’s been accomplished with this? Hundreds of bacterial genomes; you can now put 10 different bacteria into the sequencer and get a complete genome sequence for each of them. The initial, and actually the full, sequencing of the Neanderthal, which was just published, has been done on this. Similarly, they’ve done the initial sequencing of a frozen wooly mammoth. As a promo piece, the genome of James Watson, one of the key discoverers of the structure of DNA, was completely sequenced, and the cost for doing that in December 2006 was just $1 million. If one goes back just 7 years before that, the sequencing of Craig Venter’s genome cost over $200 million, so we are already seeing a dramatic increase in capabilities and decrease in cost.
Illumina Genome Analyzer
The second machine to come out among the NextGen Sequencers is called the Illumina Genome Analyzer. This machine was initially capable of doing 2 billion base pairs of sequence per flow cell, 1.5 Gb of sequencing every 2.5 days, and it requires a very small amount of sample, between 100 nanograms and 1 microgram, to begin with.
Genome Analyzer Workflow
The Genome Analyzer workflow starts when one takes a sample and generates clusters - I’ll describe that very quickly in a second - which are put onto a flow cell. The flow cell has 8 different lanes, so 8 different samples can be run if you want. When one generates hundreds of millions of sequences, an analysis pipeline assembles those sequences together. And the Genome Analyzer is where the samples are run.
Flow Cell Preparation
The way one generates these samples is that a single-stranded molecule of DNA has linkers ligated on both sides. One of those linkers is complementary to the same-color linker which is down on a slide, and that sample can then fold over to make a bridge. Now you have a template and a primer, so if you add the corresponding nucleotides, you can make a copy of it. When one denatures that, one has two strands where one started with one, and one continues this process a number of times so that ultimately one generates a forest. A forest is an amplification of an original DNA fragment, so that one has some 10,000 copies of that fragment.
But across the entire slide one has a forest of forests.
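The bridge amplification just described is roughly a doubling process, so reaching the ~10,000 copies per forest mentioned above takes only about 14 cycles. A minimal sketch, assuming perfect doubling every cycle (real cluster generation is less efficient):

```python
# Perfect-doubling model of bridge amplification: count cycles needed to
# grow one fragment into a cluster of at least 10,000 copies.
copies, cycles = 1, 0
while copies < 10_000:
    copies *= 2
    cycles += 1

print(cycles, copies)  # 14 cycles -> 16,384 copies
```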
Raw Data is Images
Then the individual samples can be sequenced with four different fluorescently labeled nucleotides, and whichever nucleotide is incorporated tells you the next base being added. You can see four different bases being added to these four different forests. You simply take a picture of this after each base addition. The bases are blocked so no additional base can be added; after you’ve taken the picture, you deblock the bases and add the next base.
Now the raw data is the images. For the 8 lanes per flow cell on the original Genome Analyzer, each lane contains 300 tiny tiles, and within each of these tiles there were 20,000 sequencing clusters. So there are literally 6 million sequences generated per lane of the flow cell. It also turns out to generate a large number of images: 2,400 images per reaction cycle, 86,000 images per 36 base pair read length, or 700 gigabytes of image data and 400 gigabytes of files. So this generates a tremendous amount of data which needs to be stored, but it also generates a tremendous amount of sequence.
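The numbers above multiply out directly; a quick sanity check of the flow-cell arithmetic (figures taken from the talk, variable names my own):

```python
# Sanity-check the original Genome Analyzer flow-cell figures quoted above.
lanes = 8
tiles_per_lane = 300
clusters_per_tile = 20_000
cycles = 36  # one cycle per base of a 36 bp read

sequences_per_lane = tiles_per_lane * clusters_per_tile
images_per_cycle = lanes * tiles_per_lane

print(sequences_per_lane)          # 6000000 sequences per lane
print(images_per_cycle)            # 2400 images per cycle
print(images_per_cycle * cycles)   # 86400 images for a 36 bp run
```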
What's Been Accomplished with the Illumina Machine?
Sequence construction: in contrast to the 454, which produces fewer, relatively long reads, this machine produces 100 times as many shorter reads. So you start with the images, which you see as colors over here; those are converted into base calls, and then these small 36-50 base pair reads are aligned together, and from there you can work out what has occurred in your original sample. There are 4-6 million reads per flow cell with the original machine, which has now been upgraded: the original Genome Analyzer is now up to a Genome Analyzer 3 in less than a year and a half, so this technology is moving along at an incredible pace.
What’s been accomplished with the Illumina machine? The very first person of Han Chinese ethnicity, the YH (Yanhuang) sequence; the first person of African descent, the Yoruba sequence. And the cost to sequence your genome last April was $50,000. My advice would be to wait a few months, because it’s just going to cost less. With the first upgrade of the Genome Analyzer, where originally it could do 1 Gb in three days, it’s up to 3 Gb. But in April there was an 8 Gb run, and today I have heard of runs over 20 Gb per run.
SOLiD System - Overview
SOLiD System 2.0 - Launch May 2008
A third system to come out is called SOLiD. It is built on technology developed by George Church at Harvard University, which was purchased by Applied Biosystems, and it is a genetic analysis platform. Again, it performs massively parallel sequencing of clonally amplified beads. The sequencing is based on ligation of dye-labeled oligonucleotides, and it can generate a large amount of sequence data.
SOLiD System Summary Overview
The SOLiD 2.0 came out very shortly after the SOLiD 1.0: they improved the chemistry for much higher throughput, and they improved the workflow.
Higher Accuracy - New Probe Mix 1,2 Probes (Version 2)
Here is an overview of what is commercially supported versus what has been demonstrated. The read lengths, which started small (35 base pairs), are now over 50 base pairs. The throughput is such that on an average run one can generate some 15 to 17 Gb of sequencing. This is important because it means that a complete human genome sequence can be generated in one run on one of these machines. A variety of different applications of this technology are shown here, depending on what you want to do, because these technologies can be used not just for whole genomes but for portions of the genome, and I’ll outline that a little later.
The way this technology works is that you have your little template over here, on a small bead. The beads for the 454 are 28 microns; these beads are much smaller, only 1 micron, but that means you can use a lot more beads. Your sample is put on; a universal primer anneals down to the sequence, and dye-labeled probes carrying the different nucleotides are added. The probe that matches that position is incorporated, ligase comes in and ligates it in place, and one can then determine the sequence of two of the bases.
Sequential Rounds of Sequencing 1,2 Probes (Version 2)
Then one reads the sequence along. When you are finished, you go back to the sample: you remove the universal sequencing primer and come back in with a primer set one base further back, and go again. In this way, every single base is interrogated twice with this type of sequencing. First you call the 1st and 2nd, the 6th and 7th, and the 11th and 12th bases. When that is stripped away, you come back and call the 0th and 1st, the 5th and 6th, and so on; as the enclosed bars show, every single base is actually sequenced twice.
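The offset pattern just described can be written down explicitly. This is a sketch of the interrogation schedule only (function name and the simplified indexing are mine): each ligation cycle jumps five bases, and each new primer round shifts the whole schedule back by one base, which is why positions 1 and 2 in round one reappear as the second halves of pairs in later rounds.

```python
def interrogated_positions(primer_round, n_cycles=3):
    """Base positions (1-based; 0 = last adapter base) read in one primer round.

    primer_round 0 is the first universal primer; each later round starts
    one base further back, shifting every interrogated pair by one.
    """
    pairs = []
    for cycle in range(n_cycles):
        first = cycle * 5 + 1 - primer_round  # 5-base jump per ligation cycle
        pairs.append((first, first + 1))      # two bases read per ligation
    return pairs

print(interrogated_positions(0))  # [(1, 2), (6, 7), (11, 12)]
print(interrogated_positions(1))  # [(0, 1), (5, 6), (10, 11)]
```

Across five primer rounds, every template position falls into two different pairs, which is the double interrogation the speaker describes.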
So, What Can You Do with a NextGen Sequencer?
Well what can you do with a NextGen Sequencer? You can generate 500,000 to 1 million long DNA sequences on the 454. You can generate 50 to 200 million short DNA sequences on an Illumina or a SOLiD system. And one has whole genome capabilities. You can take your genome right now and you can generate the complete sequence with one run on one of the Illumina or SOLiD machines. But one can also look at defined portions of the genome for a more in-depth characterization because the capabilities of these machines are much greater than simply sequencing a whole genome.
For whole genomes, you can sequence multiple bacterial genomes on one Illumina or any other NextGen machine. I would say it is still not ready for prime time for more complex genomes, simply because of the overall cost. Over the next couple of years we will probably see limited use at Mayo, but there are 2 projects going on right now internationally. One is the 1000 Genomes Project, a collaboration between scientists in the United States and Europe, and there is a 100-genome project in China that is sequencing 100 individuals of Chinese ancestry. While Mayo probably won’t do any of that sequencing, the data from these projects will be available to analyze.
Detecting Rare Mutations
One of the key things is that all of these technologies generate from half a million to hundreds of millions of simultaneous sequences, and this means you can look not just for something that is present in all the sequences but also for a very rare event. Hence, a rare mutation or difference can be detected with this methodology. This offers tremendous capabilities for early cancer diagnostics: we could look at the few genes that are mutated in cancer and ask, "Are there a few mutant molecules in a sea of wild-type molecules?" Various cancers could then be detected by just taking a biological fluid: blood, urine, or stool. Hence, body fluid tests for cancer and disease are going to be standard care practice in the next couple of years. Instead of seeing a cancer when you have complications from it, with the doctor trying to treat a disease that has metastasized, we are going to see the cancer before it has even grown large enough to produce a problem, and this is going to change the entire face of cancer treatment.
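The "few mutant molecules in a sea of wild-type" idea above is, at its core, just counting alleles across very deep read coverage. A minimal illustration with invented data (the read counts are made up for the example; real pipelines must also model sequencing error):

```python
from collections import Counter

# Simulated base calls at one position of a cancer gene: 10 mutant
# molecules ("T") hidden among 99,990 wild-type molecules ("A").
reads_at_position = ["A"] * 99_990 + ["T"] * 10

counts = Counter(reads_at_position)
mutant_fraction = counts["T"] / sum(counts.values())
print(f"{mutant_fraction:.4%}")  # 0.0100% mutant allele fraction
```

With Sanger sequencing such a signal is invisible, but with hundreds of millions of independent reads, even a 0.01% allele fraction yields a countable number of mutant reads.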
Another capability is mate-pair reads. This was first described for large-scale genomic analysis, when the genome was being sequenced, by Colin Collins. He would sequence the ends of 200,000 base pair pieces called bacterial artificial chromosomes (BACs), and with them you could actually see if there had been alterations in the sequence. The BACs were 100 to 200 kb in size; thus, thousands of sequences were required. This was all done with Sanger sequencing, and it offered roughly 100 kb resolution. But with NextGen sequencing you are doing this with thousands of times more sequences, so one can look at the spacing between the two ends of a fragment: if there has been a deletion, the paired reads will map too far apart, and if there has been an insertion, they will map too close together. If there has been a translocation, 2 pieces which should not be close together will be. Hence this could actually replace cytogenetics and could be used for multiple diagnostic applications.
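The mate-pair logic above reduces to comparing each pair's mapped distance against the expected library insert size. A minimal sketch (function name, insert size, and tolerance are illustrative assumptions, not from the talk):

```python
def classify_pair(chrom1, pos1, chrom2, pos2, expected=3000, tolerance=1000):
    """Classify one mate pair by where its two ends map on the reference."""
    if chrom1 != chrom2:
        return "translocation"          # ends on different chromosomes
    distance = abs(pos2 - pos1)
    if distance > expected + tolerance:
        return "deletion"               # ends map too far apart
    if distance < expected - tolerance:
        return "insertion"              # ends map too close together
    return "normal"

print(classify_pair("chr1", 100, "chr1", 3100))  # normal
print(classify_pair("chr1", 100, "chr1", 9100))  # deletion
print(classify_pair("chr1", 100, "chr2", 3100))  # translocation
```

Real structural-variant callers require several concordant anomalous pairs before making a call, but this is the per-pair test underneath.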
Hybrid selection is the idea that perhaps you do not want to sequence the entire genome but just a portion of it. One could use oligonucleotide chips, the same types of chips that are used for microarray analysis, to hybridize to and pull down the section of the genome that you want. That section might be all 180,000 exons; it might be a defined 5 to 10 Mb region; or perhaps you take the genes known to be mutated in cancer and pull them down. You then sequence just the enriched portion of the genome, and there is tremendous research potential to localize genes, mutations, or rare events.
As the capabilities of these machines increase - and it has been phenomenal how much they have increased; every year the capabilities of these machines increase 10-fold - you reach a point where it is no longer necessary to put an entire sample onto one of these machines. You might want to mix samples together. This is the capability of bar coding: all of these technologies put oligonucleotides on the ends of the fragments. If you take an oligonucleotide with sequence A for Person A, and another oligonucleotide with sequence B for Person B, you can put samples A, B, C, through X all together in one run, and after the sequencing you simply look for the bar code and say: all of these fragments came from Person A, all of these came from Person B. You can deconvolute that and look at multiple individuals. Currently something like 8 different bar codes are available, so you can put 8 samples per lane, but that is just going to increase. Ultimately you will be able to run hundreds of individuals on one of these sequencers. So, as the throughput of NextGeneration sequencers increases, the cost per sample eventually becomes trivial.
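The deconvolution step described above is a simple binning by tag. A hedged sketch (barcodes, read sequences, and the fixed 4-base tag length are invented for illustration):

```python
from collections import defaultdict

# Hypothetical 4-base barcodes assigned to pooled individuals.
barcodes = {"ACGT": "PersonA", "TGCA": "PersonB"}

def demultiplex(reads, barcode_len=4):
    """Bin each read by its leading barcode, stripping the tag off."""
    bins = defaultdict(list)
    for read in reads:
        tag, insert = read[:barcode_len], read[barcode_len:]
        bins[barcodes.get(tag, "unassigned")].append(insert)
    return dict(bins)

pooled = ["ACGTGGAT", "TGCATTAC", "ACGTCCGA"]
print(demultiplex(pooled))
# {'PersonA': ['GGAT', 'CCGA'], 'PersonB': ['TTAC']}
```

Production demultiplexers also tolerate one or two mismatches in the tag; exact matching keeps the sketch short.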
Clinical Diagnostics. With mate-pair reads, one can do the work of cytogenetics and detect subtle changes. Cytogenetics has relatively low resolution, but it is a very effective technique for seeing gross changes; with this type of technology, you can see both gross changes and very fine changes. One can sequence multiple genes involved in cancer, both familial and sporadic cancers, and the capability of early cancer detection from biological fluids is a very exciting area that a number of researchers at Mayo are pursuing as we speak.
These NextGen machines are just a few years old, but we are already hearing about Next-NextGen sequencers. One limitation of the current NextGen sequencers is that, although they sequence individual fragments, they must first make many, many copies of each one, somewhere between 10,000 and 1 million, and the overall reagent cost for that is still high. The Next Generation, or Next-NextGen, machines are single molecule sequencers. They can read long sequences, they are highly cost effective, and they bring whole genome sequencing and other sequencing into very affordable ranges. They have extremely high sequencing bandwidth: one company, Pacific Biosciences, which is supposed to have a prototype out by 2010, is supposed to have capabilities 100 times greater than the fastest technologies we have today. Another company coming out is Nanopore, whose system has pores through which different nucleotides move at different rates, and other single molecule, fast sequencers are being developed even as we speak.
Center for Individualized Medicine (CIM)
One of the things we are setting up at Mayo to take advantage of this is the Center for Individualized Medicine. We are setting up data source labs for high throughput/high content analysis of clinical specimens. We are going to link those to our existing cores for accessioning and processing materials, and then link together molecular biology, genotyping, microarray, and bioinformatics analysis to determine, for a large number of samples, what the alterations are and how they are associated with clinical parameters.
Generating a Data Source
We are generating a data source as we speak, and this year we are going to begin testing the capabilities of the available NextGen sequencers to generate complete genomic profiles of cancer samples. We are going to work out the kinks by 2010, at which point we will be ready for barcoding and able to analyze 50+ samples per run. We will be able to run hundreds of cancer samples, all fresh frozen with clinical follow-up, and we are going to develop a shared database of cancers by mid-2010.
Capabilities Grow Exponentially
One of the most exciting things - and I cannot even draw on this curve where we were before 2006; even the fastest sequencer on the planet in 1999 would not register on the graph. In early 2006 the very first 454 sequencer came on the market, which could do 20 Mb of sequence. By the end of that year it was up to 100 Mb. By 2007, when the Illumina Genome Analyzer came out, you had the capability to do 1 Gb of sequence. By the end of that year, when the SOLiD system came out, you were up to 6 Gb. In May - and unfortunately I have to change these slides every week to keep up with the technology - there was a sequencing run of 17 Gb. Today the fastest run I have heard of is 60 Gb of sequencing, and by 2010, when Pacific Biosciences comes out, we will see sequencing on the order of 100 Gb/hour. This will profoundly change our world, because every child that is born can be sequenced when the cost of sequencing comes down to these ranges. In addition, the capability of looking for signatures becomes very possible. Our world is going to change dramatically, and one of the things Mayo is investing in very significantly is this technology, so that we can keep abreast of where it is going.
So, What's Coming?
So what’s coming? Multiple-gigabase and beyond sequencing capabilities. We are going to see increased accuracy of the base calls as the chemistry and the imaging improve. We are going to have longer reads; hence, whole genomes can be sequenced quickly and inexpensively. We will have rare mutation detection ready for prime time, so many cancers can be detected by looking for those rare mutations. We currently have 8-plex bar codes, but that is going to go up to 64-plex and beyond. Hence, the cost per sample will continue to drop.
Building the Infrastructure for NextGen (and Beyond) Sequencing
What we are doing now is building the infrastructure for this at Mayo. We need research expertise in many areas, such as sample preparation. We also need a growing team actively involved in evaluating new technologies as they become available, and we need a skilled and considerable bioinformatics team to analyze and make sense of the large amount of data which is generated.
I must thank the following, who have all provided images used in this presentation: 454/Roche, Illumina, and ABI/Life Technologies. Thank you.