Reading into the Future: Development of Long-read DNA Sequencing

By Aditi Goyal, Genetics and Genomics, ‘22

The next revolution in biology is already underway: third-generation sequencing, also known as long-read sequencing. Instead of relying on cluster-based short-read technology (1), third-generation sequencing reads a DNA sequence nucleotide by nucleotide, largely eliminating the extensive process of aligning short reads.

Until now, scientists around the world have relied heavily on Next Generation Sequencing (NGS) to obtain DNA sequences. This technology uses fluorescent nucleotides to read clusters of short DNA fragments, each 50 to 150 base pairs long (2). It is often called sequencing by synthesis because the sequence is determined by tracking which nucleotides are used to build the complementary strand. NGS has served the scientific community well, providing extremely high coverage and highly accurate reads while slashing the cost and time needed to sequence an entire genome (2).

However, the drawbacks are just as serious. While NGS works well for bacterial and archaeal genomes, it fails to capture the complexity of eukaryotic genomes. About half of the human genome is composed of repeated sequences (2). The function of these repeated regions remains unclear, partly because short reads cannot produce an accurate DNA sequence of these areas. With a maximum read length of 150 base pairs, a read that falls within a repeat matches too many places in the genome for scientists to assign it confidently to a single region. Another major problem is the quality of each read. While the technology itself is very accurate, several sources of error cause read quality to deteriorate, including biases during the PCR of mixtures, polymerase errors, base misincorporation, cluster amplification errors, sequencing cycle errors, and incorrect image analysis (3). Together, these errors result in about 1% of bases being read incorrectly, which, for a 3 billion base pair genome, amounts to roughly 30 million miscalled bases in a single pass.
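The mapping problem described above can be illustrated with a toy example. The sketch below is not a real aligner: it builds a small random "genome" containing three copies of a 300 base pair repeat (the sizes, the seed, and the helper names are all assumptions made for illustration) and counts how many places a 150 base pair read and a longer read can be placed by exact matching.

```python
import random


def count_occurrences(genome, read):
    """Count every position (overlaps allowed) where the read matches exactly."""
    count = 0
    start = genome.find(read)
    while start != -1:
        count += 1
        start = genome.find(read, start + 1)
    return count


def build_toy_genome(seed=0):
    """Return (genome, repeat): unique stretches separated by three repeat copies."""
    rng = random.Random(seed)

    def random_dna(length):
        return "".join(rng.choice("ACGT") for _ in range(length))

    repeat = random_dna(300)  # hypothetical repeat element
    genome = (
        random_dna(500) + repeat
        + random_dna(500) + repeat
        + random_dna(500) + repeat
        + random_dna(500)
    )
    return genome, repeat


genome, repeat = build_toy_genome()

# A 150 bp short read that lies entirely inside the repeat.
short_read = repeat[100:250]

# A long read that spans the first repeat copy plus 200 bp of unique
# sequence on each side.
first_copy = genome.find(repeat)
long_read = genome[first_copy - 200 : first_copy + len(repeat) + 200]

print("short read placements:", count_occurrences(genome, short_read))
print("long read placements: ", count_occurrences(genome, long_read))
```

The short read fits equally well in every copy of the repeat, while the long read is anchored by the unique sequence on either side and fits in only one place.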

This is why long-read sequencing is such a breakthrough. By analyzing a DNA sequence nucleotide by nucleotide, scientists can build considerably longer reads with much higher confidence than NGS allows. Ideally, this technology will let scientists produce de novo whole genome sequences for patients with genetic disorders, allowing them to understand the root of a disease at an unmatched resolution. This could pave the way to accurately diagnosing and curing complex genetic diseases. In the last few years alone, several papers have used long-read sequencing to investigate diseases such as Parkinson’s disease, fragile X syndrome, Alzheimer’s disease, and ALS (11). Other applications include improving our understanding of human genetic diversity. Recent studies show that the reference human genomes available today do not accurately represent humanity at a global level, but instead significantly overrepresent people of European descent (12). With the rise of long-read sequencing, it will be easier and cheaper to fully sequence a human genome, allowing us to expand the resources available and accurately reflect the human population.

Several companies are currently researching long-read sequencing. The most promising appears to be Pacific Biosciences (PacBio), thanks to its development of single molecule real-time (SMRT) sequencing (4, 5).

Two key inventions make SMRT possible. The first is a new approach to fluorescent tagging.

As with NGS, each nucleotide is modified to fluoresce a particular color that identifies it. With SMRT, however, the fluorescent group is attached to the terminal phosphate of the nucleotide instead of the base itself (8). Also as in NGS, the complementary strand is built one nucleotide at a time. When the DNA polymerase incorporates a nucleotide, it cleaves off the terminal phosphate and releases the fluorescent group, which allows us to identify the incorporated nucleotide by the color of the fluorescence.
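The color-to-base readout described above can be sketched as a simple lookup. The dye colors assigned to each base below are hypothetical (the actual dye chemistry is not given here), and the function names are illustrative only. Because the pulses report the strand being synthesized, the template is recovered as its reverse complement.

```python
# Hypothetical dye-to-base assignment, chosen only for illustration.
DYE_TO_BASE = {"red": "A", "green": "C", "yellow": "G", "blue": "T"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def decode_pulses(pulse_colors):
    """Translate an ordered list of fluorescence pulses into the synthesized strand."""
    return "".join(DYE_TO_BASE[color] for color in pulse_colors)


def template_from_synthesized(synthesized):
    """Return the template strand as the reverse complement of the synthesized strand."""
    return "".join(COMPLEMENT[base] for base in reversed(synthesized))


pulses = ["red", "green", "green", "blue", "yellow"]
synthesized = decode_pulses(pulses)
template = template_from_synthesized(synthesized)

print("synthesized strand:", synthesized)  # ACCTG
print("template strand:   ", template)     # CAGGT
```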

The second innovation is the zero-mode waveguide (ZMW). The ZMW is a nanoscale chamber that holds the DNA sample during the sequencing process. It passes refracted light through so that the fluorescence of the nucleotides can be seen. This technology essentially acts as a microscope, giving us a powerful resolution of the DNA as it is synthesized. Each ZMW can read more than 10 bases per second with extreme accuracy. Additionally, because ZMWs can be run in parallel, thousands of chambers can be sequenced at the same time, allowing for fast cycles and long reads.
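To get a feel for what parallel chambers buy, the arithmetic can be sketched as follows. The rate of 10 bases per second comes from the paragraph above; the chamber counts and the 10x coverage target are assumptions, and the model optimistically treats every ZMW as productive for the entire run.

```python
def run_time_seconds(genome_size, coverage, n_zmws, bases_per_second):
    """Idealized run time: total bases to read divided by combined throughput."""
    total_bases = genome_size * coverage
    throughput = n_zmws * bases_per_second  # bases per second across all chambers
    return total_bases / throughput


HUMAN_GENOME = 3_000_000_000  # base pairs
RATE = 10                     # bases per second per ZMW, as stated above

# Hypothetical chamber counts, for illustration only.
for n_zmws in (5_000, 50_000, 500_000):
    hours = run_time_seconds(HUMAN_GENOME, 10, n_zmws, RATE) / 3600
    print(f"{n_zmws:>7} ZMWs: {hours:,.1f} hours for 10x coverage")
```

In this simple model, run time falls in direct proportion to the number of chambers, which is why parallelism matters so much for speed.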

The advantages of SMRT are clear: it allows long reads to be built, which means scientists will be able to understand the overall complexity of large eukaryotic genomes. Another advantage is the speed and portability of the technology. Once it is fully developed, SMRT is projected to sequence an entire human genome in under 3 minutes for less than $100 in a device the size of a flash drive, a stark contrast to today’s estimates (9).

Like any novel technology, SMRT faces some challenges that must be overcome before it can be used commercially. The most pressing concern is accuracy. Individual reads contain 11-14% errors on average, dragging down the quality score of each read. However, developers have noticed that these errors occur at random across the genome. With 10x coverage, roughly 9 out of 10 reads will report the correct base at any given position, so the consensus of the reads raises the accuracy to approximately 99.99%.
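The reasoning behind consensus accuracy can be sketched with a binomial model. This is a deliberately pessimistic bound, not the method used by any real consensus software: it assumes errors are independent between reads and counts a position as wrong whenever at least half of the covering reads are in error, as if all the erroneous reads agreed on the same wrong base. The 13% per-read error rate is taken from the middle of the range quoted above, and the function name is illustrative.

```python
from math import comb


def consensus_error_upper_bound(per_read_error, coverage):
    """Probability that at least half of the reads covering a base are wrong.

    Assumes independent errors; ties count as failures, so this overestimates
    the true consensus error rate.
    """
    threshold = (coverage + 1) // 2  # smallest number of reads that is at least half
    return sum(
        comb(coverage, k)
        * per_read_error ** k
        * (1 - per_read_error) ** (coverage - k)
        for k in range(threshold, coverage + 1)
    )


for coverage in (1, 10, 20, 30):
    bound = consensus_error_upper_bound(0.13, coverage)
    print(f"{coverage:>2}x coverage: consensus accuracy at least {1 - bound:.6f}")
```

Even under these worst-case assumptions, the bound at 10x coverage is about 99.5%, and it keeps improving as coverage grows. Real errors rarely agree on the same wrong base, which is why consensus accuracy in practice can be higher than this bound suggests.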

Overall, SMRT is a revolutionary development that will soon change the way we understand biology. It will allow us to gain a holistic understanding of complex eukaryotic genomes and will provide a higher-resolution view of the genome that we can use for further analysis.

References

“Illumina Sequencing Technology.” Illumina, October 11, 2010. https://www.illumina.com/documents/products/techspotlights/techspotlight_sequencing.pdf.
Treangen, Todd J, and Steven L Salzberg. “Repetitive DNA and Next-Generation Sequencing: Computational Challenges and Solutions.” Nature Reviews Genetics. U.S. National Library of Medicine, November 29, 2011. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3324860/.
Fox, Edward J, Kate S Reid-Bayliss, Mary J Emond, and Lawrence A Loeb. “Accuracy of Next Generation Sequencing Platforms.” Next Generation, Sequencing & Applications. U.S. National Library of Medicine, 2014. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4331009/.
Check Hayden, Erika. “Genome Sequencing: the Third Generation.” Nature News. Nature Publishing Group, February 6, 2009. https://www.nature.com/news/2009/090206/full/news.2009.86.html.
Check Hayden, Erika. “Genome Sequencing: the Third Generation.” Nature News. Nature Publishing Group, February 6, 2009. https://www.nature.com/news/2009/090206/full/news.2009.86.html.
Eid, John, Adrian Fehr, Jeremy Gray, Khai Luong, John Lyle, Geoff Otto, Paul Peluso, et al. “Real-Time DNA Sequencing from Single Polymerase Molecules.” Science. American Association for the Advancement of Science, January 2, 2009. https://science.sciencemag.org/content/323/5910/133.
“Video: Introduction to SMRT Sequencing.” PacBio. Accessed November 7, 2019. https://www.pacb.com/videos/video-introduction-to-smrt-sequencing/.
“Single Molecule Real Time Sequencing – Pacific Biosciences.” YouTube. YouTube. Accessed November 7, 2019. https://www.youtube.com/watch?v=v8p4ph2MAvI.
Schadt, Eric E, Steve Turner, and Andrew Kasarskis. “A Window into Third-Generation Sequencing.” OUP Academic. Oxford University Press, September 21, 2010. https://academic.oup.com/hmg/article/19/R2/R227/641295.
Roberts, Richard J, Mauricio O Carneiro, and Michael C Schatz. “The Advantages of SMRT Sequencing.” Genome Biology. BioMed Central, July 3, 2013. https://genomebiology.biomedcentral.com/articles/10.1186/gb-2013-14-7-405.
Pollard, Martin O, Deepti Gurdasani, Alexander J Mentzer, Tarryn Porter, and Manjinder S Sandhu. “Long Reads: Their Purpose and Place.” Human Molecular Genetics 27, no. R2 (August 1, 2018): R234–R241. https://doi.org/10.1093/hmg/ddy177.