Graduate student: 朱德清
Thesis title: 以計算方式研究基因體結構與變異
Computational Methods for Studying Genome Organization and Variation
Advisors: 李忠謀
Lee, Chung-Mou
施純傑
Shih, Chun-Chieh
Degree: Doctoral
Doctor
Department: 資訊工程學系
Department of Computer Science and Information Engineering
Year of publication: 2012
Academic year of graduation: 100 (ROC calendar)
Language: English
Number of pages: 62
Chinese keywords: 基因體結構變異, 短序列定序, 次世代定序, 基因體重組, 基因體反轉, 基因體位移
English keywords: genome structural variation, short read sequencing, next-generation sequencing, genome assembly, genomic inversion, genomic transposition
Thesis type: Academic thesis
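  • DNA sequencing plays an increasingly important role in biological research. Assembling an organism's DNA sequencing data back into its genome yields a great deal of information about that organism. In addition, comparisons of genomes across species or individuals have shown that structural variation in the genome has an important influence on gene expression, disease, and evolution. In recent years, advances in sequencing technology have produced many new short read sequencing platforms that generate large amounts of sequencing data at very low cost. However, the huge data volume and the short read length pose many challenges for reconstructing genomes computationally. Moreover, genomes often contain complex structural variations such as inversion and transposition; traditional sequence alignment algorithms cannot fully align sequences that contain structural variations, nor can they locate where the variations occurred. We therefore propose new computational methods that assemble genomes from short read sequencing data and align two sequences that contain structural variations.
    For genome assembly, we propose a new algorithm that uses "jumping extension" to assemble short read sequencing data into a genome efficiently. The main idea of jumping extension is to use two layers of overlap relationships among reads to filter, from the huge read set, the reads most likely to come from the same region, and then to assemble the genome content of that region. Experimental results show that, compared with other commonly used methods, the proposed algorithm not only produces better assemblies but also uses less memory.
    For sequence alignment, we propose a new alignment algorithm for two sequences that contain inversions or transpositions. By aligning the sequences of the breakpoint regions of inversion or transposition events, we estimate the breakpoint positions, restore the sequence content as it was before the event, and then obtain a complete alignment with a traditional alignment method. Applying the proposed algorithm to human and chimpanzee genome data from the UCSC website, we obtained breakpoints for 130 inversion events and 846 transposition events, together with complete alignments of the sequences in those regions. We also tested by simulation how well the proposed method aligns sequences from species of varying relatedness. The results show that the method is suitable for aligning sequences of species more closely related than rat and mouse.
    In summary, the computational methods proposed in this dissertation can effectively assemble genomes from short read sequencing data and fully align sequences that contain inversions and transpositions. In addition, the breakpoint positions of inversion and transposition events can be detected.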

    DNA sequencing is an important technique in biological studies. Assembling sequencing data into a genome yields much useful information about an organism’s genome, such as its size, DNA composition, and content. Comparisons of genomes between species or individuals have shown that structural variation (SV) plays an important role in gene expression, disease, and genome evolution. With the great progress of sequencing technologies, several short read sequencing platforms have emerged in the past few years. These platforms generate huge amounts of data at much lower cost than traditional Sanger sequencing, although their reads are much shorter. The vast amount of sequencing data and the short read length pose many computational challenges for genome assembly. Further, when the genomes being compared contain complex rearrangement SVs, such as inversions and transpositions, traditional alignment algorithms cannot align the genomic sequences well, so the breakpoints cannot be identified and the rearrangements cannot be recovered directly. Therefore, we propose two new computational methods: one reconstructs a sequenced genome from short read sequencing data, and the other detects the breakpoints of inversion or transposition events in sequences for genome comparison.
    In the first work, we propose a genome assembly algorithm that adopts a new extension approach called “jumping extension” to assemble reads ≥100 bp efficiently. Jumping extension, the kernel of the proposed method, groups the reads that are most likely to have been sequenced from the same region and extends by more than one hundred bases at a time. During read extension, low-quality nucleotides are dynamically trimmed from the 3'-end of a read, which improves the connectivity of the reads. Empirical and simulation studies show that the proposed algorithm achieves both better contig quality and lower memory usage than many popular methods.
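    The two ingredients named above, 3'-end quality trimming and overlap-based extension, can be illustrated with a minimal Python sketch. This is not the jumping extension algorithm of the dissertation: it is a plain greedy suffix-prefix extension on one strand, and the function names, the Phred threshold `min_q`, and the minimum overlap `min_len` are all assumptions chosen for illustration.

```python
def trim_3prime(seq, quals, min_q=20):
    """Drop trailing (3'-end) bases whose Phred quality is below min_q.

    Only the run of low-quality bases at the end is removed; a low-quality
    base in the middle of the read is left in place.
    """
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end]


def overlap_len(left, right, min_len):
    """Length of the longest suffix of `left` that is a prefix of `right`.

    Returns 0 when no overlap of at least min_len bases exists.
    """
    for k in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def extend_seed(seed, reads, min_len=3):
    """Greedily extend `seed` to the right, using each read at most once.

    Hypothetical simplification: at every step the read with the longest
    overlap is appended, with no repeat detection or error handling.
    """
    contig = seed
    unused = list(reads)
    while True:
        best, best_k = None, 0
        for read in unused:
            k = overlap_len(contig, read, min_len)
            # The read must contribute at least one new base.
            if k > best_k and k < len(read):
                best, best_k = read, k
        if best is None:
            return contig
        contig += best[best_k:]
        unused.remove(best)


if __name__ == "__main__":
    print(trim_3prime("ACGTAC", [30, 30, 30, 30, 10, 5]))
    print(extend_seed("ACGTTG", ["GTTGCAAG", "CAAGGCT"]))
```

    A real short-read assembler would also consider reverse-complemented reads, tolerate sequencing errors in the overlap, and stop at repeats; the sketch leaves all of that out to keep the extension step visible.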
    In the second work, we propose a pairwise alignment algorithm that can align sequences containing rearrangement (inversion or transposition) events. The breakpoints of the rearrangement events are estimated by aligning the breakpoint regions. One sequence is then un-shuffled according to the corresponding breakpoints, which yields the alignment of the sequences as they were before the possible rearrangement events occurred. Using data from the UCSC website, we identified 130 simple inversion breakpoints and 846 simple transposition breakpoints between the human and chimpanzee genomes. We also evaluated the method on several pairs of species, and the results show that it is suitable for species that are as conserved as mouse and rat.
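    The un-shuffling step can be pictured with a short Python sketch. It assumes the breakpoints are already known and given as 0-based, half-open coordinates; the function names and that coordinate convention are illustrative assumptions, and the breakpoint estimation itself, which is the core of the proposed method, is not shown.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def unshuffle_inversion(seq, start, end):
    """Undo an inversion by reverse-complementing seq[start:end] in place."""
    return seq[:start] + reverse_complement(seq[start:end]) + seq[end:]


def unshuffle_transposition(seq, start, end, target):
    """Undo a transposition by moving seq[start:end] back to `target`.

    `target` is an index into the sequence that remains after the segment
    has been cut out (a hypothetical convention chosen for this sketch).
    """
    segment = seq[start:end]
    rest = seq[:start] + seq[end:]
    return rest[:target] + segment + rest[target:]


if __name__ == "__main__":
    original = "AAACCGTTT"
    inverted = "AAACGGTTT"    # bases 3..6 of `original` were inverted
    transposed = "AAATTTCCG"  # "CCG" was moved to the end
    print(unshuffle_inversion(inverted, 3, 6) == original)
    print(unshuffle_transposition(transposed, 6, 9, 3) == original)
```

    Once a sequence has been restored this way, it is collinear with its counterpart again, so a conventional pairwise aligner can be applied to the whole region.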
    In this dissertation, we develop a series of computational methods for studying genome organization and variation. The proposed methods can efficiently reconstruct an organism’s genome from short read sequencing data and detect the breakpoints of inversions and transpositions at the nucleotide level. In the future, the methods can be applied to sequencing data obtained from different platforms. In addition, our study provides a basis for developing new methods to handle more complex genome variations made up of combinations of SVs.

    Chapter 1 Introduction
        1.1 Genome sequencing, assembly, and structural variation
            1.1.1 Impacts of genome assembly and structural variations
            1.1.2 Computational challenges to genome assembly and structural variation detection
        1.2 Motivation
        1.3 Objectives
        1.4 Organization of this dissertation
    Chapter 2 Background
        2.1 Genome assembly
        2.2 Detection of structural variations
            2.2.1 Computational methods for detection of structure variation
        2.3 Limitations of previous methods
    Chapter 3 Genome assembly of short sequence reads
        3.1 Introduction
        3.2 Methods
            3.2.1 Build unique read table
            3.2.2 Seed selection
            3.2.3 Seed extension
            3.2.4 Repeat detection
            3.2.5 Contig merging
        3.3 Results and discussion
            3.3.1 Comparison with current methods using SRS data from small and medium size genomes
            3.3.2 Effects of repeat, coverage, and sequencing error on assembly quality
            3.3.3 Assembly of large genomes
            3.3.4 Strategies of seed selection
            3.3.5 Memory usage of JR-Assembler
        3.4 Summary
    Chapter 4 Aligning pairwise genomic sequences containing rearrangement events
        4.1 Introduction
        4.2 Methods
            4.2.1 Match identification and merging
            4.2.2 Inversion identification and un-shuffling
            4.2.3 Transposition identification and un-shuffling
        4.3 Results and discussion
            4.3.1 Simulation results
            4.3.2 Human and chimpanzee genomic sequences
        4.4 Summary
    Chapter 5 Conclusions
        5.1 Contributions
        5.2 Future works
    Bibliography
    List of Publications
