Author (graduate student): 黃誼安
Thesis title: Utilizing BLAST to Extract Citation Metadata from Online Publication Lists
Advisors: 何建明 (Ho, Jan-Ming); 陳世旺 (Chen, Shi-Wang)
Degree: Master
Department: Department of Computer Science and Information Engineering
Year of publication: 2003
Academic year of graduation: 91 (ROC calendar)
Language: Chinese
Number of pages: 46
Chinese keywords: 基因, BLAST, 著作表列, 目錄學, 後設資料
English keywords: gene, BLAST, publication list, bibliography, metadata
Thesis type: Academic thesis
  • 科學家相互引用文獻和研究結果,是科學得以迅速發展的重要因素。因此,書目表單(citation list)或文獻目錄(bibliography)無疑是學者的重要工具。一般常見的書目(citation)資料,通常記載著作者(author)、標題(title)、出版資訊(publication information)等訊息。出版資訊隨著出版形式不同(例如書本、期刊、研討會論文集、叢書、研究報告、技術報告等),而有種種變化,其內容則包括期刊或研討會名稱、冊別、編號、頁數、出版年月、出版商、出版地點等。這些扼要描述文獻背景訊息的後設資料(metadata),通常有結構化(structured)和半結構化(semi-structured)等兩種呈現形式。結構化的書目,可以資料庫或欄位式的表單作為代表;半結構化的文獻目錄,則以連續字串的形式呈現,其形式比較自由。因此,不同的學者在描述同一筆文獻的時候,可能會寫出兩筆外觀看來很不一致的書目資料。不止後設資料屬性的前後次序會有變化,連使用到的屬性也可能有所不同。
    然而出現在網路上的文獻目錄,絕大多數卻都屬於半結構化的形式。若要加值運用,就得先將半結構化的文獻目錄,剖析和轉換成為一致的結構化形式,並分析彼此參照的關係和建立索引,以提供文獻搜尋和引用統計等資訊服務。本論文擬探討如何將半結構化文獻目錄,轉換成為一致的結構化資料。這是書目資料處理的核心問題。
    由於書目資料型態眾多,想要自動將半結構化的書目轉換成結構化的資料實為不易。為了辨識書目後設資料,我們的基本構想是運用基因比對技術來解決這個書目資料辨識的問題。也就是將半結構化書目轉成蛋白質序列(protein sequence)。將已知的書目資料的樣板,則轉換成蛋白質序列,儲存於樣板資料庫中(template database)。當必須解析新的半結構化的書目時,則可將新的書目轉換成蛋白質序列。再以BLAST這項序列比對工具,從事先建立好的樣板資料庫中,找出與該蛋白質序列最相近的樣板。最後根據此樣板作後設資料的解析。
    這樣的處理方式讓系統更有彈性,不僅可以輕易加入新的書目樣板,也可以快速找到最相近的樣板作為解析後設資料的依據。解析結果的準確率會因樣本資料庫的完整度而有所不同,也會因為計分表的設計而有所偏差,更會因測試資料的型態不同(例如含中文姓氏的著作表列與不含中文姓氏的著作表列)而形成不一樣的結果。本論文在這些議題上作了一些測試,在最理想的狀況下本系統可以達到91.2%的準確率,而OpCit的系統準確率在理想狀況下卻僅能達到75%。相反的在樣板資料庫完整度低的情況下(樣板完整度百分之五十),而且使用不利的測試資料,本系統的準確率降到38.2%,而OpCit系統為6%。

    Scientists' citation of one another's publications and research results is an important factor in the rapid development of science. Citation lists and bibliographies are therefore essential tools for scholars. A citation typically records the author, the title, and publication information. Publication information varies with the type of publication (e.g., book, journal paper, conference paper, series, research report, or technical report) and includes the journal or conference name, volume, number, pages, year and month of publication, publisher, and place of publication. This metadata, which briefly describes the background of a document, is usually presented in either structured or semi-structured form. Structured citations are typified by databases or field-based tables; semi-structured citations are presented as continuous strings, which is a freer form. As a result, different scholars describing the same document may write two citations that look quite inconsistent: not only may the order of the metadata attributes differ, but the attributes used may differ as well.
    However, the vast majority of bibliographies on the Internet are in semi-structured form. To make further use of this citation data, we must first parse the semi-structured bibliographies and convert them into a consistent structured form, then analyze the citation relationships among them and build an index to support services such as literature search and citation statistics. This thesis investigates how to transform semi-structured bibliographies into consistent structured data, which is the core problem of citation data processing.
    Because citation formats are so varied, automatically transforming semi-structured citations into structured data is difficult. Our basic idea for recognizing citation metadata is to apply gene sequence alignment techniques. Templates derived from known citation data are converted into protein sequences and stored in a template database. When a new semi-structured citation is to be parsed, it is likewise converted into a protein sequence, and BLAST, a sequence alignment tool, is used to find the template in the pre-built template database that is most similar to that sequence. Finally, the metadata is parsed according to that template.
    This approach makes the system more flexible: new citation templates can be added easily, and the most similar template can be found quickly as the basis for parsing the metadata. The precision of the parsing results depends on the completeness of the template database, the design of the scoring table, and the type of test data (e.g., publication lists with or without Chinese surnames). This thesis reports experiments on these issues. Under ideal conditions, the precision of our system reaches 91.2%, whereas that of the OpCit system reaches only 75%. When the template database is incomplete (about 50% completeness) and unfavorable test data are used, the precision of our system drops to 38.2% and that of the OpCit system to 6%.
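
    The template-matching step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the token-to-letter mapping in `encode`, the two example templates, and the scoring parameters are all hypothetical, and a plain Smith-Waterman local alignment with a match/mismatch (binary) score stands in for BLAST so that the example runs without external tools. It covers only the selection of the most similar template, not the final field extraction.

```python
import re

# Hypothetical token-class alphabet (not the thesis's actual conversion table).
PUNCT = {",": "G", ".": "P", '"': "Q", ":": "D", "-": "H"}


def encode(citation):
    """Map each token of a citation string to one amino-acid letter."""
    out = []
    for tok in re.findall(r"\d+|[A-Za-z]+|[^\sA-Za-z\d]", citation):
        if tok.isdigit():
            out.append("Y" if len(tok) == 4 else "N")  # 4 digits: likely a year
        elif tok.isalpha():
            if len(tok) == 1:
                out.append("I")  # single letter: an initial
            else:
                out.append("C" if tok[0].isupper() else "L")
        else:
            out.append(PUNCT.get(tok, "S"))  # any other symbol
    return "".join(out)


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score (stand-in for a BLAST search)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


def best_template(citation, templates):
    """Return the name of the template whose sequence aligns best."""
    query = encode(citation)
    return max(templates,
               key=lambda name: local_alignment_score(query, templates[name]))


# Hypothetical template database: template name -> encoded example citation.
templates = {
    "quoted-title": encode(
        'J. Smith, "Some title words," Journal Name, Vol. 12, pp. 34-56, 1999.'),
    "year-first": encode(
        "Smith J (1999) Some title words. Journal Name 12:34-56"),
}

query = ('K. D. Bollacker, "An autonomous web agent," IEEE Computer, '
         'Vol. 32, pp. 67-71, 1999.')
print(best_template(query, templates))
```

    In this sketch, adding a new citation style amounts to adding one more entry to `templates`, which mirrors the flexibility claimed above. Replacing the match/mismatch score with a substitution matrix would correspond to changing the scoring table.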

    Table of Contents
    Chapter 1 Introduction
        1.1 Research Motivation
        1.2 Problem Statement
        1.3 Organization of the Thesis
    Chapter 2 Related Work
        2.1 Automatic Citation Indexing Tools
            2.1.1 OpCit
            2.1.2 CERN
        2.2 Gene Sequence Alignment Tools
    Chapter 3 System Architecture Design
        3.1 System Workflow
        3.2 Background Knowledge
        3.3 Gene Sequence Conversion
        3.4 Scoring Table
        3.5 Template
        3.6 Pattern Extraction
    Chapter 4 Experiments and Results
        4.1 Test Data Generation
        4.2 Experimental Design
        4.3 Results
    Chapter 5 Conclusions and Future Work
        5.1 Conclusions
        5.2 Future Work
    References
    Appendices
        Appendix A Publication lists with Chinese surnames used for testing in this study
        Appendix B Publication lists without Chinese surnames used for testing in this study

    List of Figures
        Figure 1.1 Publication input example
        Figure 1.2 BibTeX format example
        Figure 1.3 Publication list
        Figure 3.1 System flowchart
        Figure 3.2 Parsing result output
        Figure 3.3 System scoring table
        Figure 3.4 Template database
        Figure 3.5 Citation-to-gene conversion array
        Figure 4.1 Accuracy distribution of our system on publication lists with Chinese surnames, 10-fold cross-validation
        Figure 4.2 Accuracy distribution of our system on publication lists with Chinese surnames, 5-fold cross-validation
        Figure 4.3 Accuracy distribution of our system on publication lists with Chinese surnames, 2-fold cross-validation
        Figure 4.4 Accuracy distribution of OpCit on publication lists with Chinese surnames, 10-fold cross-validation
        Figure 4.5 Accuracy distribution of OpCit on publication lists with Chinese surnames, 5-fold cross-validation
        Figure 4.6 Accuracy distribution of OpCit on publication lists with Chinese surnames, 2-fold cross-validation
        Figure 4.7 BLOSUM62 scoring table
        Figure 4.8 Binary scoring table
        Figure 4.9 Accuracy distribution of our system parsing Chinese publication lists under ideal conditions
        Figure 4.10 Accuracy distribution of the OpCit system on the test data of this experiment
        Figure 4.11 Effect of the surname database on the system database
        Figure 4.12 Testing publication lists with and without Chinese surnames using template databases of different completeness
        Figure 4.13 Finding Topi in publication lists with Chinese surnames
        Figure 4.14 Finding Topi in publication lists without Chinese surnames

    List of Tables
        Table 4.1 Results for each data group in k-fold cross-validation of our system
        Table 4.2 Results for each data group in k-fold cross-validation of OpCit
        Table 4.3 Accuracy obtained with different scoring tables
        Table 4.4 Quality of the parsing results of our system
        Table 4.5 Quality of the parsing results of OpCit
