
Graduate student: Chang, Yu-Cheng (張祐誠)
Thesis title: Identifying High Impact Drug Sellers in Dark Net Marketplaces Using Gradient Boosting Machine
Advisor: 侯文娟
Degree: Master
Department: Department of Computer Science and Information Engineering
Year of publication: 2020
Academic year of graduation: 108 (ROC calendar)
Language: Chinese
Number of pages: 61
Chinese keywords (translated): darknet, darknet marketplace, shopping website, drugs, gradient boosting machine
English keywords: darknet, dnm, marketplace, gbm, gradient boosting machine, XGBoost
DOI URL: http://doi.org/10.6345/NTNU202001470
Thesis type: Academic thesis
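  • This research assembles several software components into a crawler for darknet marketplaces (DarkNet Marketplaces, hereafter DNM). The crawler bypasses mechanisms such as identity authentication, cookie expiration, and crawler rejection (robots.txt), and programmatically retrieves the HTML files the researcher needs. The Jsoup library then parses the required page fields and converts them to JSON, which is stored in a local database. Elasticsearch (hereafter ES) is accessed through built-in cURL commands, which eases later maintenance, backup, and automation of training.

    The collected data are then used to train a decision-tree machine learning model with a gradient boosting machine (GBM). The model extracts features from the data, identifies the factors behind high impact, and attempts to predict the likely future impact of new posts in each DNM, from which a ranking of sellers by impact within that DNM is produced. This research attempts to use the biological half-life of a drug as the basis for impact: a program retrieves the drug's half-life from trusted websites and converts it into a degree of addictiveness, which serves as the quantified impact of the drug on society. Earlier researchers quantified the impact of opioids by using each drug's equivalent dose (potency) relative to morphine as the impact reference. Because this research aims at a broader definition of drugs, biological half-life was chosen as the quantified impact instead.

    Starting from onion live, this research selects five DNMs with differing characteristics and attempts to build an unrestricted crawler architecture so that later researchers can obtain data easily. XGBoost is used to train a separate GBM model for each DNM: 90% of each DNM's data is randomly selected as training data and the remaining 10% as test data. Precision, Recall, and F1 score are computed, and the F1 score reaches up to 95%.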

    This research integrates several software components into a crawler for DarkNet Marketplaces (DNM). The crawler first bypasses the authentication, cookie expiration, and crawler rejection (robots.txt) mechanisms. It retrieves the HTML files needed by this research, which are then handed to the Jsoup library to parse the required fields and convert them to JSON. The data are then stored in a local Elasticsearch (ES) database through curl commands to facilitate subsequent maintenance, backup, and automation of training.
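    The parse-then-store step described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: Python's standard `html.parser` stands in for Jsoup, and the field names (`title`, `seller`, `price`), the sample HTML, and the index name `dnm_listings` are all hypothetical. The output follows the newline-delimited JSON layout that the Elasticsearch bulk API accepts (one action line, then one document line).

```python
import json
from html.parser import HTMLParser


class ListingParser(HTMLParser):
    """Collect the text of elements whose class attribute names a wanted field."""

    FIELDS = {"title", "seller", "price"}  # hypothetical field names

    def __init__(self):
        super().__init__()
        self.record = {}
        self._current = None  # field whose text we are waiting for

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in self.FIELDS:
            self._current = cls

    def handle_data(self, data):
        if self._current is not None:
            self.record[self._current] = data.strip()
            self._current = None


def parse_listing(html):
    """Return a dict of the wanted fields found in one listing page."""
    parser = ListingParser()
    parser.feed(html)
    return parser.record


def to_bulk_payload(records, index_name):
    """Build a newline-delimited JSON body for the ES bulk API."""
    lines = []
    for record in records:
        lines.append(json.dumps({"index": {"_index": index_name}}))  # action line
        lines.append(json.dumps(record))  # document line
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


# Hypothetical listing markup, invented for this example.
SAMPLE_HTML = (
    '<div class="listing">'
    '<span class="title">Sample item</span>'
    '<span class="seller">vendor42</span>'
    '<span class="price">0.01 BTC</span>'
    "</div>"
)

record = parse_listing(SAMPLE_HTML)
payload = to_bulk_payload([record], "dnm_listings")
print(payload)
```

    Assuming a local ES node on its default port, a payload saved to a file could then be posted with curl, for example `curl -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/_bulk --data-binary @payload.ndjson`.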

    The collected data are then used to train a decision-tree model with a gradient boosting machine (GBM). The model extracts features from the data, identifies the causes of high impact, and tries to predict the possible future impact of new posts in each DNM. Finally, the sellers in each DNM are ranked by impact value.
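    The final ranking step can be sketched as below. The abstract does not say how per-post predictions are combined into a seller score, so summing the predicted impact of each seller's posts is an assumption made only for this example, as are the seller names and values.

```python
from collections import defaultdict


def rank_sellers(posts):
    """Sum predicted impact per seller and sort sellers from highest to lowest.

    Summation is an assumed aggregation rule, not one stated in the abstract.
    """
    totals = defaultdict(float)
    for seller, predicted_impact in posts:
        totals[seller] += predicted_impact
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical (seller, predicted impact) pairs, one per listing.
posts = [("alice", 0.5), ("bob", 0.25), ("alice", 0.25), ("carol", 1.0)]
ranking = rank_sellers(posts)
print(ranking)
```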

    This research attempts to use the biological half-life of a drug as the basis for impact. A program obtains the drug's half-life from trusted websites and converts it into a level of addiction, which represents the quantitative impact of the drug on society.
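    For readers unfamiliar with the term, the sketch below illustrates only the standard definition of a half-life: the time after which half of a substance remains, under simple exponential decay. The thesis's conversion from half-life to addiction level is not given in this abstract and is not reproduced here. The function name and the example values are hypothetical.

```python
def remaining_fraction(elapsed_hours, half_life_hours):
    """Fraction of a substance still present after elapsed_hours.

    Uses the standard exponential-decay definition of half-life:
    the amount halves every half_life_hours.
    """
    if half_life_hours <= 0:
        raise ValueError("half-life must be positive")
    return 0.5 ** (elapsed_hours / half_life_hours)


# Two half-lives have elapsed, so one quarter remains.
fraction = remaining_fraction(6, 3)
print(fraction)
```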

    Previous researchers quantified the impact of opioids by using the equivalent dose (potency) of each drug relative to morphine as the impact reference. Because this study explores a broader definition of drugs, the biological half-life is used as the quantitative impact instead.

    This research uses onion live as a starting point and selects DNMs with different characteristics in order to construct an unrestricted crawler architecture that helps subsequent researchers obtain data.

    This research uses XGBoost to train a separate GBM model for each DNM, randomly taking 90% of each DNM's data as training data and the remaining 10% as testing data. The evaluation metrics are Precision, Recall, and F1 score; an F1 score of 95% was achieved.
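    The evaluation protocol above can be illustrated with a short sketch using only the Python standard library. The XGBoost training step itself is omitted; the labels and predictions are made-up values (1 for high impact, 0 otherwise), and the fixed seed is an assumption added for reproducibility.

```python
import random


def split_90_10(items, seed=0):
    """Shuffle items and return (90% training part, 10% testing part)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 9 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


def precision_recall_f1(y_true, y_pred):
    """Compute Precision, Recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# 100 placeholder post IDs split into training and testing parts.
train_items, test_items = split_90_10(range(100))

# Made-up labels and predictions for the 10 held-out posts.
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(len(train_items), len(test_items), precision, recall, f1)
```

    In this toy case there are 4 true positives, 1 false positive, and 1 false negative, so Precision and Recall are both 4/5 and F1, their harmonic mean, is also about 0.8.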

    1 Introduction
    2 Related Work
        2.1 D-miner
        2.2 Hidden service domain analysis
        2.3 Identifying high impact opioid
        2.4 Trojanized version of the torbrowser
        2.5 DeepDotWeb
        2.6 IRS-CI
    3 Dataset
        3.1 IPv6 disable
        3.2 Privoxy
            3.2.1 listen address
            3.2.2 socks 5 port forwarding
            3.2.3 restart privoxy
        3.3 torbrowser
        3.4 Elasticsearch, ES
            3.4.1 index API
            3.4.2 delete API
            3.4.3 search API
            3.4.4 bulk API
            3.4.5 max result window
        3.5 Jsoup.jar
            3.5.1 Main packages used
        3.6 JAVA EE json
            3.6.1 Main packages used
        3.7 Python packages
    4 Crawler Architecture
        4.1 Obtaining the headers needed to connect to a DNM
        4.2 Fetching HTML files with org.jsoup.Connection
        4.3 Importing from Jsoup into ES via the bulk API
        4.4 DNM anti-crawler measures and their countermeasures
            4.4.1 Missing data pages
            4.4.2 Username/password login and image CAPTCHA
            4.4.3 Periodic DDOS CAPTCHA checks
    5 Biological Half-Life and Product Impact
        5.1 Biological half-life
        5.2 Methods for obtaining the half-life
        5.3 Definition of impact
        5.4 Threshold for high impact
    6 Model Training
        6.1 Theoretical basis of XGBoost
        6.2 Data cleaning
        6.3 One-hot encoding of categorical variables
        6.4 Interpreting the decision trees output by XGBoost
        6.5 Boosting weak classifiers
        6.6 Training each DNM separately
        6.7 Setting the number of high-impact product posts
        6.8 Simulating prediction of newly posted product posts
        6.9 Completing XGBoost training
    7 Experimental Results
        7.1 Analysis of the causes of differences in the experimental results
    8 Conclusion and Future Work
    References

