| Field | Value |
|---|---|
| Graduate Student | 吳巧心 Wu, Chiao-Hsin |
| Thesis Title | 整合生成式人工智慧於多語言文本圖像客製化 (Integration of Generative Artificial Intelligence Models: Multilingual Accurate Textual Image Customization) |
| Advisors | 賴以威 Lai, I-Wei; 伊藤一秀 Kazuhide ITO |
| Oral Defense Committee | 賴以威 Lai, I-Wei; 林政宏 Lin, Cheng-Hung; 康立威 Kang, Li-Wei; 伊藤一秀 Kazuhide ITO; 宮崎隆彥 Takahiko MIYAZAKI; 池本直樹 Naoki IKEGAYA; 王冬 Wang, Dong |
| Oral Defense Date | 2025/01/07 |
| Degree | Master |
| Department | 電機工程學系 Department of Electrical Engineering |
| Year of Publication | 2025 |
| Graduation Academic Year | 113 (ROC calendar) |
| Language | English |
| Number of Pages | 69 |
| Keywords (Chinese) | Multimodal Large Language Model, Diffusion Model, Chain-of-Thought, Text-to-Image |
| Keywords (English) | Multimodal Large Language Model, Stable Diffusion, Chain-of-Thought, Text-to-Image |
| Research Method | Experimental design |
| Thesis Type | Academic thesis |
Generative Artificial Intelligence has advanced rapidly, enabling the automatic creation of content and profoundly affecting creativity, communication, and productivity. Nevertheless, it still faces many challenges, particularly in producing accurate and reliable textual images. Current research often struggles with non-Latin scripts and cannot effectively improve text accuracy, which limits the practical applications of images that contain text.
To address these limitations, this study proposes the Multilingual Accurate Textual Image Customization (MATIC) system, which combines Multimodal Large Language Models with diffusion models and adopts a Chain-of-Thought reasoning approach, decomposing complex tasks into manageable sub-tasks to improve the accuracy of multilingual textual elements.
Experimental results show that MATIC achieves over 95% text accuracy across multiple languages, outperforming existing models and maintaining stable accuracy regardless of text length. In addition, the system includes a grid system that improves the precision of image analysis and can support image analysis in academic research. Together, these innovations make MATIC a transformative tool, with applications ranging from academic research to cross-linguistic communication.
Generative Artificial Intelligence (GAI) has rapidly advanced, enabling autonomous content creation that impacts creativity, communication, and productivity. However, notable challenges remain, particularly in achieving accurate textual images. Current research often struggles with non-Latin scripts and fails to effectively improve text accuracy, leaving a gap in GAI’s applicability across diverse visual contexts.
To address these limitations, this study introduces the Multilingual Accurate Textual Image Customization (MATIC) system, which integrates Multimodal Large Language Models with diffusion models. Utilizing a Chain-of-Thought reasoning approach, MATIC decomposes complex tasks into manageable sub-tasks, enhancing the accuracy of multilingual textual elements.
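The abstract describes this pipeline only at a high level. As a rough illustration of the kind of Chain-of-Thought decomposition it mentions (a sketch under stated assumptions, not the thesis's actual implementation), the snippet below asks an off-the-shelf chat model to split a request into sub-tasks and then renders the non-textual content with a Stable Diffusion pipeline. The model identifiers, the prompt wording, and the `decompose_request` / `generate_background` helpers are all assumptions made for illustration.

```python
# Hedged sketch: an LLM first decomposes the request into sub-tasks
# (background description, text strings, placement), then a diffusion model
# renders the background. Model names and prompts are illustrative only.
import json

import torch
from diffusers import StableDiffusionPipeline
from openai import OpenAI  # assumes the openai>=1.0 client

client = OpenAI()


def decompose_request(user_request: str) -> dict:
    """Chain-of-Thought step: split the request into manageable sub-tasks."""
    prompt = (
        "Think step by step and return JSON with keys 'background_prompt', "
        "'texts' (list of strings), and 'positions'.\n"
        f"Request: {user_request}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)


def generate_background(background_prompt: str):
    """Diffusion step: render the non-textual content of the image."""
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")
    return pipe(background_prompt).images[0]


if __name__ == "__main__":
    plan = decompose_request("A poster that says '歡迎光臨' over a night market scene")
    image = generate_background(plan["background_prompt"])
    # A separate compositing stage (not shown) would place each string in
    # plan["texts"] at plan["positions"] so the glyphs stay accurate.
    image.save("draft.png")
```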
Experimental results demonstrate that MATIC achieved over 95% text accuracy across multiple languages, outperforming existing models and maintaining consistent accuracy regardless of text length. Additionally, the system incorporates a grid system that enhances the precision of image analysis, offering valuable support for visual content in academic research. Together, these innovations position MATIC as a transformative tool, with broad applications ranging from advanced research to cross-linguistic communication.
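The abstract does not detail how the grid system works. As one plausible reading (an assumption, not the thesis's actual design), the sketch below overlays a labelled coordinate grid on an image so that a multimodal model or a reader can refer to regions by cell index during analysis; the cell size, labelling scheme, and the `add_grid` helper are all illustrative choices.

```python
# Hedged sketch of a possible "grid system": draw grid lines and label each
# cell with its (column, row) index so regions can be referenced precisely.
from PIL import Image, ImageDraw


def add_grid(path: str, out_path: str, cell: int = 64) -> None:
    img = Image.open(path).convert("RGB")
    draw = ImageDraw.Draw(img)
    w, h = img.size
    # Grid lines every `cell` pixels.
    for x in range(0, w, cell):
        draw.line([(x, 0), (x, h)], fill=(255, 0, 0), width=1)
    for y in range(0, h, cell):
        draw.line([(0, y), (w, y)], fill=(255, 0, 0), width=1)
    # Label each cell in its top-left corner.
    for row, y in enumerate(range(0, h, cell)):
        for col, x in enumerate(range(0, w, cell)):
            draw.text((x + 2, y + 2), f"{col},{row}", fill=(255, 0, 0))
    img.save(out_path)


if __name__ == "__main__":
    add_grid("input.png", "input_grid.png")
```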