Research on the Application of Large Language Model in Data Integration
DOI: https://doi.org/10.4108/eetsis.10245
Keywords: large language models (LLMs), metadata fine-tuning, Transformer, data governance, reinforcement learning, symbolic regression
Abstract
INTRODUCTION: Large Language Models (LLMs), a major breakthrough in artificial intelligence, have been widely applied across various domains in recent years. Their powerful capabilities in language comprehension and generation enable effective handling of diverse natural language processing tasks, such as text generation, question answering, machine translation, and information retrieval. This paper investigates the application of LLM technology in data integration, a core aspect of data governance. In contrast to end-to-end black-box approaches, we reframe data integration as a problem of discovering interpretable mapping rules through symbolic regression.
OBJECTIVES: We begin by defining the fundamental problem of data integration. We then propose a general-purpose large model framework for data governance, built on a deep symbolic regression foundation. The framework comprises a symbolic expression generator and a metadata-enhanced executor, aiming to achieve both high accuracy and interpretability.
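To make the two-component framework concrete, the following is a minimal illustrative sketch in Python. The class and function names (SymbolicExpressionGenerator, MetadataEnhancedExecutor, MappingRule) are hypothetical placeholders chosen for exposition and are not the authors' actual API; the fixed toy rule stands in for rules that the paper's generator would discover.

```python
# Hypothetical sketch of the framework's two components: a generator that
# proposes symbolic mapping rules and an executor that applies them.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class MappingRule:
    """A discovered symbolic mapping rule for one target attribute."""
    expression: str
    apply: Callable[[Dict[str, str]], str]


class SymbolicExpressionGenerator:
    """Proposes candidate mapping rules (here: one fixed toy rule)."""
    def propose(self) -> List[MappingRule]:
        return [MappingRule(
            expression="concat(first_name, ' ', last_name)",
            apply=lambda row: f"{row['first_name']} {row['last_name']}",
        )]


class MetadataEnhancedExecutor:
    """Applies a selected rule to source records to populate a target field."""
    def execute(self, rule: MappingRule, rows: List[Dict[str, str]]) -> List[str]:
        return [rule.apply(row) for row in rows]


rows = [{"first_name": "Ada", "last_name": "Lovelace"}]
rule = SymbolicExpressionGenerator().propose()[0]
print(MetadataEnhancedExecutor().execute(rule, rows))  # ['Ada Lovelace']
```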
METHODS: The symbolic expression generator is trained with a combination of recurrent neural networks and reinforcement learning, while the executor that applies the discovered rules is built on a Transformer encoder architecture enhanced with a dedicated metadata embedding layer. To improve performance, we incorporate metadata fine-tuning, in which the generated symbolic expressions serve as key metadata to guide the integration process.
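A minimal PyTorch sketch of a Transformer encoder with a dedicated metadata embedding layer is shown below. All dimensions, vocabulary sizes, and the choice to sum the metadata embedding into the token embedding are assumptions made for illustration, not the paper's exact architecture.

```python
# Assumed sketch: Transformer encoder conditioned on metadata ids
# (e.g. column names, types, or tokens of the generated symbolic expression).
import torch
import torch.nn as nn


class MetadataAwareEncoder(nn.Module):
    def __init__(self, vocab_size=30000, meta_vocab_size=512,
                 d_model=256, nhead=8, num_layers=4):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Dedicated embedding table for metadata ids.
        self.meta_emb = nn.Embedding(meta_vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, token_ids, meta_ids):
        # Metadata embeddings are added to the value embeddings so the
        # encoder can condition on the discovered mapping rules.
        x = self.token_emb(token_ids) + self.meta_emb(meta_ids)
        return self.encoder(x)


tokens = torch.randint(0, 30000, (2, 16))  # batch of 2 sequences, length 16
metas = torch.randint(0, 512, (2, 16))     # aligned metadata ids
print(MetadataAwareEncoder()(tokens, metas).shape)  # torch.Size([2, 16, 256])
```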
RESULTS: Finally, the proposed model is evaluated on two representative data integration tasks, with experimental results demonstrating its effectiveness.
CONCLUSION: The results validate its practical value and highlight the advantage of the symbolic regression paradigm in enhancing interpretability.
License
Copyright (c) 2026 Zhanfang Chen, Yuan Ren, Xiaoming Jiang, Ruipeng Qi

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This is an open access article distributed under the terms of the CC BY-NC-SA 4.0 license, which permits copying, redistributing, remixing, transforming, and building upon the material in any medium, so long as the original work is properly cited.
