Leveraging Large Language Models for Automated Schema Matching and Data Transformation in Enterprise ETL Pipelines

Annapurneswar Putrevu

Sarcouncil Journal of Multidisciplinary

An open-access, peer-reviewed international journal
Publication Frequency- Monthly
Publisher Name-SARC Publisher

ISSN Online- 2945-3445
Country of origin- PHILIPPINES
Frequency- 3.6
Language- English

Editors

Dr Hazim Abdul-Rahman
Associate Editor
Sarcouncil Journal of Applied Sciences

Entessar Al Jbawi
Associate Editor
Sarcouncil Journal of Multidisciplinary

Rishabh Rajesh Shanbhag
Associate Editor
Sarcouncil Journal of Engineering and Computer Sciences

Dr Md. Rezowan ur Rahman
Associate Editor
Sarcouncil Journal of Biomedical Sciences

Dr Ifeoma Christy
Associate Editor
Sarcouncil Journal of Entrepreneurship And Business Management

Keywords: Schema matching, Large Language Models, ETL automation, Retrieval-Augmented Generation, Data integration.

Abstract: Modern enterprises encounter significant challenges when integrating heterogeneous data sources due to schema variability, inconsistent naming conventions, and noisy real-world datasets that traditionally require extensive manual intervention. This work presents a novel generative AI framework that employs Large Language Models augmented with retrieval-based techniques to automate schema matching, column-level mapping, and transformation rule generation within ETL pipelines. The framework incorporates metadata-aware prompting strategies, domain-specific exemplar retrieval through Retrieval-Augmented Generation (RAG), and iterative self-refinement mechanisms to produce high-quality mapping suggestions alongside executable transformation code in SQL and pandas. Confidence scoring enables effective human-in-the-loop validation, while adaptive feedback mechanisms facilitate continuous improvement from user corrections. The system can handle evolving schemas and multilingual datasets across diverse enterprise domains. Experimental evaluation on synthetic datasets and established benchmarks reveals substantial improvements in matching accuracy, precision, and recall compared to traditional name-similarity metrics and classical machine learning classifiers. The framework addresses critical deployment considerations, including data privacy compliance, operational scalability, and hallucination mitigation strategies. Results indicate significant potential for reliable large-scale enterprise adoption through transparent, auditable automated ETL workflows that maintain data quality standards while reducing manual overhead.
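
As a rough illustration of two ideas named in the abstract, the sketch below implements a plain name-similarity matcher (the kind of traditional baseline the framework is compared against) and uses its score as a confidence value to route low-scoring mappings to human review. It is not the paper's method: the function names, the normalization rule, and the 0.8 threshold are all assumptions chosen for illustration, and the similarity measure is Python's standard `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and drop non-alphanumerics so 'Cust_ID' and 'custid' compare equal."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two normalized column names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_columns(source: list[str], target: list[str], threshold: float = 0.8) -> list[dict]:
    """Pair each source column with its most similar target column.

    The similarity score doubles as a confidence value; anything below
    `threshold` (a hypothetical cutoff) is flagged for human review.
    """
    results = []
    for s in source:
        best = max(target, key=lambda t: similarity(s, t))
        score = similarity(s, best)
        results.append(
            {
                "source": s,
                "target": best,
                "confidence": round(score, 2),
                "needs_review": score < threshold,
            }
        )
    return results


if __name__ == "__main__":
    for row in match_columns(["Cust_ID", "zip"], ["cust_id", "postal_code", "email"]):
        print(row)
```

In this example, `Cust_ID` maps to `cust_id` with full confidence, while `zip` has no lexically similar target and is flagged for review. Cases like `zip` versus `postal_code`, where the names differ but the meaning matches, are where purely lexical baselines fall short and where the abstract positions semantic, LLM-based matching.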
