Date of Award
Fall 2024
Document Type
Open Access Thesis
Department
Computer Science and Engineering
First Advisor
Neset Hikmet
Abstract
The healthcare industry is rapidly transforming due to technology adoption, resulting in an explosion of data. Extract, Transform, Load (ETL) processes are crucial for integrating and analyzing this data to support decision-making and enhance patient care. However, ETL processes face significant challenges, including data diversity, quality issues, security and compliance requirements, and scalability. Opportunities exist to optimize ETL processes through advanced technologies such as big data analytics, containerization, and parallel computing, and to improve data quality and strengthen security. This literature review examines current ETL processes in healthcare, highlighting challenges and opportunities for future improvement, with the ultimate aim of enhancing healthcare outcomes and patient experiences.
In the second study, we examine efficient big data engineering and Extract, Transform, Load (ETL) processes in the healthcare sector, using the MIMIC-III Clinical Database as a foundation. We explore several methods for making ETL processes more efficient, with a primary emphasis on optimizing time and resource utilization. Through experiments on a representative dataset, we show the advantages of incorporating PySpark and Docker containerized applications.
Our results show significant gains in time efficiency, process streamlining, and resource optimization from using PySpark for distributed computing in big data engineering workflows. We also highlight the integration of Docker containers and their role in improving the scalability and reproducibility of the ETL pipeline.
This paper summarizes the key insights from our experiments, emphasizing the practical implications and benefits of adopting PySpark and Docker. By streamlining big data engineering and ETL processes for clinical big data, our study contributes to the ongoing discussion on optimizing data processing efficiency in healthcare applications. The source code is available on request.
Rights
© 2025, Ehsan Soltanmohammadi
Recommended Citation
Soltanmohammadi, E. (2024). Innovative Strategies for Healthcare Data Integration: Enhancing ETL Efficiency Through Containerization and Parallel Computing. (Master's thesis). Retrieved from https://scholarcommons.sc.edu/etd/8159