Automated Reddit Data Annotation with Large Language Models
Document Type
Article
Abstract
In recent years, the United States has seen a significant surge in vaping, or e-cigarette use, leading to a notable rise in cases of e-cigarette and vaping use-associated lung injury (EVALI), which caused hospitalizations and fatalities during the 2019 EVALI outbreak. This highlights the urgency of understanding vaping behaviors and developing effective cessation strategies. Social media platforms are ubiquitous: over 4.7 billion users worldwide rely on them for connectivity, communication, news, and entertainment, and a significant portion of the discourse relates to health, making social media data an invaluable organic resource for public health research. In this study, we extracted a sample dataset from one vaping sub-community on Reddit to analyze users' quit-vaping intentions. We leveraged the latest Large Language Models (LLMs), such as OpenAI's GPT-4, Meta's LLAMA 3.1, Microsoft's Phi 3.5, and Google's Gemini 1.5 and Gemma 2, in addition to traditional machine learning and deep learning models such as BERT, CNN, SVM, and XGBoost, for sentence-level quit-vaping intention detection. This study compares the outcomes of these models against layman and clinical expert annotations. Using prompting strategies including zero-shot, one-shot, few-shot, and chain-of-thought prompting, together with a custom variable called 'detail,' we developed eight prompts to explain the task to the LLMs and evaluated the performance of these strategies against each other. Our preliminary findings emphasize the potential of Generative AI tools in social media data analysis, especially in identifying users' subtle intentions that may elude human detection.
Digital Object Identifier (DOI)
https://doi.org/10.1109/ICHI64645.2025.00049
Publication Info
Published in Proceedings of the 2025 IEEE 13th International Conference on Healthcare Informatics (ICHI 2025), 2025, pages 251-260.
APA Citation
Vuruma, S., Wu, D., Gupta, S. S., Aust, L., Lookingbill, V., Bellamy, W., Ren, Y., Kasson, E., Chen, L.-S., Cavazos-Rehg, P., Hu, D., Liu, H., & Huang, M. (2025). Automated Reddit Data Annotation with Large Language Models. 2025 IEEE 13th International Conference on Healthcare Informatics (ICHI), 251–260. https://doi.org/10.1109/ICHI64645.2025.00049
Rights
© 2025, IEEE