Date of Award


Document Type

Open Access Dissertation


School of Library and Information Science


College of Information and Communications

First Advisor

Amir Karami


Previous studies have documented the relationships among diabetes, diet, exercise, and obesity. Obesity increases people’s risk of developing heart disease and type 2 diabetes. Exercise and proper diet are modifiable lifestyle behaviors that can help reduce people’s overall weight and their risk of various chronic conditions such as diabetes. One national survey is the annual Behavioral Risk Factor Surveillance System (BRFSS), conducted by the Centers for Disease Control and Prevention (CDC). Twitter offers researchers an alternative, real-time data source for collecting information about health behaviors. Previous studies have demonstrated that Twitter can be used to monitor adverse side effects of drugs, tobacco use, and life satisfaction. Twitter can be a cost-effective way to gather information from study participants and collect population-level research data. Few studies have used small-scale Twitter data to retrieve user-generated content regarding diet, diabetes, exercise, and obesity and to characterize the associated topics. This study does so, and a human evaluation of the sentiment analysis and topic results was also conducted. The research questions guiding this study are: RQ1: What are the positive and negative sentiments of Twitter users regarding diet, diabetes, exercise, and obesity (DDEO)? RQ2: What health experiences are prevalent based on Twitter users’ sentiments regarding DDEO? RQ3: How does the performance of the computational tools used for sentiment analysis and topic modeling compare to human performance?

The systematic steps in constructing this surveillance framework are data collection, data cleaning, sentiment analysis, topic discovery, topic analysis, and evaluation. Nearly 15 million tweets were collected through the Twitter API from June 2016 to August 2016. Sentiment analysis and Latent Dirichlet Allocation (LDA) topic modeling were the text mining methods used to answer RQ1 and RQ2. The LDA model allows for the discovery of latent semantic structure in a corpus. In LDA, each topic is characterized by a probability distribution over words, and each document by a distribution over topics; here it was paired with linguistic analysis to capture individuals’ positive and negative sentiments. Eight hundred topics were analyzed (100 for each query term and each sentiment) in the topic analysis step of the framework. Percentage agreement and Cohen’s kappa statistics were used to address RQ3. Five hundred and sixty-eight (or 71%) of the 800 topics were identifiable and related to DDEO. Prevalent topics across DDEO sentiments include lifestyle, childhood obesity, food, and types of diets. Hypothyroidism, dementia, and diabetic retinopathy are additional chronic conditions identified through this framework. The framework supports a different approach to understanding interrelated health topics: drawing them from relatively small-scale Twitter data and characterizing them qualitatively. This surveillance text mining framework can assist clinical and allied health professionals in exploring understudied chronic health issues and identifying latent risk factors.
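
The two agreement statistics named for RQ3 can be illustrated with a minimal sketch. It is not the dissertation's code: the function names and the ten example labels are hypothetical, chosen only to show how percentage agreement and Cohen's kappa differ when a tool's labels are compared with a human coder's.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Share of items on which the two raters gave the same label."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    p_o = percent_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: probability that both raters independently
    # choose the same label, given each rater's own label frequencies.
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        # Kappa is undefined here; returning 1.0 is a convention of this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical sentiment labels for ten items (illustrative data only).
tool = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "pos"]
human = ["pos", "pos", "neg", "pos", "pos", "neg", "pos", "neg", "neg", "pos"]

agreement = percent_agreement(tool, human)
kappa = cohens_kappa(tool, human)
print(f"percent agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```

In this made-up example the raters agree on 8 of 10 items, yet kappa is noticeably lower because part of that agreement would be expected by chance. That gap is the usual reason for reporting both statistics together.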


© 2018, George Shaw Jr.