Mingzhe Du

Date of Award

Spring 2020

Document Type

Open Access Dissertation


Computer Science and Engineering

First Advisor

Jose M. Vidal


In the social sciences, theories are used to explain and predict observed phenomena in the natural world. Theory construction is the research process of building these testable scientific theories. The new conceptual ideas and meanings of a theory are conveyed through carefully chosen definitions and terms.

The principle of parsimony, exemplified by Occam’s Razor, is an important criterion for evaluating the quality of theories. It mandates that we minimize the number of definitions (terms) used in a given theory.

Conventional methods for theory construction and parsimony analysis are based on heuristic approaches. However, it is not always easy for young researchers to fully understand the theoretical work in a given area because of the problem of “tacit knowledge”, which often leaves the resulting work lacking coherence and logical integrity. In this research, we address this problem in three parts.

In the first part of this study, we present Wikitheoria, a generic knowledge aggregation framework that facilitates parsimonious theory construction through a cloud-based theory modularization platform and semantic-based algorithms that minimize the number of definitions. The proposed approach is demonstrated and evaluated using modularized theories from the database and sociological definitions retrieved from the system lexicon and the sociological literature. This part of the study shows the effectiveness of a cloud-based knowledge aggregation system and semantic analysis models in promoting the parsimonious construction of sociological theories.

In the second part, we focus on semantic-based parsimony analysis. We introduce an embedding-based approach that uses machine learning models to reduce the number of semantically similar sociological definitions, with each definition encoded using word embeddings and sentence embeddings. Given that several types of embeddings exist, we compare the resulting encodings of the definitions to understand which embeddings are more suitable for knowledge representation and which classifiers are more capable of capturing semantic similarity in the task of parsimonious theory construction.
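The idea of flagging semantically similar definitions can be illustrated with a minimal sketch. It is not the dissertation's implementation: the bag-of-words vector below is only a toy stand-in for learned word or sentence embeddings, the fixed threshold stands in for a trained classifier, and the function names and sample definitions are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector.

    Assumption: a real system would use learned word or sentence embeddings.
    """
    tokens = [t.strip(".,;:()").lower() for t in text.split()]
    return Counter(t for t in tokens if t)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def redundant_pairs(definitions, threshold=0.8):
    """Return (term, term, score) for definition pairs at or above the threshold.

    The threshold is an illustrative assumption, not a value from the study.
    """
    names = sorted(definitions)
    vectors = {name: embed(definitions[name]) for name in names}
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = cosine(vectors[first], vectors[second])
            if score >= threshold:
                pairs.append((first, second, score))
    return pairs


# Hypothetical definitions, invented for illustration only.
definitions = {
    "status": "a position that an actor holds in a group",
    "rank": "a position that an actor holds within a group",
    "norm": "a shared rule about expected behavior",
}

for first, second, score in redundant_pairs(definitions):
    print(f"{first} ~ {second}: {score:.3f}")
```

Pairs flagged this way are candidates for merging into a single term, which is the sense in which similarity detection supports parsimony.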

In the final part of this study, we propose SOREC, a novel semantic content-based recommendation system (CBRS) that uses a supervised machine learning model to evaluate theoretical parsimony by checking the semantic consistency of definitions as theories are constructed. Specifically, we evaluate an XGBoost tree-based classifier with a combination of low-level and high-level features on our dataset. The proposed CBRS substantially outperforms a conventional matrix factorization-based CBRS in suggesting semantically related sociological definitions. This study provides a solid baseline for future work on computing the semantic similarity of sociological definitions. Moreover, theory construction is a common research process in many human science-related disciplines, such as psychology, criminology, and other social sciences, so the results of this study can be further applied to theory construction in these disciplines.