Date of Award

Fall 2020

Document Type

Open Access Dissertation

Department

Statistics

First Advisor

David B. Hitchcock

Abstract

This dissertation focuses on improving multivariate methods of cluster analysis. In Chapter 3 we discuss methods relevant to the categorical clustering of tertiary data, while Chapter 4 considers the clustering of quantitative data using ensemble algorithms. Lastly, Chapter 5 discusses plans for future research on the clustering of spatial binary data.

Cluster analysis is an unsupervised methodology whose results may be influenced by the types of variables recorded on observations. When categorical data are clustered, the solutions produced may not accurately reflect the structure of the process that generated the data. Increased variability within the latent structure of the data and the presence of noisy observations are two issues that may be obscured within the categories, and both may make clustering solutions for categorical data less accurate. To remedy this, in Chapter 3 a method is proposed that uses statistical smoothing to improve the accuracy of clustering solutions for tertiary data objects. By pre-smoothing the dissimilarities used in traditional clustering algorithms, we show it is possible to produce clustering solutions more reflective of the latent process from which the observations arose. To do this, the Fienberg-Holland estimator, a shrinkage-based statistical smoother, is used along with three choices of smoothing. We show via simulation and an application to diabetes data that the method results in more accurate clusters.
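As a rough illustration of the shrinkage idea behind the Fienberg-Holland estimator, the sketch below smooths a vector of category counts toward a prior distribution. It is a minimal sketch under stated assumptions, not the dissertation's procedure: the function name `fienberg_holland_smooth`, the uniform default prior, and the data-driven shrinkage weight are illustrative choices, and the abstract does not say how the smoothed values are turned into dissimilarities or what the three smoothing choices are.

```python
def fienberg_holland_smooth(counts, prior=None):
    """Shrink observed category proportions toward a prior distribution.

    Hypothetical helper for illustration only. `counts` holds the observed
    counts for each category; `prior` is the target distribution
    (assumed uniform when not supplied).
    """
    n = sum(counts)
    t = len(counts)
    if n <= 0 or t == 0:
        raise ValueError("counts must contain at least one observation")
    if prior is None:
        prior = [1.0 / t] * t  # assumption: uniform shrinkage target

    # Raw (unsmoothed) proportions.
    p_hat = [c / n for c in counts]

    # Data-driven shrinkage constant:
    # K = (1 - sum p_hat^2) / sum (p_hat - prior)^2
    spread = sum((p - q) ** 2 for p, q in zip(p_hat, prior))
    if spread == 0.0:
        return p_hat  # proportions already equal the prior; nothing to shrink
    k_hat = (1.0 - sum(p * p for p in p_hat)) / spread

    # Weighted average of the raw proportions and the prior.
    w = n / (n + k_hat)
    return [w * p + (1.0 - w) * q for p, q in zip(p_hat, prior)]


smoothed = fienberg_holland_smooth([8, 1, 1])
```

With counts such as `[8, 1, 1]`, the smoothed proportions still sum to one, but the dominant category is pulled down toward the prior and the sparse categories are pulled up, which is the effect pre-smoothing relies on when observations are noisy.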

Solutions produced by clustering algorithms may vary regardless of the type of variables observed. Such variation may be due to the clustering algorithm used, the initial starting point of the algorithm, or the type of algorithm used to produce the solutions. Furthermore, it may sometimes be of interest to produce clustering solutions that allow observations to share similarities with more than one cluster. One method proposed to combat these problems and add flexibility to clustering solutions is fuzzy ensemble-based clustering. In Chapter 4, three fuzzy ensemble-based clustering algorithms are introduced for the clustering of quantitative data objects, and their performance is compared with that of the traditional Fuzzy C-Means algorithm. The ensembles proposed here, however, differ from traditional ensemble-based clustering methods in that the clustering solutions produced within the generation process result from supervised classifiers rather than from clustering algorithms. A simulation study and two data applications suggest that, in certain settings, the proposed fuzzy ensemble-based clustering algorithms produce more accurate clusters than the Fuzzy C-Means algorithm.
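To make concrete what it means for an observation to share similarities with more than one cluster, the sketch below computes the standard Fuzzy C-Means membership weights of a single point given fixed cluster centers. It illustrates only the baseline algorithm's membership step, not the proposed ensembles; the function name `fcm_memberships`, the Euclidean distance, and the fuzzifier default `m=2.0` are assumptions made for the example.

```python
import math


def fcm_memberships(point, centers, m=2.0):
    """Fuzzy C-Means membership of one point in each cluster.

    Hypothetical helper for illustration only. Uses the standard update
    u_j = d_j^(-2/(m-1)) / sum_k d_k^(-2/(m-1)), where d_j is the
    Euclidean distance from the point to center j and m > 1.
    """
    if m <= 1.0:
        raise ValueError("the fuzzifier m must be greater than 1")
    dists = [math.dist(point, c) for c in centers]

    # A point sitting exactly on one or more centers belongs fully to them.
    on_center = [i for i, d in enumerate(dists) if d == 0.0]
    if on_center:
        share = 1.0 / len(on_center)
        return [share if i in on_center else 0.0 for i in range(len(centers))]

    power = 2.0 / (m - 1.0)
    inverse = [d ** (-power) for d in dists]
    total = sum(inverse)
    return [v / total for v in inverse]


weights = fcm_memberships((0.0, 0.0), [(1.0, 0.0), (3.0, 0.0)])
```

The weights sum to one across clusters, and a point closer to one center receives a larger share of membership there without being assigned to it exclusively.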

In both of the aforementioned cases, only the types of variables recorded on each object were of importance in the clustering process. In Chapter 5, both the types of variables recorded and their spatial nature are of importance. An idea is presented that combines applications of geodesics with categorical cluster analysis to deal with the spatial and categorical nature of observations. The focus in this chapter is on producing an accurate method of clustering the binary and spatial data objects found in the Global Terrorism Database.
