Date of Award

2014

Document Type

Open Access Dissertation

Department

Statistics

First Advisor

David Hitchcock

Abstract

We give a brief introduction to cluster analysis and then propose and discuss several methods for clustering mixed data. In particular, we describe a model-based clustering method for mixed data based on Everitt's (1988) work, and we use simulated annealing to estimate the parameters of Everitt's model. To keep the parameter estimates from being drawn to extremes, we propose combining a penalized log likelihood with the simulated annealing method. Everitt's approach and the proposed method are compared on their performance in clustering simulated data. We then apply the penalized log likelihood method to a heart disease data set and compare the resulting clustering to an expert's diagnosis using the Rand Index. We also describe an extension to Gower's (1971) coefficient, based on Kaufman and Rousseeuw's (1990) definition, which allows other types of variables to be used in the coefficient. A clustering algorithm based on the extended Gower coefficient is applied to a buoy data set to see how well this method classifies a variety of sites into their "true" regions. Finally, we present a simulation study showing how the method based on the extended Gower coefficient performs in clustering data with a variety of structures.
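
The comparison between a clustering result and a reference labeling by the Rand Index can be illustrated with a minimal sketch. It is not taken from the dissertation: the function name `rand_index` and the toy labels are hypothetical, and the code assumes the standard unadjusted definition, that is, the proportion of pairs of observations on which the two partitions agree (both place the pair together, or both place it apart).

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Unadjusted Rand Index between two partitions of the same observations.

    Hypothetical helper for illustration only; the label values themselves
    do not matter, only which observations share a label.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same observations")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two observations are required")

    agreements = 0
    pairs = 0
    for i, j in combinations(range(n), 2):
        together_a = labels_a[i] == labels_a[j]
        together_b = labels_b[i] == labels_b[j]
        # A pair counts as an agreement when both partitions treat it alike.
        if together_a == together_b:
            agreements += 1
        pairs += 1
    return agreements / pairs


# Toy example (made-up labels, not the heart disease data).
cluster_labels = [0, 0, 1, 1]
expert_labels = ["b", "b", "a", "a"]
print(rand_index(cluster_labels, expert_labels))
```

Because the index depends only on pairwise co-membership, relabeling the clusters leaves it unchanged, so cluster numbers need not be matched to the expert's categories before the comparison.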
