Date of Award
2020
Document Type
Open Access Dissertation
Department
Statistics
First Advisor
David Hitchcock
Abstract
Classification problems arise across many industries and disciplines. A classification model attempts to predict the class of an outcome variable from a set of predictors. A number of classification models are available. However, because the underlying population distribution of the predictors is always unknown, it is difficult to know which model best fits a given situation. Several studies have examined which supervised model performs better on specific datasets, but little work has been done to compare the models’ performance for predicting one or more outcomes under multivariate settings.
This study compares the performance of seven popular statistical learning models used for classification when the dataset is from a multivariate population. The models are: K-nearest neighbor, logistic regression, support vector machines, linear discriminant analysis, random forest, adaptive boosting and gradient boosting. We compare these methods under three distribution settings, namely the multivariate normal, multivariate t and multivariate log-normal distributions, for both binary (k = 2) and multiclass (k = 5) outcomes. Three sample sizes, n = 100, 300 and 500, are considered along with two numbers of predictors, p = 3 and 10, to check whether performance changes with sample size and number of predictors. We also compare the models for balanced and unbalanced datasets under these settings. The models are evaluated using two criteria: accuracy, which works best for balanced datasets, and Cohen’s kappa coefficient, which is better suited to unbalanced datasets.
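To illustrate why the two criteria can disagree on unbalanced data, the following is a minimal sketch in plain Python, not code from the dissertation. The function names and the toy labels are hypothetical. It uses the standard definition of Cohen's kappa, (p_o - p_e) / (1 - p_e), where p_o is the observed agreement (the accuracy) and p_e is the agreement expected by chance from the class frequencies.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: sum over classes of P(true = c) * P(pred = c).
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        # Kappa is undefined when only one class occurs in both vectors;
        # returning 0.0 is a convention chosen for this sketch.
        return 0.0
    return (p_o - p_e) / (1 - p_e)

# Hypothetical unbalanced data: 90 cases of class 0, 10 of class 1,
# and a classifier that always predicts the majority class.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

print(accuracy(y_true, y_pred))     # high, despite ignoring class 1
print(cohen_kappa(y_true, y_pred))  # no better than chance
```

In this toy case the majority-class predictor looks strong by accuracy but receives a kappa of zero, which is the behavior that motivates reporting kappa for unbalanced datasets.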
A 10-fold cross-validation technique is used, where the data is randomly split into a 75% training set and a 25% testing set to test the models’ prediction skills on new data. The model parameters are tuned under each setting to obtain the best-performing model. Boxplots are used to show the spread of performance metrics calculated from multiple iterations. It is found that for the multivariate normal distribution, boosting algorithms are superior to the others, whereas for the multivariate t distribution, support vector machines are preferable. For the multivariate log-normal distribution, all models perform well, but the random forest algorithm is better than the rest in most cases. The preferred model changes as the number of classes increases, depending on whether the dataset is balanced or unbalanced. Overall, the gradient boosting algorithm and random forest achieve good accuracy in all settings. The models are further applied to a real dataset of heart disease patients to verify the results of the simulation. Gradient boosting produces the highest prediction accuracy among all seven models, and K-nearest neighbor has the narrowest spread of accuracies.
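The abstract does not spell out how the 75%/25% split and the 10-fold cross-validation are combined. One common arrangement, assumed here purely for illustration, is to hold out 25% of the data for testing and to run 10-fold cross-validation on the remaining 75% for parameter tuning. The sketch below uses only the Python standard library; the function names and the seed are hypothetical.

```python
import random

def train_test_split_indices(n, test_fraction=0.25, seed=0):
    """Randomly split the indices 0..n-1 into training and testing parts."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]

def k_fold_indices(indices, k=10):
    """Yield (fit, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        fit = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield fit, validation

# n = 100 is the smallest sample size considered in the study.
train_idx, test_idx = train_test_split_indices(100)
cv_splits = list(k_fold_indices(train_idx, k=10))

print(len(train_idx), len(test_idx))  # sizes of the two parts
print(len(cv_splits))                 # number of cross-validation folds
```

Under this assumed arrangement, each candidate parameter setting would be scored on the ten validation folds, and the testing indices would be used only once, for the final accuracy and kappa.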
Rights
© 2020, Nubaira Rizvi
Recommended Citation
Rizvi, N. (2020). An Empirical Comparison of Machine Learning Models for Classification. (Doctoral dissertation). Retrieved from https://scholarcommons.sc.edu/etd/7879