Diagnostics and Model Selection for Generalized Linear Models and Generalized Estimating Equations

Chelsea Boquet Deroche, University of South Carolina - Columbia

Abstract

Generalized linear models and generalized estimating equations are important research tools in the public health and medical fields, particularly for modeling clinical trials, evaluating preventive measures, and analyzing secondary data. Researchers in these fields need appropriate tools to analyze and model their data correctly. This dissertation focuses on a penalized maximum likelihood estimation method for generalized linear models, measures of association such as the coefficient of determination (R²) and pseudo-R² measures for generalized estimating equations, and a modified quasi-likelihood information criterion for generalized estimating equations. Common problems that arise when estimating generalized linear models include biased estimates, small sample sizes, and complete or quasi-complete separation of data points. To address these problems, the first part of this dissertation introduces a penalized maximum likelihood approach that includes a penalty term directly in the score function before the likelihood is maximized, and then implements this method in statistical software. Generalized estimating equations are an innovative way to model the within-group correlation in longitudinal, clustered, or panel data. Currently, few diagnostic statistics are available for these models. In the second part of this dissertation, we propose an R² and several pseudo-R² measures that help researchers with variable selection and provide a goodness-of-fit measure for the selected model. These calculations are also made accessible to researchers in statistical software. Generalized estimating equations are an extension of the generalized linear model specifically designed to address the within-group correlation. To model the within-group correlation in generalized estimating equations, the researcher must select a working correlation structure.
However, the current quasi-likelihood information criterion for selecting the working correlation structure performs poorly because it tends to favor the independence structure, which assumes there is no within-group correlation. In the last part of this dissertation, we propose a modified quasi-likelihood information criterion that outperforms the current criterion in that it favors the correct structure a large majority of the time. The efficiency of the estimates is improved when the modified quasi-likelihood information criterion is used.