Preregistration and Registration as a New Method for Transparency of External Validation in Artificial Intelligence and Machine Learning Applications to Address Overfitting, Underspecification, and Shortcut Learning
Abstract
Statistics has been defined as the study of how information should be employed to reflect on and give guidance for action in a practical situation involving uncertainty. The essence of uncertainty is that there is more than one possible outcome and the actual outcome is unknown in advance; it is indeterminate. The goal of statistical methods is inference, namely statistical inference: reaching conclusions about populations, or deriving scientific insights, from data collected from a representative sample of that population. Statistical inference provides a mathematical understanding of inference, quantifies the degree of support that data offer for assertions of knowledge, and provides a basis for evaluating actions proposed on the basis of asserted knowledge. Though many statistical techniques are capable of producing predictions about new data, the impetus for their use as a statistical methodology is to make inferences about relationships between variables. Statisticians are divided into two camps based on what objects in the world they view as corresponding to probabilities. Frequentists interpret probability as an objective property, whereas for the Bayesian, probability is a measure of the degree of subjective certainty. How these two schools interpret probability defines their respective modeling processes and the inferences drawn therefrom. Classical/frequentist statistics assumes that population parameters are unknown constants, given that complete and exact knowledge about the sample space is not available. For frequentist estimation of population characteristics, the concept of probability is used to describe the outcomes of measurements. Frequentist inference about the model is therefore indirect, quantifying the agreement between the observed data and data generated by a putative model. In the frequentist approach, pre-existing knowledge about the population from which the sample data are derived is used to design a statistical model of the process or phenomenon and to estimate the model's unknown parameters from the data. The usual focus is on the development of efficient estimators of model parameters and of efficient (powerful) tests of hypotheses about those unknown parameters. Bayesian inference, on the other hand, quantifies uncertainty about the data-generating mechanism through the prior distribution and updates it with the observed data to obtain the posterior distribution. Inference about the model is therefore obtained directly as a probability statement based on the posterior. Bayesians make subjective probability statements (priors), and for the Bayesian, probability is understood as a degree of belief about the values of the parameter under study. Bayesian methods explicitly use probability to quantify uncertainty in inferences based on statistical data analysis; Bayesian analysis is thus the process of fitting a probability model to a set of data and summarizing the result by a probability distribution on the parameters of the model and on unobserved quantities such as predictions for new observations. In a Bayesian setting, a "model" thus consists of a specification of the joint distribution over all of the random variables in the problem and includes any parameters in the model as well as any latent (i.e., hidden) variables, along with the variables whose values are to be observed or predicted.
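To make the contrast concrete, the following minimal sketch compares a frequentist point estimate and confidence interval for a proportion with a Bayesian posterior obtained by updating a conjugate Beta prior with the same data. The counts (20 successes in 50 trials) and the Beta(2, 2) prior are illustrative assumptions, not values drawn from any study discussed here.

```python
# Minimal illustrative sketch: frequentist vs. Bayesian estimation of a proportion.
# The data (20 successes out of 50 trials) and the Beta(2, 2) prior are
# hypothetical values chosen only for illustration.
from scipy import stats

successes, trials = 20, 50

# Frequentist view: the parameter is an unknown constant; report a point
# estimate and a confidence interval describing the long-run behavior of the procedure.
p_hat = successes / trials
se = (p_hat * (1 - p_hat) / trials) ** 0.5
ci_95 = (p_hat - 1.96 * se, p_hat + 1.96 * se)

# Bayesian view: uncertainty about the parameter is itself a distribution;
# the Beta prior is updated by the data into a Beta posterior.
prior_a, prior_b = 2, 2  # subjective prior (assumed)
post = stats.beta(prior_a + successes, prior_b + trials - successes)
posterior_mean = post.mean()
credible_95 = post.interval(0.95)

print(f"Frequentist: p_hat={p_hat:.3f}, 95% CI={ci_95}")
print(f"Bayesian:    posterior mean={posterior_mean:.3f}, 95% credible interval={credible_95}")
```

The point of the sketch is the difference in the object reported: a single constant with a procedure-based interval on the frequentist side, and an entire posterior distribution on the Bayesian side.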
The Bayesian predictive algorithm requires the user's inductive assumptions as input, which are assumptions explicitly codified as the prior and the induced model. Whether or not a probability distribution is accurate for the question presented puts frequentists and Bayesians in the crosshairs of what constitutes an accurate representation. Bayesians often argue that, no matter what the initial probabilities are, after a number of experiments the probabilities assigned by different individuals converge; notwithstanding the subjective nature of the initial probabilities, the long-run results are therefore still, arguably, scientifically objective. The frequentists, among whom the majority of statisticians fall, contend that making any assumption regarding the initial probability of a given hypothesis is not justified; instead, what remains is to restrict oneself to calculating the conditional probability of obtaining a certain statistical outcome given the hypothesis. Statistical inference includes the whole model-building process, which has four main components: (a) model formulation (or model specification), (b) model fitting (or model estimation), (c) model checking (or model validation), and (d) the combination of data from multiple sources (e.g., meta-analysis). Application of modeling for statistical inference therefore concerns the investigative process as a whole, of which model building is itself an iterative part of the overarching statistical problem-solving process. Conversely, in the field of ML, the primary objective is the accuracy of the predictions: the "what" rather than the "how"; the relationship between the individual features and the outcome may be of little relevance so long as the prediction is accurate. Although several key concepts in AI and ML are derived from probability theory and statistics, statistical modeling and statistical inference fall outside the ambit of the forms adopted for predictive modeling and predictive analytics within AI/ML, even as that reliance highlights the essential role of statistical methods in AI/ML. From this classic divide between the frequentists and Bayesians comes the data modeling culture (aka the generative modeling culture) (DMC), which is undergirded by the notion that there is a true model generating the data and thus seeks to develop stochastic models to fit the data that enable inferences about the data-generating mechanism based on the structure of those models, with the aim of finding the truly "best" way to analyze the data. This view is aligned with the frequentists. On the other side falls the algorithmic modeling culture (AMC), which is effectively silent about the underlying mechanism generating the data and allows for the use of many different predictive, complex algorithmic approaches to estimate the function that maps inputs to outputs, with the expectation that the input/output behavior of the learned function will be a close approximation to reality and without concern for the form of the function that emerges, preferring to discuss only the accuracy of the predictions made by different algorithms on various datasets.
In differentiating AI/ML within the AMC from statistical inference within the DMC, Breiman (2001) explained the two cultures' respective goals in analyzing data as follows: (1) prediction (predictive modeling within the algorithmic modeling culture), where the aim is to predict what the responses are going to be to future input variables, and (2) information (statistical inference within the data modeling culture), where the aim is to extract information about how nature is associating the response variables with the input variables. While the research question often precedes the choice of statistical model, perhaps no single criterion exists that alone allows for a clear-cut distinction between the learning approaches used in the two cultures in all cases. Indeed, for decades the two cultures have evolved within partly independent sociological niches. AI/ML can be seen as searching a very large space of possible hypotheses to determine the one that best fits the observed data and any prior knowledge held by the learner. Applications of AI/ML methods are more data-driven owing to particularly flexible models; have scaling properties compatible with high-dimensional data containing myriad input variables; and follow a heuristic agenda that prioritizes useful approximations to patterns in data. Likewise, computation-intensive pattern-learning methods within the AMC have always had a strong focus on prediction on frequently extensive data, with more modest concern for interpretability and the "right" underlying question. The objectives of causal explanation within the hypothesis-centric scientific method, of statistical inference within explanatory statistical modeling, and of associational inference within predictive modeling in the AMC (as evaluated by accuracy metrics) guide not only how research is conducted but also govern the standards, structure, and process by which the research itself is carried out, and they are inextricably intertwined with whether the primary purpose of the research is explanation (and prediction) or associational inference (prediction). Notwithstanding the heavy reliance in ML on statistical techniques, statistical inference in the DMC is more limited in comparison to the flexibility and scope of prediction within the AMC (predictive modeling). These "cultures" demonstrate the divisions among the processes of deductive, inductive, and abductive inference underlying (A) statistical inference and induction in the data modeling culture and (B) the hypothetico-deductive (H-D) method within the scientific method, versus (C) the inductive and abductive inference employed in data science within the algorithmic modeling culture, which extracts knowledge or insights from data without a priori theories and without the condition of theoretical reflection (e.g., via lightweight, theory-less big data analytics).
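The contrast Breiman drew can be sketched in code. The following minimal example, on synthetic data of my own choosing, fits a simple stochastic model whose coefficients are read for information (the data modeling culture) alongside a flexible black-box learner judged solely by held-out predictive accuracy (the algorithmic modeling culture). The dataset and both models are illustrative assumptions, not a reconstruction of any study discussed later.

```python
# Minimal sketch of Breiman's two cultures on the same (synthetic) data:
# the data modeling culture inspects a fitted stochastic model's parameters,
# while the algorithmic modeling culture judges a flexible learner purely by
# predictive accuracy on held-out data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Data modeling culture: a simple model whose coefficients are read as
# information about how inputs relate to the response.
dmc = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("DMC coefficients (interpreted):", np.round(dmc.coef_, 2))

# Algorithmic modeling culture: a black-box learner evaluated only by how
# accurately it predicts responses for new inputs.
amc = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("AMC held-out accuracy:", round(amc.score(X_test, y_test), 3))
```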
The shift from theory-driven modeling (the data modeling culture) to process-driven prediction (the algorithmic modeling culture) is examined here through the lens of disclosures, the proposed Pre-Registration/Registration framework. The framework identifies the epistemological challenges of, and the various needs for, theoretically informing big data analysis in both cultures throughout data acquisition, preprocessing, analysis, and interpretation. It also proposes a practical documentation framework and storage option to mitigate the reproducibility crisis facing AI/ML and to circumvent a default normative explainability standard that would be ineffectual, unenforceable, and tantamount to post hoc notice for regulated subjects. The implementation of research techniques whose aim is explanatory inference and explanation, as opposed to associational inference (prediction), varies considerably across four broad categories of research technique: the hypothetico-deductive model of the scientific method; statistical inference; associational inference; and descriptive statistics. The theory-data relationship within each broad category further varies within and between different fields. These theoretical divergences have meaningful, practical implications in everyday life. A common problem of prediction models is that model performance is overstated or that predictions for new subjects are too extreme (i.e., the observed probability of the outcome is higher than predicted for low-risk subjects and lower than predicted for high-risk subjects). These issues arise as a consequence of assessing a model's performance through its generalization error as defined in the AI/ML field, which is calculated on a dataset that is split from the original dataset used to train and optimize the model but kept separate and used only once for purposes of model evaluation. Thus, an AI/ML model is tested for its performance accuracy only on held-out data split from the originally drawn sample rather than on truly out-of-sample data. This idiosyncratic definition of generalization error employed in the AI/ML field not only entails, by definition, that a model was not externally validated within the meaning of statistical inference (SI) but also cuts against the qualities of reproducibility and transportability as defined in SI within the data modeling culture. The utility of predictive models in SI depends on their generalizability, which can be separated into two components: internal validity (reproducibility) and external validity (transportability). Reproducibility refers to the accuracy a model attains on data instances not considered for model development but belonging to the same population, while transportability entails the model's achievement of accuracy on data sampled from the same underlying data-generating mechanism but drawn from different, related populations. A model has internal validity if it maintains its accuracy when applied to subjects of interest from the same underlying population as the sample used for development. It has external validity if it maintains its accuracy for the outcome variable in populations intrinsically different from the development sample with respect to location, time period, or methods used for data collection. External validity is thus concerned with how well results generalize, or transport, to other contexts. A decisive factor in the DMC in choosing a modeling technique for prediction is the performance of the resulting model at external validation.
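The distinction between internal and external validation can be made operational as follows. This minimal sketch assumes two hypothetical datasets, development_df (the originally drawn sample) and external_df (a separately collected cohort from a different but related population), along with hypothetical feature and outcome column names; it is a sketch of the validation logic, not of any model discussed in the use cases.

```python
# Minimal sketch of internal vs. external validation. `development_df` and
# `external_df` are hypothetical datasets: the first is the originally drawn
# sample, the second is collected separately (different site or time period).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def fit_and_validate(development_df: pd.DataFrame, external_df: pd.DataFrame,
                     features: list, outcome: str) -> dict:
    # Internal validation: split the development sample into train/test;
    # this is the "generalization error" as commonly computed in AI/ML practice.
    train, test = train_test_split(development_df, test_size=0.25, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(train[features], train[outcome])
    internal_auc = roc_auc_score(test[outcome],
                                 model.predict_proba(test[features])[:, 1])

    # External validation: evaluate the frozen model on truly out-of-sample data
    # from a different but related population (transportability in SI).
    external_auc = roc_auc_score(external_df[outcome],
                                 model.predict_proba(external_df[features])[:, 1])
    return {"internal_auc": internal_auc, "external_auc": external_auc}
```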
Internal validation can be enhanced by guarding against fitting random noise in the development data, but ensuring a model has external validity is considerably more difficult in SI. For example, while the internal validity of an SI clinical prediction model can be tested with split-sample, cross-validation, or bootstrapping methods, its external validity can be checked only by testing the model on new patients in new settings. In statistical inference, the relevant metrics as to generalizability are measures of statistical accuracy, which are measures of accuracy for the parameters of a model and include standard errors, biases, and confidence intervals. Prediction error is a different quantity that measures how well a model predicts the response value of a future observation. Statistical inference is based upon these random variables and the underlying stochastic process assumed to generate them. In SI, the key properties of a point estimator are its bias and its precision, where precision is measured by the variance (and standard error), together with its overall accuracy, which is measured by the mean squared error (and root mean squared error). The bias of a parameter estimator is defined as the difference between the expected value of the estimator and the parameter. Accuracy is captured by the mean squared error, which denotes the expected squared deviation of the estimator from the parameter and which combines bias and precision in one overall measure of how close an estimator is to the parameter it is estimating. In contrast, on Bayesian accounts the estimate of a parameter is an estimate of a distribution of values, since the posterior probability is itself a distribution. The priors are therefore often candidate probability distributions rather than specific, single-point values. The formation of predictive models is a central goal of Bayesian analysis, and predictive accuracy is its highest virtue. The success of Bayesian methods depends heavily on the validity of the chosen priors, and their inference process is foundational to the process of developing and testing a prediction model as implemented within the AI/ML field. The generalization error in AI/ML predictive modeling is computed using different metrics and a less rigorous process. The central challenge in ML is that the algorithm must perform well on new, previously unseen inputs; namely, on inputs other than those on which the model was trained. The ability to perform well on previously unobserved inputs is called generalization in the AI/ML field. When training an AI/ML model, an error measure, termed the "training error," is computed from the training dataset using an error metric selected by the model developer. Training then proceeds by reducing the training error with the objective of also decreasing, or at least keeping low, the generalization error; it is this second objective that separates learning in ML from a pure optimization problem. In the AI/ML field, the generalization error is also referred to as the "test error." The generalization error is defined as the expected value of the error on a new input, where the expectation is taken across different possible inputs drawn from the distribution of inputs that the AI/ML system is expected to encounter in practice. Ideally, the generalization error of an ML model would be measured by its performance on a test set of examples collected separately from the training set.
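For clarity, the estimator properties just described can be restated in standard notation, introduced here only for exposition, where $\hat{\theta}$ denotes an estimator of a parameter $\theta$:

\[
\operatorname{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta, \qquad
\operatorname{Var}(\hat{\theta}) = \mathbb{E}\!\left[\left(\hat{\theta} - \mathbb{E}[\hat{\theta}]\right)^{2}\right],
\]
\[
\operatorname{MSE}(\hat{\theta}) = \mathbb{E}\!\left[\left(\hat{\theta} - \theta\right)^{2}\right]
= \operatorname{Var}(\hat{\theta}) + \operatorname{Bias}(\hat{\theta})^{2},
\qquad
\operatorname{RMSE}(\hat{\theta}) = \sqrt{\operatorname{MSE}(\hat{\theta})}.
\]

The decomposition in the second line shows how precision (variance) and bias combine into the single overall accuracy measure described above.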
However, in practice, the generalization error is calculated from a test set of held-out data. This shortcut is justified through assumptions the AI/ML field has adopted from statistical learning (SL) that deem the calculation of a model's generalization error on the test set an unbiased estimate of model performance; the test set is hence, effectively, in-sample data. Calculating the generalization error, which quantifies the model's expected performance on unseen data, on the test set when only the training set can be observed by the model is justified on the assumptions that the test set and training set are independent of each other and identically distributed, i.e., drawn from the same probability distribution (the i.i.d. assumption). The i.i.d. assumptions enable the description of the data-generating process with a probability distribution over a single example; the same distribution is then used to generate every training example and every test example from the shared underlying data distribution. From these assumptions, the AI/ML field has built the convenient theory that the expected training error of a randomly selected model equals the expected test error of that model. On this theory, the only difference between applying an AI/ML model to the test set and applying it to an out-of-sample dataset collected separately from the training set amounts simply to the name assigned to the sampled dataset. In other words, according to the AI/ML field, both error expectations are properly calculated from the same dataset sample so long as certain conditions are met. Preregistration disclosures are related to the idea of external validation in SI. In this proposal (the proposed Pre-Registration/Registration framework), I have incorporated the explanation/prediction debate into preregistration disclosures, which expand the state-of-the-art framework of registered reports and preregistration documentation into disclosures requiring more nuanced, granular detail for areas attaching to the specific context of implementation or deployment of an AI/ML-based predictive or explanatory model, capturing the distinctions between the hypothetico-deductive method (the scientific method); statistical inference in explanatory models and research within the data modeling culture; and associational inference within the algorithmic modeling culture. This proposal expands the disclosure requirements beyond the state of the art, and beyond its predication on the distinction between the confirmatory and exploratory phases of research, to further account for the current, pervasive issues confronting the AI/ML research community and industry that arise from fundamental gaps in AI/ML disclosures and reporting, gaps which have served to create a reproducibility crisis in AI/ML and, by extension, in its applied domains. The proposed preregistration disclosures serve to combat, or at least identify, the role and contributions of underspecification, underestimation, underfitting, overfitting, shortcut learning, the lack of perturbation and corruption robustness to adversarial attacks, and the absence of fairness transfer to out-of-distribution (OOD) data or even in-sample data in producing the incongruencies between reported and actual AI/ML model performance.
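A small simulation conveys why the i.i.d. assumption does the work here. In the sketch below, with a synthetic data-generating process and model chosen purely for illustration, error on an i.i.d. held-out test set tracks error on a fresh i.i.d. sample, but not error on a shifted deployment distribution; none of the distributions correspond to any real dataset discussed in this work.

```python
# Minimal sketch of why the i.i.d. assumption matters: with synthetic data,
# error on an i.i.d. held-out test set tracks error on a freshly drawn i.i.d.
# sample, but not error on a shifted "deployment" distribution.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample(n, shift=0.0):
    # Illustrative data-generating process; `shift` moves the input distribution.
    X = rng.normal(loc=shift, size=(n, 5))
    p = 1 / (1 + np.exp(-(X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]))))
    return X, (rng.random(n) < p).astype(int)

X_train, y_train = sample(5000)             # training data
X_test, y_test = sample(2000)               # held-out data, same distribution
X_new, y_new = sample(2000)                 # fresh i.i.d. sample
X_shift, y_shift = sample(2000, shift=1.5)  # shifted deployment population

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out test error:", round(1 - model.score(X_test, y_test), 3))
print("fresh i.i.d. error:  ", round(1 - model.score(X_new, y_new), 3))
print("shifted-data error:  ", round(1 - model.score(X_shift, y_shift), 3))
```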
The disclosure framework accounts for the distinctions between the data modeling and algorithmic modeling cultures and is consistent with the scientific method in requiring model developers to articulate and justify their design decisions and choices. This requirement operates as a counterweight to overhype and to other researcher degrees of freedom that continue to contribute to overfitting, underfitting, and underspecification, increase bias by diluting fairness transfer on OOD data, and undercut the prevention and identification of shortcut learning. These substantive issues, which currently undermine the integrity and reliability of reported AI/ML model performance, coalesce within the bias-variance tradeoff and comprise a burgeoning epidemic facing modern society, arising from irresponsible, sloppy, and misguided objectives held by industry and academia in the AI/ML field. The proposed disclosure framework requires transparency through a process that is interpretable; and yes, "interpretability" is selected in light of the eternal, careening debates within the AI/ML field over the import and meaning of terms it has no seeming genuine impetus to embrace or understand. The proposed framework incorporates new disclosures that go beyond the requirements of validation in AI/ML (the algorithmic modeling culture) by incorporating the tenets of statistical inference models in the data modeling culture, which require internal validation (reproducibility) and external validation (transportability), assess realistic model performance, and identify and quantify distributional shifts, which manifest as distortions of the calibration of risk predictions, and performance drift, where poor predictive performance can arise from deficiencies in the development method of the original models, changes in population characteristics over time, and improvements in the intervention procedure. The disclosure framework mandated by the proposed Pre-Registration/Registration framework is a step in the right direction in improving upon the state of the art in AI/ML registration and preregistration by requiring symmetry between the validation process in SI and the validation of AI/ML models and systems. We will explore how this proposed framework addresses the explainability-interpretability-transparency debate using practical disclosures that are certainly more efficient than a default explainability rule and defensive litigation, whether in the enforcement or tort defense purview. The framework's methodology is validated through a series of use cases of AI/ML models in the applied healthcare domain, which are summarized next. In this vein, we shall explore an example of these problem forms arising in the "wild" with a 2018 Google Health study, published in a Nature journal, of a deep learning model that learned from raw EHR data to make predictions for three distinct patient prediction tasks (mortality, readmission, and length of stay) and that was trained, validated, and tested on datasets from two individual hospital systems located in two different states within the United States. The study's disclosures of the model development process were woefully inadequate and exemplify the source of the AI/ML reproducibility crisis this proposed framework seeks to address. Fully extended coverage is set forth in Section 5.14 et seq. Use Case One shows how beating the benchmark encapsulates the primary thrust of Google's 2018 deep learning model on electronic health records.
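As one concrete illustration of the kind of quantitative disclosure contemplated here, the sketch below computes the calibration slope, calibration-in-the-large, and Brier score for a frozen risk model; comparing the report on the development test set with the report on an external or later cohort is one way to quantify calibration drift under distributional shift. The function and its inputs (arrays of observed outcomes and predicted risks) are hypothetical and not taken from any of the studies discussed.

```python
# Minimal sketch of the calibration checks referenced above: calibration slope,
# calibration-in-the-large (intercept with slope fixed at 1), and Brier score
# for a frozen risk model. `y_true` / `p_pred` are hypothetical input arrays.
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import brier_score_loss

def calibration_report(y_true: np.ndarray, p_pred: np.ndarray) -> dict:
    # Logits of the predicted risks, clipped away from 0 and 1 for stability.
    p = np.clip(p_pred, 1e-6, 1 - 1e-6)
    logit_p = np.log(p / (1 - p))

    # Calibration slope: logistic regression of the outcome on the logit risks.
    slope_fit = sm.GLM(y_true, sm.add_constant(logit_p),
                       family=sm.families.Binomial()).fit()
    # Calibration-in-the-large: intercept-only model with the logit risks as
    # an offset, i.e., the slope is fixed at 1.
    itl_fit = sm.GLM(y_true, np.ones_like(logit_p),
                     family=sm.families.Binomial(), offset=logit_p).fit()

    return {
        "calibration_slope": float(slope_fit.params[1]),
        "calibration_in_the_large": float(itl_fit.params[0]),
        "brier_score": float(brier_score_loss(y_true, p_pred)),
    }

# Running calibration_report on the development test set and again on an
# external or later cohort quantifies the drift the disclosures should report.
```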
In the end, making deep learning models more generalizable and less susceptible to bias may require national or international collaborations to collect data (including data from patients varying in race, ethnicity, language, and socioeconomic status) and to standardize and integrate data from different sources. As we shall see, without such oversight, any notion of AI/ML-specific generalizability will be supplied only by industry expertise within AI/ML, which often encodes values that fail to incorporate or account for the relevant domain's standard of care within which the model is operationalized, with attendant opacity in the absence of any meaningful transparency and with no ability to check (or even ascertain) the process by which a clinical model was in fact developed and validated. This contravenes what is required for a clinician to have confidence that the model is really capturing the correct patterns in the target domain, and not just patterns that were valid in past data but will not be valid in future data; model interpretability is thus not just an apotheosis but an imprecise term that encompasses the necessity for a clinician to have (and to demand means by which to form) a global, or at least a local, objectual understanding of the model. Use Case Two examines a 2020 Google Health-sponsored research study published in Nature, in which Google Health evaluated the performance of its deep learning model for identifying breast cancer in screening mammograms using two large clinical datasets from the U.S. and U.K. (McKinney et al.). Fully extended coverage is set forth in Section 5.15 et seq. The predictions of the AI system were compared to those made by readers in routine clinical practice (that is, to the diagnostic results from the screening protocols implemented in the US and UK by the clinicians at the time the screenings were performed) and were represented by the Google authors as showing that the performance of the AI system exceeded that of individual radiologists. According to its critics, the Google team provided so little information about its code and how it was tested that the study amounted to nothing more than a promotion of proprietary tech. Industry research publications form a legitimizing source of the overhype inundating the commercial market and countenance the relegation of scientific principles and methodology to a position subordinate to profits and represented efficiency. The replying Google authors' asserted justifications for publishing research findings that lacked transparency and were not reproducible are important to appraise in the context of scientific research as well as in the public sector, where the ability to reconstitute a particular outcome is considered a public good. Use Case Two illustrates how implementation and evaluation assessments are heightened in the applied domain and is a case in point of why the transparency demanded by the proposed Pre-Registration/Registration framework is necessary even when a particular AI/ML application is presented as already having passed the rigor of peer review.
These cases were selected in light of the contextual environment of the AI/ML application and in light of Google's later role in finally acknowledging, in its 2022 published study, issues with underspecification arising from the process by which its own AI/ML prediction models were validated, after years of work from multidisciplinary practitioners and academics in applied domains castigating the AI/ML "validation" approach as inadequate and without a theoretical basis and identifying a wide set of problems therewith. Moreover, in 2023, Google also found that its own research publications evidenced the failure of fairness transfer, which will be explored in Chapter 6. The first use case is Google's EHR patient outcome predictive model, and the second is Google's 2020 breast cancer screening prognostic predictive model. Why the focus on Google? For a start, Google itself explains that its business model is a "Hybrid Research Model," wherein the line between research and engineering activities is blurred such that research with the potential to impact the world through "Google's products and services" is encouraged. Indeed, the 2020 predictive model for breast cancer screening introduced by Google Health cements the connection between Google's research and the commercialization of its research interests. Google highlights that its influence on the academic community, other companies and industries, and the field of computer science has most traditionally come from its research publications, and Google has noted that it continues to publish research results at increasing rates. Google's leveraging of its influence is exemplified by the publication of more than 300 peer-reviewed health and medicine related studies by Google Health, which also employs more than 1,000 health researchers who are experts on machine intelligence. Accordingly, Use Case Three examines these two Google Research publications. The first Google research study, D'Amour et al. (2022), tested two research pipelines intended as precursors for medical applications: an ophthalmology pipeline for building models that detect diabetic retinopathy and referable diabetic macular edema from retinal fundus images, and a dermatology pipeline for building models that recognize common dermatological conditions from photographs of skin. In their experiments, the authors limited their consideration to pipelines that were validated only on randomly held-out data. Fully extended coverage is set forth in Section 6.11. Exemplar A of Use Case Three, an examination of Gulshan et al. (2016), presents the issues of underspecification, overfitting, and shortcut learning, as well as the importance of understanding a predictive model (or the consequences of the lack thereof) and hence the necessity of explanatory modeling. Exemplar A of Use Case Three was also neither replicable nor reproducible, which was countenanced by Google and the AI/ML field despite the fact that the published findings constituted a well-known development and validation study, published in the Journal of the American Medical Association and cited 906 times between its publication in November 2016 and March 2019. Despite the notoriety and high impact of the study within the medical field, the source code had not been published, and no knowledge of the algorithm or its optimization values had been published either, as of the time a replication attempt was initiated during 2018, the findings of which were published in 2019.
The hyperparameters of the study were not published contemporaneously with the study itself. We turn next to Exemplar B of Use Case Three, which depicts the problems of underspecification and shortcut learning and the failure of fairness transfer due to systematic bias. Exemplar B within Use Case Three pertains to the use of DL in the early detection of skin cancer in publications by Esteva et al. (2017) and Liu et al. (2020), both published in Nature. The study of Liu et al. was authored by 22 individuals who were either employees of or consultants to Google or Google Health. The findings of Esteva et al. (2017) and Liu et al. (2020) were both called into question by the reassessments completed in two Google-funded research studies: Schrouff et al. (2023) and D'Amour et al. (2022). Esteva et al. (2017) published their research in the preeminent journal Nature. The authors described how the CNN they had developed exceeded expert human dermatologists in three key diagnostic tasks of identifying cancer. D'Amour et al. (2022) concluded that the predictors in the medical imaging cases they had stress tested, namely the datasets, model architecture, and hyperparameters underlying the DL prediction model of the Original 2016 Gulshan Study as adjusted by Krause et al. (2018), had processed images in a systematically different way that only became apparent when the models were evaluated on the held-out camera type. Gulshan et al. (2016) disclosed in the Original 2016 Gulshan Study that the exact features used to make the DL model's predictions were unknown to them. Ultimately, a DL model simply learns from what it is presented with. In the 2022 reassessment by D'Amour et al., the Google authors found that the correlations between the DL model's inputs and labels for predicting diabetic retinopathy comprised, in part, spurious correlations resulting from dataset-specific artefacts, namely the camera type used to capture the images in the training dataset. One of the primary sources of error is thus akin to an example of shortcut learning.