Hatim Alsuwat

Date of Award

Fall 2019

Document Type

Open Access Dissertation


Computer Science and Engineering

First Advisor

Csilla Farkas

Second Advisor

Marco Valtorta


In this dissertation, we investigate and address two kinds of data integrity threats. We first study the limitations of secure cryptographic shuffling algorithms regarding preservation of data dependencies. We then study the limitations of machine learning models regarding concept drift detection. We propose solutions to address these threats.

Shuffling Algorithms have been used to protect the confidentiality of sensitive data. However, these algorithms may not preserve data dependencies, such as functional de- pendencies and data-driven associations. We present two solutions for addressing these shortcomings: (1) Functional dependencies preserving shuffle, and (2) Data-driven asso- ciations preserving shuffle. For preserving functional dependencies, we propose a method using Boyce-Codd Normal Form (BCNF) decomposition. Instead of shuffling the original relation, we recommend to shuffle each BCNF decomposition. The final shuffled rela- tion is constructed by joining the shuffled decompositions. We show that our approach is lossless and preserves functional dependencies if the BCNF decomposition is dependency preserving. For preserving data-driven associations, we generate the transitive closure of the sets of attributes that are associated. Attributes of each set are bundled together during shuffling.

Concept drift is a significant challenge that greatly influences the accuracy and relia- bility of machine learning models. There is, therefore, a need to detect concept drift in order to ensure the validity of learned models. We study the issue of concept drift in the context of discrete Bayesian networks. We propose a probabilistic graphical model frame- work to explicitly detect the presence of concept drift using latent variables. We employ latent variables to model real concept drift and uncertainty drift over time. For modeling real concept drift, we propose to monitor the mean of the distribution of the latent variable over time. For modeling uncertainty drift, we suggest to monitor the change in belief of the latent variable over time, i.e., we monitor the maximum value that the probability den- sity function of the distribution takes over time. We also propose a probabilistic graphical model framework that is based on using latent variables to provide an explanation of the detected posterior probability drift across time.

Our results show that neither cryptographic shuffling algorithms nor machine learning models are robust against data integrity threats. However, our proposed approaches are capable of detecting and mitigating such threats.