Date of Award: 2019
Document Type: Open Access Dissertation
Department: Computer Science and Engineering
In this research, we address the impact of data integrity on machine learning algorithms. We study how an adversary can corrupt Bayesian network structure learning algorithms by inserting contaminated data items into the training dataset. In particular, we investigate the resilience of two commonly used Bayesian network structure learning algorithms, the PC and LCD algorithms, against data poisoning attacks that aim to corrupt the learned Bayesian network model.
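Constraint-based learners such as PC add or remove edges based on statistical independence tests, so a handful of contaminated records can flip such a test and change the learned structure. The sketch below is illustrative only and is not the dissertation's code: the variable names, sample sizes, and poisoning rule are hypothetical, and a plain chi-square test stands in for PC's full conditional-independence machinery.

```python
# Illustrative sketch: flipping the marginal-independence decision that
# constraint-based learners (e.g., PC) rely on, by injecting poisoned rows.
import pandas as pd
from scipy.stats import chi2_contingency

# Clean data: two binary variables A and B that are exactly independent
# (all four value combinations appear equally often).
combos = [(a, b) for a in (0, 1) for b in (0, 1)]
clean = pd.DataFrame(combos * 125, columns=["A", "B"])  # 500 rows

def independent(df, x, y, alpha=0.05):
    """Chi-square test of independence between two discrete columns."""
    _, p, _, _ = chi2_contingency(pd.crosstab(df[x], df[y]))
    return p > alpha

print(independent(clean, "A", "B"))      # True: no edge A-B is warranted

# Poisoning: append perfectly correlated cases until the test rejects
# independence, which would make a PC-style learner add a spurious edge A-B.
poison = pd.DataFrame({"A": [1] * 60, "B": [1] * 60})
poisoned = pd.concat([clean, poison], ignore_index=True)
print(independent(poisoned, "A", "B"))   # False: the spurious edge appears
```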
Data poisoning attacks are among the most important emerging security threats against machine learning systems. These attacks aim to corrupt machine learning models by contaminating datasets during the training phase. The lack of resilience of Bayesian network structure learning algorithms against such attacks leads to inaccuracies in the learned network structure.
In this dissertation, we propose two subclasses of data poisoning attacks against Bayesian network structure learning algorithms: (1) model invalidation attacks, in which an adversary poisons the training dataset so that the learned Bayesian model is invalid, and (2) targeted change attacks, in which an adversary poisons the training dataset to achieve a specific change in the learned structure. We also define a novel measure of the strength of the links between variables in discrete Bayesian networks. We use this measure to find vulnerable sub-structures of the Bayesian network model: the easiest links to break and the most believable links to add to the model. In addition to one-step attacks, we define long-duration (multi-step) data poisoning attacks, in which a malicious attacker sends contaminated cases over a period of time. To detect data poisoning attacks, we propose to use a distance measure between Bayesian network models together with the value of data conflict. We propose a 2-layered framework that detects both traditional one-step and sophisticated long-duration data poisoning attacks. Layer 1 enforces “reject on negative impacts” detection; i.e., input that changes the Bayesian network model is labeled potentially malicious. Layer 2 aims to detect long-duration attacks; i.e., it flags observations in the incoming data that conflict with the original Bayesian model.
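The dissertation's link strength measure is its own contribution and is not reproduced here; as a generic stand-in, empirical mutual information between adjacent variables is a common proxy and illustrates the idea of ranking a model's edges so that the weakest (easiest to break by poisoning) can be identified. The edge list and data below are hypothetical.

```python
# Generic stand-in for a link strength measure: empirical mutual information.
import numpy as np
import pandas as pd

def mutual_information(df: pd.DataFrame, x: str, y: str) -> float:
    """Empirical mutual information I(X;Y), in bits, between two discrete columns."""
    joint = pd.crosstab(df[x], df[y], normalize=True)  # joint distribution
    px = joint.sum(axis=1)                             # marginal of X
    py = joint.sum(axis=0)                             # marginal of Y
    mi = 0.0
    for xv in joint.index:
        for yv in joint.columns:
            pxy = joint.loc[xv, yv]
            if pxy > 0:
                mi += pxy * np.log2(pxy / (px[xv] * py[yv]))
    return mi

# Rank the model's edges from weakest to strongest (hypothetical example).
data = pd.DataFrame({"S": [0, 0, 1, 1, 1, 0, 1, 0],
                     "C": [0, 0, 1, 1, 0, 0, 1, 1],
                     "R": [0, 1, 0, 1, 1, 0, 1, 0]})
edges = [("S", "C"), ("C", "R")]
print(sorted(edges, key=lambda e: mutual_information(data, *e)))
```

The 2-layered detection idea can likewise be sketched under assumed simplifications: Layer 1 re-learns the structure with the incoming batch included and rejects the batch if the structure changes (here a pairwise chi-square skeleton stands in for full PC learning and for the dissertation's model distance measure), while Layer 2 scores each incoming case with a Jensen-style conflict value, conf(e) = log(∏ᵢ P(eᵢ) / P(e)), estimated from baseline frequencies rather than from the fitted network. All function names and thresholds are hypothetical.

```python
# A minimal sketch of the 2-layered detection framework, under the
# simplifying assumptions described above.
from itertools import combinations
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def skeleton(df, alpha=0.05):
    """Edges = variable pairs whose chi-square test rejects independence."""
    edges = set()
    for x, y in combinations(df.columns, 2):
        _, p, _, _ = chi2_contingency(pd.crosstab(df[x], df[y]))
        if p < alpha:
            edges.add(frozenset((x, y)))
    return edges

def layer1_rejects(baseline, batch):
    """Layer 1, 'reject on negative impacts': does the batch alter the structure?"""
    merged = pd.concat([baseline, batch], ignore_index=True)
    return skeleton(baseline) != skeleton(merged)

def conflict(baseline, case, eps=1e-6):
    """Jensen-style conflict: positive when the case fits the marginals
    much better than the joint, i.e., it conflicts with the model."""
    marg = np.prod([(baseline[c] == v).mean() + eps for c, v in case.items()])
    joint = (baseline == pd.Series(case)).all(axis=1).mean() + eps
    return float(np.log(marg / joint))

def layer2_flags(baseline, batch, threshold=0.0):
    """Layer 2: flag incoming cases whose conflict exceeds the threshold,
    to catch long-duration attacks that slip past Layer 1 one case at a time."""
    return [i for i, row in batch.iterrows()
            if conflict(baseline, row.to_dict()) > threshold]
```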
Our empirical results show that Bayesian networks are not robust against data poisoning attacks. However, our framework can be used to detect and mitigate such threats.
Alsuwat, E. (2019). Challenges in Large-Scale Machine Learning Systems: Security and Correctness. (Doctoral dissertation). Retrieved from https://scholarcommons.sc.edu/etd/5596