Date of Award

Spring 2019

Document Type

Open Access Thesis


Department

Computer Science and Engineering

First Advisor

Lannan Luo


Binary code analysis is important for understanding programs without access to the original source code, which is common with proprietary software. Analyzing binaries can be challenging given their high variability: due to growth in tech manufacturers, source code is now frequently compiled for multiple instruction set architectures (ISAs); however, there is no formal dictionary that translates between their assembly languages. The difficulty of analysis is further compounded by different compiler optimizations and obfuscated malware signatures. Such minutiae mean that some vulnerabilities may only be detectable at a fine-grained level. Recent strides in machine learning, particularly in Natural Language Processing (NLP), may provide a solution: deep learning models can process large texts and encode the semantics of individual words into vectors called word embeddings, which are convenient for processing and analyzing text. By treating assembly as a language and instructions as words, we leverage NLP ideas to generate individual instruction embeddings. Specifically, we aim to improve upon current models that either support only a single architecture or suffer from performance issues when handling multiple architectures. This research presents a cross-architecture instruction embedding model that jointly encodes instruction semantics from multiple ISAs, where similar instructions within and across architectures embed closely together. Results show that our model is accurate in extracting semantics from binaries alone, and that our embeddings capture semantic equivalences across multiple architectures. When combined, these instruction embeddings can represent the meaning of functions or basic blocks; thus, this model may prove useful for cross-architecture bug, malware, and plagiarism detection.
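
To make the idea of "similar instructions embed closely together" concrete, the following is a minimal sketch and not the thesis model. It assumes a small table of hand-written, hypothetical 3-dimensional vectors (`TOY_EMBEDDINGS`), uses cosine similarity as the closeness measure, and sums instruction vectors to form a basic-block vector. The vector values, the instruction key format, and the choice of summation are all assumptions made for illustration.

```python
import math

# Hypothetical toy vectors, hand-written for illustration only.
# They are NOT outputs of the model described in the abstract.
TOY_EMBEDDINGS = {
    "x86:mov eax, ebx": [0.9, 0.1, 0.0],
    "arm:mov r0, r1": [0.8, 0.2, 0.1],
    "x86:add eax, 1": [0.1, 0.9, 0.0],
    "arm:add r0, r0, #1": [0.2, 0.8, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def block_embedding(instructions, table):
    """Sum the instruction vectors of a basic block.

    Summation is one simple composition choice assumed here; the abstract
    only says that embeddings can be combined.
    """
    dim = len(next(iter(table.values())))
    total = [0.0] * dim
    for ins in instructions:
        for i, x in enumerate(table[ins]):
            total[i] += x
    return total


# Two blocks that do the same thing on different architectures.
x86_block = ["x86:mov eax, ebx", "x86:add eax, 1"]
arm_block = ["arm:mov r0, r1", "arm:add r0, r0, #1"]

# Similarity of equivalent instructions across architectures.
mov_sim = cosine(TOY_EMBEDDINGS["x86:mov eax, ebx"],
                 TOY_EMBEDDINGS["arm:mov r0, r1"])
# Similarity of unrelated instructions across architectures.
cross_sim = cosine(TOY_EMBEDDINGS["x86:mov eax, ebx"],
                   TOY_EMBEDDINGS["arm:add r0, r0, #1"])
# Similarity of the two summed block vectors.
block_sim = cosine(block_embedding(x86_block, TOY_EMBEDDINGS),
                   block_embedding(arm_block, TOY_EMBEDDINGS))

print(f"mov vs mov: {mov_sim:.3f}")
print(f"mov vs add: {cross_sim:.3f}")
print(f"block vs block: {block_sim:.3f}")
```

With these toy values, the two `mov` instructions score higher than the `mov`/`add` pair, and the two blocks score close to 1. In a real setting, the table would be replaced by vectors learned from disassembled binaries.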