Date of Award


Document Type

Open Access Dissertation


Computer Science and Engineering

First Advisor

Yan Tong


In spite of the great progress achieved on posed facial displays under controlled image acquisition, the performance of facial action unit (AU) recognition degrades significantly for spontaneous facial displays. Recognizing AUs accompanied by speech is even more challenging: such AUs are generally activated at low intensity, with subtle changes in facial appearance and geometry, and often introduce ambiguity in detecting other co-occurring AUs, e.g., by producing non-additive appearance changes. Current AU recognition systems utilize information extracted only from the visual channel. However, the audio channel is highly correlated with the visual channel in human communication. Thus, we propose to exploit both audio and visual information for AU recognition.

First, a feature-level fusion method combining audio and visual features is introduced. Features are independently extracted from the visual and audio channels and then temporally aligned to handle the difference in time scales and the time shift between the two signals. These temporally aligned features are integrated via feature-level fusion for AU recognition.

Second, a novel approach is developed that recognizes speech-related AUs exclusively from audio signals, based on the fact that facial activities are highly correlated with voice during speech. The dynamic and physiological relationships between AUs and phonemes are modeled through a continuous time Bayesian network (CTBN); AU recognition is then performed by probabilistic inference via the CTBN model.

Third, a novel audiovisual fusion framework is developed that aims to make the best use of visual and acoustic cues in recognizing speech-related facial AUs. In particular, a dynamic Bayesian network (DBN) is employed to explicitly model the semantic and dynamic physiological relationships between AUs and phonemes, as well as measurement uncertainty. AU recognition is then conducted by probabilistic inference via the DBN model.

To evaluate the proposed approaches, a pilot AU-coded audiovisual database was collected. Experiments on this dataset have demonstrated that the proposed frameworks yield significant improvements in recognizing speech-related AUs compared with state-of-the-art visual-based methods. Even larger improvements are achieved for AUs whose visual observations are impaired during speech.
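The feature-level fusion step described above can be sketched as follows. This is a minimal illustration, not the dissertation's actual pipeline: the frame rates, feature dimensions, interpolation scheme, and time shift are all assumptions chosen for the example.

```python
# Illustrative sketch of feature-level audiovisual fusion with temporal
# alignment. Frame rates, feature dimensions, and the linear-interpolation
# alignment are assumptions for this example only.
import numpy as np

def align_and_fuse(visual_feats, audio_feats,
                   visual_fps=30.0, audio_fps=100.0, shift_s=0.0):
    """Resample audio features onto the visual frame timeline (handling the
    difference in time scales and an optional time shift between the two
    signals), then concatenate the streams frame by frame."""
    n_frames = visual_feats.shape[0]
    # Timestamps of the visual frames, shifted to compensate for any
    # audio lead/lag relative to the video.
    t_visual = np.arange(n_frames) / visual_fps + shift_s
    t_audio = np.arange(audio_feats.shape[0]) / audio_fps
    # Linearly interpolate each audio feature dimension at the visual timestamps.
    aligned_audio = np.stack(
        [np.interp(t_visual, t_audio, audio_feats[:, d])
         for d in range(audio_feats.shape[1])],
        axis=1)
    # Feature-level fusion: concatenate visual and aligned audio features.
    return np.concatenate([visual_feats, aligned_audio], axis=1)

# Toy usage: 2 s of video at 30 fps (e.g., 20-D geometric features) and
# audio at 100 fps (e.g., 13-D MFCC-like features).
visual = np.random.rand(60, 20)
audio = np.random.rand(200, 13)
fused = align_and_fuse(visual, audio)
print(fused.shape)  # (60, 33)
```

The fused per-frame vectors would then be fed to any frame-level AU classifier; the alignment step is what lets the two signals, sampled at different rates, be combined at all.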


© 2018, Zibo Meng