Date of Award


Document Type

Open Access Dissertation


Department

Computer Science and Engineering

First Advisor

Song Wang


Abstract

Interest detection is the task of detecting an object, event, or process that draws attention. In this dissertation, we focus on interest detection in images, videos, and sets of multiple videos. Interest detection in a single image or video is closely related to visual attention; interest detection across multiple videos, however, must consider all the videos as a whole rather than the attention in each video independently.

Visual attention is an important mechanism of human vision. The computational modeling of visual attention has recently attracted much interest in the computer vision community, mainly because it helps find the objects or regions that efficiently represent a scene and thus aids in solving complex vision problems such as scene understanding. In this dissertation, we first introduce a new computational visual-attention model for detecting regions of interest in static images and videos. This model constructs a saliency map for each image and takes the region with the highest saliency value as the region of interest. Specifically, we use the Earth Mover's Distance (EMD) to measure the center-surround difference in the receptive field. Furthermore, we propose two steps of biologically inspired nonlinear operations for combining different features: subsets of basic features are first combined into a set of super features using the Lm-norm, and the super features are then combined using the Winner-Take-All mechanism. We then extend the proposed model to construct dynamic saliency maps from videos by computing the center-surround difference in the spatio-temporal receptive field.
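The two main ingredients above can be sketched in a few lines. The sketch below is illustrative only, with assumed details (1-D feature histograms, a pointwise-maximum reading of Winner-Take-All, and the function names `emd_1d` and `combine_features`) that are not specified in the abstract. For 1-D histograms of equal mass, EMD reduces to the L1 distance between cumulative distributions, which keeps the example self-contained.

```python
import numpy as np

def emd_1d(h1, h2):
    # For 1-D histograms of equal total mass, the Earth Mover's Distance
    # equals the L1 distance between their cumulative distributions.
    h1 = np.asarray(h1, dtype=float); h2 = np.asarray(h2, dtype=float)
    h1 = h1 / h1.sum(); h2 = h2 / h2.sum()
    return float(np.abs(np.cumsum(h1) - np.cumsum(h2)).sum())

def center_surround_saliency(center_hist, surround_hist):
    # Saliency of a location: EMD between the feature histogram of the
    # center region and that of its surround.
    return emd_1d(center_hist, surround_hist)

def combine_features(feature_maps, groups, m=2.0):
    # Step 1: combine each subset of basic feature maps into a "super
    # feature" with the Lm-norm; Step 2: combine super features with a
    # Winner-Take-All rule (here read as a pointwise maximum).
    super_feats = []
    for idx in groups:
        stack = np.stack([feature_maps[i] for i in idx])
        super_feats.append(np.sum(stack ** m, axis=0) ** (1.0 / m))
    return np.max(np.stack(super_feats), axis=0)
```

For videos, the same center-surround computation would be applied over a spatio-temporal neighborhood rather than a purely spatial one.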

Motivated by the natural relation between visual saliency and objects/regions of interest, we then propose an algorithm to separate infrequently moving foreground from a background with frequent local motions, in which saliency detection is used to distinguish the foreground (object/region of interest) from the background. Traditional motion detection usually assumes that the background is static while the foreground objects move most of the time. In practice, however, especially in surveillance, foreground objects may move only infrequently; for example, a person may stand in the same place most of the time. Meanwhile, the background may contain frequent local motions, such as trees and grass waving in the breeze. Such complexities may prevent existing background-subtraction algorithms from correctly identifying the foreground objects. In this dissertation, we propose a background-subtraction approach that can detect foreground objects exhibiting either frequent or infrequent motion.
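A minimal sketch of this idea, under assumptions not stated in the abstract (a single running-average background model per pixel, a fixed difference threshold `tau`, and a saliency threshold `s_thresh`; the function name `update_background` is hypothetical), could look like:

```python
import numpy as np

def update_background(frame, bg_model, saliency,
                      tau=20.0, s_thresh=0.5, alpha=0.05):
    # A pixel is foreground if it deviates from the background model
    # OR lies in a high-saliency region; the saliency term keeps a
    # static (infrequently moving) person labeled as foreground.
    diff = np.abs(frame.astype(float) - bg_model)
    fg = (diff > tau) | (saliency > s_thresh)
    # Update the model only at confident background pixels, so frequent
    # local motions (e.g., waving grass) are gradually absorbed into the
    # background while salient foreground is not.
    bg_model = np.where(fg, bg_model, (1 - alpha) * bg_model + alpha * frame)
    return fg, bg_model
```

The key design point is that the update is gated by both motion and saliency: motion alone would eventually absorb a motionless person into the background model.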

Finally, we focus on the task of locating the co-interest person in multiple temporally synchronized videos taken by multiple wearable cameras. More specifically, we propose a co-interest detection algorithm that can find the persons who draw attention from most camera wearers, even when multiple persons of similar appearance are present in the videos. Our basic idea is to exploit the motion pattern, location, and size of the persons detected in the different synchronized videos, and to use these cues to correlate the detected persons across videos: a person detected in one video may be the same person detected in another video at the same time. We formulate this as a Conditional Random Field (CRF), taking each frame as a node and the detected persons as the states at that node. We collect three sets of wearable-camera videos for testing the proposed algorithm, where each set consists of six temporally synchronized videos.
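For a single video, the frame-as-node, person-as-state formulation can be decoded with standard dynamic programming. The sketch below is a generic chain-CRF (Viterbi) decoder over cost matrices, not the dissertation's full multi-video model: the unary costs would encode how interesting a detected person looks (e.g., size and location), and the pairwise costs would encode motion/location/size consistency between consecutive frames.

```python
import numpy as np

def viterbi(unary, pairwise):
    # unary[t, s]: cost of assigning state (detected person) s to frame t.
    # pairwise[t][s, s2]: transition cost from state s at frame t to
    # state s2 at frame t+1. Returns the minimum-cost state sequence.
    T, S = unary.shape
    cost = unary[0].copy()
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        trans = cost[:, None] + pairwise[t - 1]   # all (prev, cur) pairs
        back[t] = trans.argmin(axis=0)            # best predecessor
        cost = trans.min(axis=0) + unary[t]
    path = [int(cost.argmin())]
    for t in range(T - 1, 0, -1):                 # backtrack
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

Extending this to six synchronized videos would couple the per-video chains through cross-video consistency terms, which is where the joint "all videos as a whole" reasoning enters.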