Date of Award

8-16-2024

Document Type

Open Access Dissertation

Department

Computer Science and Engineering

First Advisor

Song Wang

Abstract

Scene text refers to arbitrary text appearing in images captured by cameras in the real world. Detecting and recognizing such text in complex images plays a crucial role in computer vision, with potential applications in scene understanding, information retrieval, robotics, autonomous driving, and more. Despite the notable progress made by existing deep-learning methods, accurate text detection and recognition remain challenging for robust real-world applications. The challenges stem from: 1) diverse text shapes, fonts, colors, styles, and layouts; 2) the countless combinations of characters with unfixed attributes that must be detected completely, coupled with background interference that obscures character strokes and shapes during recognition; and 3) the need to effectively coordinate multiple sub-tasks in end-to-end learning. The fundamental issue is the lack of a sufficiently discriminative representation for the detection task, which must locate exact, complete words with unfixed attributes, and for the recognition task, which must differentiate similar characters within words. Our research aims to address these challenges and enhance scene text detection and recognition by improving the discriminative representation of text. In this study, we focus on two interconnected problems: 1) Scene Text Recognition (STR), which recognizes text in scene images, and 2) Scene Text Spotting (STS), which simultaneously detects and recognizes multiple texts in scene images.

In Scene Text Recognition (STR), text variations and complex backgrounds remain significant hurdles because of their impact on text feature representation. Many existing methods attempt to mitigate these issues with attentional regions, bounding boxes, or polygons, yet the text regions they identify often still contain unwanted background interference. In response, we propose a Background-Insensitive Network (BINet) that explicitly incorporates a text Semantic Segmentation Network (SSN) to enhance the text representation and suppress background interference, without requiring extensive pixel-level annotations in the STR training data. To make full use of the semantic cues, we introduce novel segmentation refinement and embedding modules that refine the text masks and strengthen the visual features. Experimental results demonstrate that the proposed method significantly improves text recognition against complex backgrounds, achieving state-of-the-art performance across multiple public datasets.

In tackling Scene Text Spotting (STS), we introduce two novel developments. Because the task requires a multi-task model that both locates and recognizes texts in scenes, the coordination of the sub-tasks strongly affects each of them and, in turn, the overall performance. Current end-to-end text spotters commonly run the sub-tasks as independent sequential pipelines, and this unidirectional design leads to information loss and error propagation between sub-tasks. In light of these observations, we present CommuSpotter, designed to enhance multi-task communication by explicitly and concurrently exchanging compatible information throughout the scene text spotting process. To address task-specific inconsistencies, we introduce a Conversation Mechanism (CM) that extracts and exchanges expertise among the sub-tasks. We also incorporate text semantic segmentation to handle text variations in the recognition sub-task, again without extra annotations. Experimental results demonstrate that the improved text representation for both sub-tasks enhances performance across public datasets.
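The segmentation-guided enhancement shared by BINet and the recognition branch of CommuSpotter can be pictured with a minimal PyTorch-style sketch: a lightweight head predicts a text/background mask from the visual features, and the mask is used to re-weight and enrich those features. The module names, channel sizes, and fusion rule below are illustrative assumptions, not the dissertation's actual modules.

```python
# Minimal sketch of mask-guided feature enhancement (illustrative assumptions only).
import torch
import torch.nn as nn

class MaskGuidedEnhancer(nn.Module):
    def __init__(self, channels: int = 256):
        super().__init__()
        # Lightweight head predicting a per-pixel text probability map.
        self.seg_head = nn.Sequential(
            nn.Conv2d(channels, channels // 4, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, 1, 1),
        )
        # Projection that embeds the mask back into the feature space.
        self.embed = nn.Conv2d(1, channels, 1)

    def forward(self, feats: torch.Tensor):
        mask = torch.sigmoid(self.seg_head(feats))   # (B, 1, H, W) text mask
        enhanced = feats * mask + self.embed(mask)   # suppress background, add semantic cue
        return enhanced, mask                        # mask can also receive supervision

# Usage: feats from any backbone stage, e.g. torch.randn(2, 256, 32, 128)
```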

Another prominent limitation in multi-task coordination for Scene Text Spotting (STS) lies in how text representations of instances are extracted and refined for the different sub-tasks. Existing methods typically take features from Convolutional Neural Networks (CNNs) and shrink the text regions in the representation to perform the sequential tasks. Their effectiveness, however, is constrained by the contextual biases inherent in the CNN backbone representation. These biases are difficult to filter out, complicating the identification of texts that appear at arbitrary positions and introducing confusion when discerning similar characters within text instances. To address these challenges, we propose the Assembling Text Spotter (ATS). ATS first decouples image contextual information from text structure information through two separate backbones, so that neither has to be filtered out of the other. The two streams are then dynamically and purposefully aligned to generate discriminative representations for the different sub-tasks: Detection Aligning, applied before the detection classifier, globally indicates the scattered text locations, while Recognition Aligning, applied after both the detection and recognition classifiers, expresses text structure details for accurate text recognition. Extensive experiments on existing scene text datasets demonstrate competitive performance across multiple scene text spotting benchmarks.
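The decouple-then-align idea behind ATS can likewise be pictured with a short sketch: two separate backbones encode scene context and text structure, and a cross-attention step aligns them into a task-specific representation. The stand-in backbones, dimensions, and use of multi-head cross-attention here are assumptions for illustration, not the actual ATS design.

```python
# Illustrative sketch of decoupling dual backbones and aligning their features.
import torch
import torch.nn as nn

class DecoupleAndAlign(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.context_backbone = nn.Conv2d(3, dim, 7, stride=4, padding=3)  # stand-in CNN
        self.text_backbone = nn.Conv2d(3, dim, 7, stride=4, padding=3)     # stand-in CNN
        self.align = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, image: torch.Tensor):
        ctx = self.context_backbone(image)        # (B, C, H, W) scene-context features
        txt = self.text_backbone(image)           # (B, C, H, W) text-structure features
        ctx_seq = ctx.flatten(2).transpose(1, 2)  # (B, HW, C)
        txt_seq = txt.flatten(2).transpose(1, 2)  # (B, HW, C)
        # Text-structure tokens query the scene context, yielding a discriminative,
        # task-specific representation for a downstream detection or recognition head.
        aligned, _ = self.align(txt_seq, ctx_seq, ctx_seq)
        return aligned                            # (B, HW, C)
```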

Rights

© 2024, Liang Zhao
