Lexically-Aware Semi-Supervised Learning for OCR Post-Correction

Shruti Rijhwani; Daisy Rosenblum; Antonios Anastasopoulos; Graham Neubig

Vol. 9 (2021)

TACL approved

Lexically-Aware Semi-Supervised Learning for OCR Post-Correction

Published 2022-01-04

Shruti Rijhwani
Daisy Rosenblum
Antonios Anastasopoulos
Graham Neubig

Shruti Rijhwani
Carnegie Mellon University

Daisy Rosenblum
The University of British Columbia

Antonios Anastasopoulos
George Mason University

Graham Neubig
Carnegie Mellon University

Abstract

Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents. Optical character recognition (OCR) can be used to produce digitized text, and previous work has demonstrated the utility of neural post-correction methods that improve the results of general-purpose OCR systems on recognition of less-well-resourced languages. However, these methods rely on manually curated post-correction data, which are relatively scarce compared to the non-annotated raw images that need to be digitized. In this paper, we present a semi-supervised learning method that makes it possible to utilize these raw images to improve performance, specifically through the use of self-training, a technique where a model is iteratively trained on its own outputs. In addition, to enforce consistency in the recognized vocabulary, we introduce a lexically-aware decoding method that augments the neural post-correction model with a count-based language model constructed from the recognized texts, implemented using weighted finite-state automata (WFSA) for efficient and effective decoding. Results on four endangered languages demonstrate the utility of the proposed method, with relative error reductions of 15-29%, where we find the combination of self-training and lexically-aware decoding essential for achieving consistent improvements.

Presented at EMNLP 2021 Article at MIT Press