Automatic Detection and Language Identification of Multilingual Documents

Marco Lui; Jey Han Lau; Timothy Baldwin

Vol. 2 (2014)

TACL approved

Published 2014-02-28

Marco Lui
NICTA VRL University of Melbourne

Jey Han Lau
Department of Philosophy, King's College London

Timothy Baldwin
NICTA VRL University of Melbourne

Abstract

Language identification is the task of automatically detecting the language(s) present in a document based on the content of the document. In this work, we address the problem of detecting documents that contain text from more than one language (multilingual documents). We introduce a method that detects whether a document is multilingual, identifies the languages present, and estimates their relative proportions. We demonstrate the effectiveness of our method on synthetic data, as well as on real-world multilingual documents collected from the web.

PDF (Presented at EACL 2014)