
Automatic Detection and Language Identification of Multilingual Documents

Abstract

Language identification is the task of automatically detecting the language(s) present in a document based on its content. In this work, we address the problem of detecting documents that contain text from more than one language (multilingual documents). We introduce a method that detects whether a document is multilingual, identifies the languages present, and estimates their relative proportions. We demonstrate the effectiveness of our method on synthetic data, as well as on real-world multilingual documents collected from the web.
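To make the three outputs of the task concrete (a multilingual flag, the set of languages, and their relative proportions), the following is a toy sketch only. It is not the method introduced in this work: the stopword lists, the function names, the per-segment voting, and the `min_share` threshold are all assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical toy stopword lists (assumption); a real system would use a trained model.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "fr": {"le", "la", "et", "est", "les", "des"},
    "de": {"der", "die", "und", "ist", "das", "nicht"},
}


def guess_language(segment):
    """Return the language whose stopwords match the most tokens, or None if none match."""
    tokens = segment.lower().split()
    scores = {lang: sum(t in words for t in tokens) for lang, words in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def language_proportions(segments):
    """Estimate each language's share of the document, weighted by token count."""
    weights = Counter()
    for seg in segments:
        lang = guess_language(seg)
        if lang is not None:
            weights[lang] += len(seg.split())
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()} if total else {}


def is_multilingual(proportions, min_share=0.1):
    """Flag a document when more than one language reaches min_share (assumed threshold)."""
    return sum(share >= min_share for share in proportions.values()) > 1


doc = ["the cat is in the garden", "le chat est dans le jardin"]
props = language_proportions(doc)
print(props, is_multilingual(props))
```

Here each segment is labelled independently and the shares are simple token-weighted counts. This breaks down when languages are mixed within a segment, which is one reason a real system needs a more careful model than this sketch.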

PDF (Presented at EACL 2014)