Surface Statistics of an Unknown Language Indicate How to Parse It

Dingquan Wang; Jason Eisner

Vol. 6 (2018)

TACL approved

Published 2018-12-31

Dingquan Wang
Johns Hopkins University

Jason Eisner
Johns Hopkins University

Abstract

We introduce a novel framework for delexicalized dependency parsing in a new language. We show that useful features of the target language can be extracted automatically from an unparsed corpus, which consists only of gold part-of-speech (POS) sequences. Providing these features to our neural parser enables it to parse sequences like those in the corpus. Strikingly, our system has no supervision in the target language. Rather, it is a multilingual system that is trained end-to-end on a variety of other languages, so it learns a feature extractor that works well. We show experimentally across multiple languages: (1) Features computed from the unparsed corpus improve parsing accuracy. (2) Including thousands of synthetic languages in the training achieves further improvement. (3) Despite being computed from unparsed corpora, our learned task-specific features beat previous work's interpretable typological features that require parsed corpora or expert categorization of the language. Our best method improved attachment scores on held-out test languages by an average of 5.6 percentage points over past work that does not inspect the unparsed data (McDonald et al., 2011), and by 20.65 points over past “grammar induction” work that does not use training languages (Naseem et al., 2010).
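To make the idea of "features extracted from an unparsed corpus of POS sequences" concrete, here is a minimal sketch of one simple kind of surface statistic: for each pair of tags, how often the first precedes the second when both occur within a short window. This is an illustrative assumption, not the feature extractor described in the paper (which is learned end-to-end). The function name `directional_stats`, the `window` parameter, and the toy corpus are all hypothetical.

```python
from collections import Counter


def directional_stats(corpus, window=3):
    """Estimate, for each observed ordered tag pair (a, b), the fraction of
    nearby co-occurrences in which a appears before b.

    `corpus` is a list of sentences, each a list of POS tag strings.
    Illustrative only; not the paper's learned feature extractor.
    """
    before = Counter()
    for sent in corpus:
        for i, left in enumerate(sent):
            # Look at up to `window` tags to the right of position i.
            for right in sent[i + 1 : i + 1 + window]:
                if left != right:
                    before[(left, right)] += 1

    stats = {}
    for (a, b), n_ab in before.items():
        n_ba = before.get((b, a), 0)
        stats[(a, b)] = n_ab / (n_ab + n_ba)
    return stats


# Toy corpus of POS sequences (hypothetical data, no trees required).
corpus = [
    ["DET", "NOUN", "VERB"],
    ["DET", "ADJ", "NOUN"],
    ["NOUN", "VERB", "DET", "NOUN"],
]
stats = directional_stats(corpus)
```

A value near 1 for `("DET", "NOUN")` would suggest that determiners tend to precede nouns in this language, which is the sort of word-order cue a parser for an unseen language could plausibly exploit.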

Article at MIT Press (presented at EMNLP 2018)

Author Biography

Dingquan Wang

Computer Science