Replicability Analysis for Natural Language Processing: Testing Significance with Multiple Datasets

Rotem Dror; Gili Baumer; Marina Bogomolov; Roi Reichart

Vol. 5 (2017)

TACL approved

Published 2017-11-27

Rotem Dror
Technion

Gili Baumer
Technion

Marina Bogomolov
Technion

Roi Reichart
Technion

Abstract

With the ever-growing amount of textual data from a large variety of languages, domains, and genres, it has become standard to evaluate NLP algorithms on multiple datasets in order to ensure consistent performance across heterogeneous setups. However, such multiple comparisons pose significant challenges to traditional statistical analysis methods in NLP and can lead to erroneous conclusions. In this paper, we propose a Replicability Analysis framework for a statistically sound analysis of multiple comparisons between algorithms for NLP tasks. We discuss the theoretical advantages of this framework over the current, statistically unjustified practice in the NLP literature and demonstrate its empirical value across four applications: multi-domain dependency parsing, multilingual POS tagging, cross-domain sentiment classification, and word similarity prediction.

PDF (presented at ACL 2018)

Author Biography

Roi Reichart

Assistant Professor, Faculty of Industrial Engineering and Management, The Technion, Israel Institute of Technology.