Comparing Bayesian Models of Annotation

Silviu Paun; Bob Carpenter; Jon Chamberlain; Dirk Hovy; Udo Kruschwitz; Massimo Poesio

Vol. 6 (2018)

TACL approved

Published 2018-12-31

Silviu Paun
Queen Mary University of London

Bob Carpenter
Columbia University

Jon Chamberlain
University of Essex

Dirk Hovy
Bocconi University

Udo Kruschwitz
University of Essex

Massimo Poesio
Queen Mary University of London

Abstract

The analysis of crowdsourced annotations in NLP is concerned with identifying 1) gold standard labels, 2) annotator accuracies and biases, and 3) item difficulties and error patterns. Traditionally, majority voting was used for 1), and coefficients of agreement for 2) and 3). Lately, model-based analyses of corpus annotations have proven better at all three tasks, but there has been relatively little work comparing these models on the same datasets. This paper aims to fill this gap by analyzing six models of annotation, covering different approaches to annotator ability, item difficulty, and parameter pooling (tying) across annotators and items. We evaluate these models along four aspects: comparison to gold labels, predictive accuracy for new annotations, annotator characterization, and item difficulty, using four datasets with varying degrees of noise in the form of random (spammy) annotators. We conclude with guidelines for model selection, application, and implementation.

Article at MIT Press (presented at EMNLP 2018)