The Language Demographics of Amazon Mechanical Turk

Ellie Pavlick; Matt Post; Ann Irvine; Dmitry Kachaev; Chris Callison-Burch

Vol. 2 (2014)

TACL approved

Published 2014-02-28

Ellie Pavlick
University of Pennsylvania

Matt Post
Johns Hopkins University

Ann Irvine
Johns Hopkins University

Dmitry Kachaev
Johns Hopkins University

Chris Callison-Burch
University of Pennsylvania

Abstract

We present a large-scale study of the languages spoken by bilingual workers on Mechanical Turk (MTurk). We establish a methodology for determining the language skills of anonymous crowd workers that is more robust than simple surveying. We validate workers’ self-reported language skill claims by measuring their ability to correctly translate words, and by geolocating workers to see if they reside in countries where the languages are likely to be spoken. Rather than posting a one-off survey, we posted paid tasks consisting of 1,000 assignments to translate a total of 10,000 words in each of 100 languages. Our study ran for several months and was highly visible on the MTurk crowdsourcing platform, increasing the chances that bilingual workers would complete it. Our study was useful both to create bilingual dictionaries and to act as a census of the bilingual speakers on MTurk. We use this data to recommend languages with the largest speaker populations as good candidates for other researchers who want to develop crowdsourced, multilingual technologies. To further demonstrate the value of creating data via crowdsourcing, we hire workers to create bilingual parallel corpora in six Indian languages, and use them to train statistical machine translation systems.

PDF (Presented at ACL 2014)