The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation

Naman Goyal; Cynthia Gao; Vishrav Chaudhary; Peng-Jen Chen; Guillaume Wenzek; Da Ju; Sanjana Krishnan; Marc’Aurelio Ranzato; Francisco Guzmán; Angela Fan

Vol. 10 (2022)

TACL approved

Published 2022-05-04

Naman Goyal
Facebook AI

Cynthia Gao
Facebook AI

Vishrav Chaudhary
Facebook AI

Peng-Jen Chen
Facebook AI

Guillaume Wenzek
Facebook AI

Da Ju
Facebook AI

Sanjana Krishnan
Facebook AI

Marc’Aurelio Ranzato
Facebook AI

Francisco Guzmán
Facebook AI

Angela Fan
Facebook AI

Abstract

One of the biggest challenges hindering progress in low-resource and multilingual machine translation is the lack of good evaluation benchmarks. Current evaluation benchmarks either lack good coverage of low-resource languages, consider only restricted domains, or are of low quality because they are constructed using semi-automatic procedures. In this work, we introduce the FLORES-101 evaluation benchmark, consisting of 3001 sentences extracted from English Wikipedia and covering a variety of topics and domains. These sentences have been translated into 101 languages by professional translators through a carefully controlled process. The resulting dataset enables better assessment of model quality on the long tail of low-resource languages, including the evaluation of many-to-many multilingual translation systems, as all translations are fully aligned. By publicly releasing such a high-quality and high-coverage dataset, we hope to foster progress in the machine translation community and beyond.
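
The point about full alignment can be illustrated with a minimal sketch. It assumes a toy in-memory layout in which every language maps to a list of sentences and index i refers to the same source sentence in each list. The language codes, sentences, and function names below are hypothetical and are not taken from the released dataset or its tooling.

```python
from itertools import permutations

# Toy stand-in for an aligned benchmark (illustrative data only):
# index i holds the translation of the same source sentence in every language.
aligned = {
    "eng": ["Hello.", "Thank you."],
    "fra": ["Bonjour.", "Merci."],
    "spa": ["Hola.", "Gracias."],
}

def translation_pairs(corpus, src, tgt):
    """Return (source, reference) sentence pairs for one translation direction."""
    if len(corpus[src]) != len(corpus[tgt]):
        raise ValueError("corpora are not aligned")
    return list(zip(corpus[src], corpus[tgt]))

def all_directions(languages):
    """Enumerate every ordered (source, target) pair of distinct languages."""
    return list(permutations(sorted(languages), 2))

directions = all_directions(aligned)
# 3 languages give 3 * 2 = 6 directions; 101 languages give 101 * 100 = 10,100.
```

Because every language shares the same sentence indices, any direction, including ones that do not involve English, can be evaluated without an extra alignment step.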

Presented at ACL 2022. Article at MIT Press.