MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages

Xinyu Zhang; Nandan Thakur; Odunayo Ogundepo; Ehsan Kamalloo; David Alfonso Hermelo; Xiaoguang Li; Qun Liu; Mehdi Rezagholizadeh; Jimmy Lin

Vol. 11 (2023)

TACL approved

Published 2023-09-07

Xinyu Zhang
David R. Cheriton School of Computer Science, University of Waterloo, Canada

Nandan Thakur
David R. Cheriton School of Computer Science, University of Waterloo, Canada

Odunayo Ogundepo
David R. Cheriton School of Computer Science, University of Waterloo, Canada

Ehsan Kamalloo
Department of Computing Science, University of Alberta, Canada; Huawei Noah’s Ark Lab

David Alfonso Hermelo
Huawei Noah’s Ark Lab

Xiaoguang Li
Huawei Noah’s Ark Lab

Qun Liu
Huawei Noah’s Ark Lab

Mehdi Rezagholizadeh
Huawei Noah’s Ark Lab

Jimmy Lin
David R. Cheriton School of Computer Science, University of Waterloo, Canada

Abstract

MIRACL is a multilingual dataset for ad hoc retrieval across 18 languages that collectively encompass over three billion native speakers around the world. This resource is designed to support monolingual retrieval tasks, where the queries and the corpora are in the same language. We have gathered over 726k high-quality relevance judgments for 78k queries over Wikipedia in these languages, with all annotations performed by native speakers hired by our team. MIRACL covers languages that are both typologically close and typologically distant, spanning 10 language families and 13 sub-families, with varying amounts of publicly available resources. Extensive automatic heuristic verification and manual assessments were performed during the annotation process to control data quality. In total, MIRACL represents an investment of around five person-years of human annotator effort. Our goal is to spur research on improving retrieval across a continuum of languages, thus enhancing information access capabilities for diverse populations around the world, particularly those that have traditionally been underserved. MIRACL is available at http://miracl.ai/.

Article at MIT Press. Presented at NAACL 2024.