Natural Questions: a Benchmark for Question Answering Research

Tom Kwiatkowski; Jennimaria Palomaki; Olivia Redfield; Michael Collins; Ankur Parikh; Chris Alberti; Danielle Epstein; Illia Polosukhin; Jacob Devlin; Kenton Lee; Kristina Toutanova; Llion Jones; Matthew Kelcey; Ming-Wei Chang; Andrew Dai; Jakob Uszkoreit; Quoc Le; Slav Petrov

Vol. 7 (2019)

TACL approved

Published 2019-08-02

Tom Kwiatkowski
Google Inc.

Jennimaria Palomaki
Google Inc.

Olivia Redfield
Google Inc.

Michael Collins
Columbia University, Google Inc.

Ankur Parikh
Google Inc.

Chris Alberti
Google Inc.

Danielle Epstein

Illia Polosukhin

Jacob Devlin
Google Inc.

Kenton Lee
Google Inc.

Kristina Toutanova
Google Inc.

Llion Jones
Google Inc.

Matthew Kelcey

Ming-Wei Chang
Google Inc.

Andrew Dai
Google Inc.

Jakob Uszkoreit
Google Inc.

Quoc Le
Google Inc.

Slav Petrov
Google Inc.

Abstract

We present the Natural Questions corpus, a question answering dataset. Questions consist of real, anonymized, aggregated queries issued to the Google search engine. An annotator is presented with a question along with a Wikipedia page from the top 5 search results, and annotates a long answer (typically a paragraph) and a short answer (one or more entities) if present on the page, or marks null if no long/short answer is present. The public release consists of 307,373 training examples with single annotations; 7,830 examples with 5-way annotations for development data; and a further 7,842 examples, also 5-way annotated, sequestered as test data. We present experiments validating the quality of the data. We also describe an analysis of 25-way annotations on 302 examples, giving insights into human variability on the annotation task. We introduce robust metrics for the purposes of evaluating question answering systems; demonstrate high human upper bounds on these metrics; and establish baseline results using competitive methods drawn from related literature.
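The annotation scheme described in the abstract can be pictured as a small data structure. The Python sketch below is illustrative only: the names `Annotation`, `Example`, `count_non_null`, and `RELEASE_SIZES`, the token-offset span representation, and the sample question are all assumptions made for this example, not the format of the released data or the metrics defined in the paper. It models a long answer, zero or more short answers, and null annotations, and counts how many annotators gave a non-null answer on a multiply-annotated example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical span representation: (start, end) token offsets, end exclusive.
Span = Tuple[int, int]


@dataclass
class Annotation:
    """One annotator's judgment for a (question, Wikipedia page) pair."""
    long_answer: Optional[Span] = None  # None means "no long answer on the page"
    short_answers: List[Span] = field(default_factory=list)  # empty means null

    @property
    def has_long_answer(self) -> bool:
        return self.long_answer is not None

    @property
    def has_short_answer(self) -> bool:
        return bool(self.short_answers)


@dataclass
class Example:
    question: str
    annotations: List[Annotation]  # 1 for training, 5 for dev/test per the abstract


def count_non_null(example: Example) -> Tuple[int, int]:
    """Return (annotators with a long answer, annotators with a short answer)."""
    long_n = sum(a.has_long_answer for a in example.annotations)
    short_n = sum(a.has_short_answer for a in example.annotations)
    return long_n, short_n


# Split sizes as stated in the abstract.
RELEASE_SIZES = {"train": 307_373, "dev": 7_830, "test": 7_842}

if __name__ == "__main__":
    # A made-up 5-way annotated example; the offsets are arbitrary.
    ex = Example(
        question="who founded google",
        annotations=[
            Annotation(long_answer=(10, 80), short_answers=[(12, 14), (15, 17)]),
            Annotation(long_answer=(10, 80), short_answers=[(12, 14)]),
            Annotation(long_answer=(10, 80)),
            Annotation(),
            Annotation(),
        ],
    )
    print(count_non_null(ex))
    print(sum(RELEASE_SIZES.values()))
```

Keeping each annotator's judgment separate, rather than merging them, preserves the disagreement between annotators that the 5-way and 25-way analyses are meant to expose.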

Article at MIT Press (presented at ACL 2019)