PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them

Patrick Lewis; Yuxiang Wu; Linqing Liu; Pasquale Minervini; Heinrich Küttler; Aleksandra Piktus; Pontus Stenetorp; Sebastian Riedel

Vol. 9 (2021)

TACL approved

Published 2022-01-04

Patrick Lewis
University College London; Facebook AI Research

Yuxiang Wu
University College London

Linqing Liu
University College London

Pasquale Minervini
University College London

Heinrich Küttler
Facebook AI Research

Aleksandra Piktus
Facebook AI Research

Pontus Stenetorp
University College London

Sebastian Riedel
University College London; Facebook AI Research

Abstract

Open-domain Question Answering models that directly leverage question-answer (QA) pairs, such as closed-book QA (CBQA) models and QA-pair retrievers, show promise in terms of speed and memory compared to conventional models that retrieve and read from text corpora. QA-pair retrievers also offer interpretable answers and a high degree of control, and are trivial to update at test time with new knowledge. However, these models fall short of the accuracy of retrieve-and-read systems, as substantially less knowledge is covered by the available QA-pairs relative to text corpora like Wikipedia. To facilitate improved QA-pair models, we introduce Probably Asked Questions (PAQ), a very large resource of 65M automatically generated QA-pairs. We introduce a new QA-pair retriever, RePAQ, to complement PAQ. We find that PAQ pre-empts and caches test questions, enabling RePAQ to match the accuracy of recent retrieve-and-read models, whilst being significantly faster. Using PAQ, we train CBQA models which outperform comparable baselines by 5%, but trail RePAQ by over 15%, indicating the effectiveness of explicit retrieval. RePAQ can be configured for size (under 500MB) or speed (over 1K questions per second) whilst retaining high accuracy. Lastly, we demonstrate RePAQ’s strength at selective QA, abstaining from answering when it is likely to be incorrect. This enables RePAQ to “back off” to a more expensive state-of-the-art model, leading to a combined system which is both more accurate and 2x faster than the state-of-the-art model alone.
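
The selective QA and back-off idea in the abstract can be illustrated with a minimal sketch. It is not RePAQ itself: it assumes a toy bag-of-words cosine similarity in place of a learned retriever, and the names `TOY_QA_PAIRS`, `toy_fallback`, `answer_with_backoff`, and the 0.6 threshold are hypothetical choices made only for illustration.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Tuple

QAPair = Tuple[str, str]  # (stored question, stored answer)

# Hypothetical stand-in for a QA-pair resource such as PAQ.
TOY_QA_PAIRS: List[QAPair] = [
    ("who wrote hamlet", "William Shakespeare"),
    ("what is the capital of france", "Paris"),
]


def bow(text: str) -> Counter:
    """Lowercased bag-of-words counts (assumed toy representation)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, qa_pairs: List[QAPair]) -> Tuple[str, float]:
    """Return the answer of the most similar stored question and its score."""
    query = bow(question)
    best_answer, best_score = "", 0.0
    for stored_question, answer in qa_pairs:
        score = cosine(query, bow(stored_question))
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer, best_score


def toy_fallback(question: str) -> str:
    """Hypothetical placeholder for a slower, more accurate QA model."""
    return "<answer from slower model>"


def answer_with_backoff(
    question: str,
    qa_pairs: List[QAPair],
    fallback: Callable[[str], str],
    threshold: float = 0.6,  # assumed confidence cut-off
) -> Tuple[str, str]:
    """Answer from the QA-pair store when confident, else defer to fallback."""
    answer, score = retrieve(question, qa_pairs)
    if score >= threshold:
        return answer, "qa-pair retriever"
    return fallback(question), "fallback"


if __name__ == "__main__":
    for q in ["Who wrote Hamlet?", "How tall is Mount Everest?"]:
        print(q, "->", answer_with_backoff(q, TOY_QA_PAIRS, toy_fallback))
```

In this sketch the cheap retriever answers questions that closely match a stored QA pair and abstains otherwise, so the expensive model is only called for the remainder. Under these assumptions, raising the threshold trades speed for fewer confident but wrong answers.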

Presented at EMNLP 2021. Article at MIT Press.

Author Biography

Patrick Lewis

Patrick Lewis is a final-year PhD student splitting his time between University College London and Facebook AI Research, supervised by Sebastian Riedel and Pontus Stenetorp.

Patrick's research focuses on knowledge-intensive natural language processing (NLP). Specifically, he studies how to represent, store, and access knowledge, and how to build more powerful, efficient, and robust models for knowledge-intensive NLP tasks such as question answering.