Questions Are All You Need to Train a Dense Passage Retriever

Devendra Singh Sachan; Mike Lewis; Dani Yogatama; Luke Zettlemoyer; Joelle Pineau; Manzil Zaheer

Vol. 11 (2023)

TACL approved

Published 2023-06-24

Devendra Singh Sachan
McGill University, Mila - Quebec AI Institute

Mike Lewis
Meta Platforms, FAIR

Dani Yogatama
University of Southern California, Google DeepMind

Luke Zettlemoyer
University of Washington, Meta Platforms, FAIR

Joelle Pineau
McGill University, Mila - Quebec AI Institute, Meta Platforms, FAIR

Manzil Zaheer
Google DeepMind

Abstract

We introduce ART, a new corpus-level autoencoding approach for training dense retrieval models that does not require any labeled training data. Dense retrieval is a central challenge for open-domain tasks, such as Open QA, where state-of-the-art methods typically require large supervised datasets with custom hard-negative mining and denoising of positive examples. ART, in contrast, only requires access to unpaired inputs and outputs (e.g. questions and potential answer passages). It uses a new passage retrieval autoencoding scheme, where (1) an input question is used to retrieve a set of evidence passages, and (2) the passages are then used to compute the probability of reconstructing the original question. Training for retrieval based on question reconstruction enables effective unsupervised learning of both passage and question encoders, which can be later incorporated into complete Open QA systems without any further finetuning. Extensive experiments demonstrate that ART obtains state-of-the-art results on multiple QA retrieval benchmarks with only generic initialization from a pretrained language model, removing the need for labeled data and task-specific losses.
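The two-step scheme described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper's code: the embeddings are hand-written vectors, `toy_reconstruction_logprob` is a hypothetical word-overlap stand-in for a pretrained language model scoring the question given a passage, and the KL-divergence objective is one plausible way to turn question reconstruction into a training signal for the retriever.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve_top_k(q_vec, passage_vecs, k):
    # Step (1): score every passage against the question, keep the top k.
    scores = [dot(q_vec, p) for p in passage_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return ranked, [scores[i] for i in ranked]

def toy_reconstruction_logprob(question, passage):
    # Hypothetical stand-in for log p(question | passage): a smoothed
    # unigram model built from the passage. A real system would use a
    # pretrained language model here.
    p_tokens = passage.lower().split()
    vocab_size = len(set(p_tokens)) + 1
    logprob = 0.0
    for tok in question.lower().split():
        count = p_tokens.count(tok)
        logprob += math.log((count + 1) / (len(p_tokens) + vocab_size))
    return logprob

def autoencoding_loss(question, q_vec, passages, passage_vecs, k=2):
    idx, retriever_scores = retrieve_top_k(q_vec, passage_vecs, k)
    retriever_dist = softmax(retriever_scores)
    # Step (2): how well does each retrieved passage reconstruct the question?
    recon = [toy_reconstruction_logprob(question, passages[i]) for i in idx]
    target_dist = softmax(recon)
    # Assumed objective: KL(target || retriever) over the retrieved set.
    kl = sum(t * math.log(t / r) for t, r in zip(target_dist, retriever_dist))
    return idx, kl

passages = [
    "paris is the capital of france",
    "the moon orbits the earth",
    "berlin is the capital of germany",
]
passage_vecs = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.2]]  # made-up embeddings
question = "what is the capital of france"
q_vec = [1.0, 0.1]  # made-up question embedding

retrieved, loss = autoencoding_loss(question, q_vec, passages, passage_vecs, k=2)
print(retrieved, loss)
```

In this sketch the loss is computed from the question and the unlabeled passages alone, so no question-passage labels are needed. In an actual training loop the loss would be backpropagated into the question and passage encoders, which the toy vectors here do not model.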

Article at MIT Press

Presented at ACL 2023

Author Biography

Devendra Singh Sachan

Computer Science Department