Reasoning over Public and Private Data in Retrieval-Based Systems

Simran Arora; Patrick Lewis; Angela Fan; Jacob Kahn; Christopher Ré

Vol. 11 (2023)

TACL approved

Published 2023-08-10

Simran Arora
Stanford University

Patrick Lewis
FAIR

Angela Fan
FAIR

Jacob Kahn
FAIR

Christopher Ré
Stanford University

Abstract

Users and organizations are generating ever-increasing amounts of private data from a wide range of sources. Incorporating private context is important to personalize open-domain tasks such as question answering, fact-checking, and personal assistants. State-of-the-art systems for these tasks explicitly retrieve information that is relevant to an input question from a background corpus before producing an answer. While today's retrieval systems assume relevant corpora are fully (e.g., publicly) accessible, users are often unable or unwilling to expose their private data to entities hosting public data. We define the Public-Private Autoregressive Information Retrieval (PAIR) problem involving retrieval over multiple privacy scopes. We introduce a foundational benchmark with which to study PAIR, as no existing benchmark includes data from a private distribution. Our dataset, ConcurrentQA, includes data from distinct public and private distributions and is the first textual QA benchmark requiring concurrent retrieval over multiple distributions. Finally, we show that existing retrieval approaches suffer significant performance degradation when applied to our proposed retrieval setting, and we investigate approaches for mitigating these tradeoffs. We release the QA system and new benchmark.

Article at MIT Press. Presented at EMNLP 2023.