
Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering

Abstract

Instruction-following models are attractive alternatives to fine-tuned approaches for question answering (QA). By simply prepending relevant documents and an instruction to their input, these models can be adapted to various information domains and tasks without additional fine-tuning. However, these models tend to produce verbose responses with supplementary information, which makes traditional QA metrics such as exact match (EM) and F1 unreliable for accurately quantifying model performance. In this work, we evaluate instruction-following models along two fronts: 1) how well they satisfy the user's information need (correctness), and 2) whether they disseminate information supported by the provided knowledge (faithfulness). Guided by human evaluation and analysis, we highlight the shortcomings of traditional metrics for both correctness and faithfulness and propose simple token-overlap metrics that correlate highly with human judgments. Our analysis reveals that instruction-following models can outperform fine-tuned models for correctness. However, they struggle to accurately judge the relevance of the provided knowledge and often hallucinate in their responses. We hope our work encourages more holistic evaluation of instruction-following models for QA.
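The abstract does not spell out the proposed token-overlap metrics, so the sketch below is only an illustrative example of the kind of measure described: a token-level recall of the reference answer within the model's response, which, unlike EM, does not penalize a verbose but correct answer. The function names and normalization scheme are assumptions for illustration, not the authors' implementation.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    # Assumed normalization: lowercase, strip punctuation, split on whitespace.
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()


def token_recall(reference: str, response: str) -> float:
    """Fraction of reference-answer tokens that also appear in the model response."""
    ref_counts = Counter(normalize(reference))
    resp_counts = Counter(normalize(response))
    if not ref_counts:
        return 0.0
    overlap = sum(min(n, resp_counts[tok]) for tok, n in ref_counts.items())
    return overlap / sum(ref_counts.values())


# A verbose but correct answer receives full recall, whereas EM would score it 0.
print(token_recall("Paris", "The capital of France is Paris, which is also its largest city."))
```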

Article at MIT Press. Presented at ACL 2024.