Knowledge-grounded dialogue systems powered by large language models often generate responses that, while fluent, are not attributable to a relevant source of information. Progress towards models that do not exhibit this issue requires evaluation metrics that can quantify its prevalence. To this end, we introduce the Benchmark for Evaluation of Grounded INteraction (BEGIN), comprising 12k dialogue turns generated by neural dialogue systems trained on three knowledge-grounded dialogue corpora. We collect human annotations assessing the extent to which the models' responses can be attributed to the given background information. We then use BEGIN to analyze eight evaluation metrics. We find that these metrics rely on spurious correlations, do not reliably distinguish attributable abstractive responses from unattributable ones, and perform substantially worse when the knowledge source is longer. Our findings underscore the need for more sophisticated and robust evaluation metrics for knowledge-grounded dialogue. We make BEGIN publicly available at https://github.com/google/BEGIN-dataset.
Nouha Dziri is a Ph.D. student at the University of Alberta, working within the Alberta Machine Intelligence Institute under the supervision of Dr. Osmar Zaiane. Her research interests revolve around generative deep learning models and conversational dialogue systems. In particular, her work focuses on modelling an intelligent agent that can hold open-ended conversations indistinguishable from human ones. Before her Ph.D., she completed an MSc degree in Computer Science at the University of Alberta, where she worked on building coherent dialogue systems. She has interned at Google AI and Microsoft Research, where she worked with brilliant researchers in the field on dialogue generation modelling and evaluation.