Hallucinations in Large Multilingual Translation Models

Nuno Miguel Guerreiro; Duarte M. Alves; Jonas Waldendorf; Barry Haddow; Alexandra Birch; Pierre Colombo; André F.T. Martins

Vol. 11 (2023)

TACL approved

Published 2023-12-17

Nuno M. Guerreiro
Instituto de Telecomunicações; Instituto Superior Técnico, University of Lisbon

Duarte M. Alves
Instituto de Telecomunicações/Instituto Superior Técnico, Lisboa; Portugal

Jonas Waldendorf
School of Informatics, University of Edinburgh; United Kingdom

Barry Haddow
School of Informatics, University of Edinburgh; United Kingdom

Alexandra Birch
School of Informatics, University of Edinburgh; United Kingdom

Pierre Colombo
MICS, CentraleSupélec, Université Paris-Saclay; France

André F.T. Martins
Instituto de Telecomunicações/Instituto Superior Técnico, Lisboa/Unbabel; Portugal

Abstract

Hallucinated translations can severely undermine user trust and raise safety issues when machine translation systems are deployed in the wild. Previous research on the topic focused on small bilingual models trained on high-resource languages, leaving a gap in our understanding of hallucinations in multilingual models across diverse translation scenarios. In this work, we fill this gap by conducting a comprehensive analysis (over 100 language pairs across various resource levels, going beyond English-centric directions) of both the M2M neural machine translation (NMT) models and GPT large language models (LLMs). Among several insights, we highlight that models struggle with hallucinations primarily in low-resource directions and when translating out of English, where, critically, they may reveal toxic patterns that can be traced back to the training data. We also find that LLMs produce hallucinations that are qualitatively different from those of NMT models. Finally, we show that hallucinations are hard to reverse by merely scaling models trained with the same data. However, employing more diverse models, trained on different data or with different procedures, as fallback systems can improve translation quality and virtually eliminate certain pathologies.

Presented at EMNLP 2023. Article at MIT Press.