
Instructed to Bias: Instruction-Tuned Language Models Exhibit Emergent Cognitive Bias

Abstract

  Recent studies show that instruction tuning (IT) and reinforcement
learning from human feedback (RLHF) dramatically improve the abilities
of large language models (LMs). While these tuning methods can help
align models with human objectives and generate high-quality text, not
much is known about their potential adverse effects. In this work, we
investigate the effect of IT and RLHF on decision-making and reasoning
in LMs, focusing on three cognitive biases—the decoy effect, the
certainty effect, and the belief bias—all of which are known to
influence human decision-making and reasoning. Our findings highlight
the presence of these biases in various models from the GPT-3, Mistral,
and T5 families. Notably, we find a stronger presence of biases in
models that have undergone instruction tuning, such as Flan-T5,
Mistral-Instruct, GPT-3.5, and GPT-4. Our work constitutes a step toward
comprehending cognitive biases in instruction-tuned LMs, which is
crucial for the development of more reliable and unbiased language
models.

Presented at ACL 2024. Article at MIT Press.

Author Biography

Gabriel

CS faculty at the Hebrew University

Nir

Assistant professor of computer science at the Technion.

Yonatan

I’m a faculty member at the Technion Taub Faculty of Computer Science and a former Azrieli Faculty Fellow.