Instructed to Bias: Instruction-Tuned Language Models Exhibit Emergent Cognitive Bias
Abstract
Recent studies show that instruction tuning (IT) and reinforcement
learning from human feedback (RLHF) improve the abilities of large
language models (LMs) dramatically. While these tuning methods can help
align models with human objectives and generate high-quality text, not
much is known about their potential adverse effects. In this work, we
investigate the effect of IT and RLHF on decision-making and reasoning
in LMs, focusing on three cognitive biases—the decoy effect, the
certainty effect, and the belief bias—all of which are known to
influence human decision-making and reasoning. Our findings highlight
the presence of these biases in various models from the GPT-3, Mistral,
and T5 families. Notably, we find a stronger presence of biases in
models that have undergone instruction tuning, such as Flan-T5,
Mistral-Instruct, GPT-3.5, and GPT-4. Our work constitutes a step toward
comprehending cognitive biases in instruction-tuned LMs, which is
crucial for the development of more reliable and unbiased language
models.
Author Biography
Gabriel
CS faculty at the Hebrew University
Nir
Assistant professor of computer science at the Technion.
Yonatan
I’m a faculty member at the Technion Taub Faculty of Computer Science and a former Azrieli Faculty Fellow.