Why Does Surprisal From Larger Transformer-Based Language Models Provide a Poorer Fit to Human Reading Times?

Byung-Doh Oh; William Schuler

Vol. 11 (2023)

TACL approved

Published 2023-03-27

Byung-Doh Oh
The Ohio State University

William Schuler
The Ohio State University

Abstract

This work presents results using multiple large pretrained language models showing that models with more parameters and lower perplexity nonetheless yield surprisal estimates that are less predictive of human reading times, replicating and expanding upon earlier results limited to just GPT-2 (Oh et al., 2022). First, regression analyses show a strictly monotonic, positive log-linear relationship between perplexity and fit to reading times for the more recently released five GPT-Neo variants and eight OPT variants on two separate datasets, providing strong empirical support for this trend. Subsequently, analysis of residual errors reveals a systematic deviation of the larger variants, such as underpredicting reading times of named entities and making compensatory overpredictions for reading times of function words such as modals and conjunctions. These results suggest that the propensity of larger Transformer-based models to 'memorize' sequences during training makes their surprisal estimates diverge from humanlike expectations, which warrants caution in using pretrained language models to study human language processing.
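For readers unfamiliar with the two quantities the abstract relates, the following is a minimal sketch assuming the standard definitions: surprisal as the negative log probability of a token given its context, and perplexity as the exponentiated mean surprisal over a sequence. The function names and the probability values are hypothetical illustrations, not the paper's code or data.

```python
import math

def surprisal_bits(prob):
    """Surprisal of one token, in bits: -log2 P(token | context)."""
    return -math.log2(prob)

def perplexity(probs):
    """Perplexity of a sequence: 2 ** (mean per-token surprisal in bits)."""
    mean_surprisal = sum(surprisal_bits(p) for p in probs) / len(probs)
    return 2 ** mean_surprisal

# Hypothetical conditional probabilities that two models assign to the
# same four-token sequence (made-up numbers, for illustration only).
smaller_model = [0.25, 0.5, 0.125, 0.25]
larger_model = [0.5, 0.5, 0.5, 0.25]

for name, probs in [("smaller", smaller_model), ("larger", larger_model)]:
    per_token = [surprisal_bits(p) for p in probs]
    print(name, per_token, round(perplexity(probs), 3))
```

In this toy setup the model that assigns higher probabilities has lower perplexity. The abstract's claim is that, for the pretrained models studied, lower perplexity went along with per-token surprisal that was less predictive of human reading times.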

Presented at ACL 2023. Article at MIT Press.