Geographic Adaptation of Pretrained Language Models

Valentin Hofmann; Goran Glavaš; Nikola Ljubešić; Janet Pierrehumbert; Hinrich Schütze

Vol. 12 (2024)

TACL approved

Published 2024-04-22

Valentin Hofmann
University of Oxford, LMU Munich

Goran Glavaš
University of Würzburg

Nikola Ljubešić
Jožef Stefan Institute

Janet Pierrehumbert
University of Oxford

Hinrich Schütze
LMU Munich

Abstract

While pretrained language models (PLMs) have been shown to possess a plethora of linguistic knowledge, the existing body of research has largely neglected extralinguistic knowledge, which is generally difficult to obtain by pretraining on text alone. Here, we contribute to closing this gap by examining geolinguistic knowledge, i.e., knowledge about geographic variation in language. We introduce geoadaptation, an intermediate training step that couples language modeling with geolocation prediction in a multi-task learning setup. We geoadapt four PLMs, covering language groups from three geographic areas, and evaluate them on five different tasks: fine-tuned (i.e., supervised) geolocation prediction, zero-shot (i.e., unsupervised) geolocation prediction, fine-tuned language identification, zero-shot language identification, and zero-shot prediction of dialect features. Geoadaptation is very successful at injecting geolinguistic knowledge into the PLMs: the geoadapted PLMs consistently outperform PLMs adapted using only language modeling (by especially wide margins on zero-shot prediction tasks), and we obtain new state-of-the-art results on two benchmarks for geolocation prediction and language identification. Furthermore, we show that the effectiveness of geoadaptation stems from its ability to geographically retrofit the representation space of the PLMs.
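
The multi-task setup described in the abstract, language modeling coupled with geolocation prediction, can be illustrated with a minimal sketch of a combined training objective. Everything below is an assumption made for illustration and is not taken from the paper: the function names, the squared-error geolocation loss over (longitude, latitude) pairs, and the fixed `geo_weight` used to balance the two tasks are all hypothetical, and the authors' actual loss weighting may differ.

```python
import math


def masked_lm_loss(token_log_probs):
    """Mean negative log-likelihood of the correct tokens at masked positions."""
    return -sum(token_log_probs) / len(token_log_probs)


def geolocation_loss(predicted, target):
    """Squared Euclidean distance between predicted and true (lon, lat).

    Assumption: a simple regression loss; the paper may use a different one.
    """
    return sum((p - t) ** 2 for p, t in zip(predicted, target))


def multitask_loss(token_log_probs, predicted, target, geo_weight=1.0):
    """Language modeling loss plus a weighted geolocation loss.

    Assumption: a fixed scalar weight; other balancing schemes are possible.
    """
    lm = masked_lm_loss(token_log_probs)
    geo = geolocation_loss(predicted, target)
    return lm + geo_weight * geo


# Toy example: two masked tokens and one location prediction.
log_probs = [math.log(0.5), math.log(0.25)]
loss = multitask_loss(log_probs, predicted=(11.0, 48.0), target=(11.5, 48.0), geo_weight=2.0)
print(round(loss, 4))
```

In this sketch, a model adapted with language modeling alone corresponds to `geo_weight=0.0`, which reduces the objective to the masked language modeling term.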

Article at MIT Press. Presented at ACL 2024.