Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs

Emanuele Bugliarello; Ryan Cotterell; Naoaki Okazaki; Desmond Elliott

Vol. 9 (2021)

TACL approved

Published 2022-01-04

Emanuele Bugliarello
University of Copenhagen

Ryan Cotterell
ETH Zurich, University of Cambridge

Naoaki Okazaki
Tokyo Institute of Technology

Desmond Elliott
University of Copenhagen

Abstract

Large-scale pretraining and task-specific fine-tuning is now the standard methodology for many tasks in computer vision and natural language processing. Recently, a multitude of methods have been proposed for pretraining vision-and-language (V&L) BERTs to tackle challenges at the intersection of these two key areas of AI. These models can be categorised into either single-stream or dual-stream encoders. We study the differences between these two categories, and show how they can be unified under a single theoretical framework. We then conduct controlled experiments to discern the empirical differences between five V&L BERTs. Our experiments show that training data and hyperparameters are responsible for most of the differences between the reported results, but they also reveal that the embedding layer plays a crucial role in these massive models.

Article at MIT Press

Presented at EMNLP 2021