
CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

Abstract

Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, we present Canine, a neural encoder that operates directly on character sequences -- without explicit tokenization or vocabulary -- and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, Canine combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. Canine outperforms a comparable mBERT model by 5.7 F1 on TyDi QA, a challenging multilingual benchmark, despite having fewer model parameters.
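The core architectural idea in the abstract, character-level input, downsampling to shorten the sequence, and a deep transformer stack over the shortened sequence, can be illustrated with a small sketch. The PyTorch module below is not the authors' implementation: the hash-based character embedding, the convolutional downsampler, and every hyperparameter (hash buckets, downsampling rate, layer counts) are illustrative assumptions standing in for the configuration described in the full paper.

```python
# A minimal sketch (not the authors' code) of the downsample-then-go-deep idea
# described in the abstract. All hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class CharacterHashEmbedding(nn.Module):
    """Embeds raw Unicode code points without a fixed vocabulary by hashing
    each code point into a small set of buckets (an assumed stand-in for a
    vocabulary-free character embedding)."""

    def __init__(self, num_hashes=8, buckets=16000, dim=256):
        super().__init__()
        self.buckets = buckets
        self.tables = nn.ModuleList(
            [nn.Embedding(buckets, dim // num_hashes) for _ in range(num_hashes)]
        )
        # Fixed random multipliers make each hash function distinct.
        self.register_buffer(
            "multipliers",
            torch.randint(1, 2**31 - 1, (num_hashes,), dtype=torch.long),
        )

    def forward(self, codepoints):  # codepoints: (batch, seq_len) int64
        pieces = []
        for i, table in enumerate(self.tables):
            bucket_ids = (codepoints * self.multipliers[i]) % self.buckets
            pieces.append(table(bucket_ids))
        return torch.cat(pieces, dim=-1)  # (batch, seq_len, dim)


class CanineLikeEncoder(nn.Module):
    """Characters -> shallow contextualization -> strided downsampling ->
    deep transformer stack over the shorter sequence."""

    def __init__(self, dim=256, heads=4, deep_layers=4, downsample_rate=4):
        super().__init__()
        self.embed = CharacterHashEmbedding(dim=dim)
        self.shallow = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=1
        )
        # A strided convolution shortens the sequence by `downsample_rate`.
        self.downsample = nn.Conv1d(
            dim, dim, kernel_size=downsample_rate, stride=downsample_rate
        )
        self.deep = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True),
            num_layers=deep_layers,
        )

    def forward(self, codepoints):
        chars = self.shallow(self.embed(codepoints))     # (B, L, D)
        short = self.downsample(chars.transpose(1, 2))   # (B, D, L // rate)
        return self.deep(short.transpose(1, 2))          # (B, L // rate, D)


if __name__ == "__main__":
    text = "Tokenization-free: ラーメン 🍜"
    ids = torch.tensor([[ord(c) for c in text]])  # raw code points, no tokenizer
    encoder = CanineLikeEncoder()
    print(encoder(ids).shape)  # roughly len(text) // 4 positions, dim 256
```

The design choice the sketch tries to convey is the trade-off the abstract names: self-attention cost grows quadratically with sequence length, so running most of the layers on the downsampled sequence is what makes operating directly on characters both effective and efficient.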
Article at MIT Press
Presented at ACL 2022