Sub-Character Tokenization for Chinese Pretrained Language Models

Chenglei Si; Zhengyan Zhang; Yingfa Chen; Fanchao Qi; Xiaozhi Wang; Zhiyuan Liu; Yasheng Wang; Qun Liu; Maosong Sun

Vol. 11 (2023)

TACL approved

Published 2023-05-24

Chenglei Si
University of Maryland

Zhengyan Zhang
Tsinghua University

Yingfa Chen
Tsinghua University

Fanchao Qi
Tsinghua University

Xiaozhi Wang
Tsinghua University

Zhiyuan Liu
Tsinghua University

Yasheng Wang
Noah’s Ark Lab, Huawei Technologies

Qun Liu
Noah’s Ark Lab, Huawei Technologies

Maosong Sun
Tsinghua University

Abstract

Tokenization is fundamental to pretrained language models (PLMs). Existing tokenization methods for Chinese PLMs typically treat each character as an indivisible token. However, they ignore the unique feature of the Chinese writing system where additional linguistic information exists below the character level, i.e., at the sub-character level. To utilize such information, we propose sub-character (SubChar for short) tokenization. Specifically, we first encode the input text by converting each Chinese character into a short sequence based on its glyph or pronunciation, and then construct the vocabulary based on the encoded text with sub-word segmentation. Experimental results show that SubChar tokenizers have two main advantages over existing tokenizers: 1) They can tokenize inputs into much shorter sequences, thus improving the computational efficiency. 2) Pronunciation-based SubChar tokenizers can encode Chinese homophones into the same transliteration sequences and produce the same tokenization output, hence being robust to homophone typos. At the same time, models trained with SubChar tokenizers perform competitively on downstream tasks. We release our code and models at https://github.com/thunlp/SubCharTokenization to facilitate future work.
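The two-step pipeline described in the abstract (encode each character, then build a vocabulary over the encoded text) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the released code: the five-entry `TOY_PINYIN` table, the `#` separator, the function names, and the use of a toy BPE-style merge loop as a stand-in for the sub-word segmentation step.

```python
from collections import Counter

# Hypothetical toy pronunciation table (pinyin plus tone digit). A real
# tokenizer would need a full character-to-pronunciation dictionary.
TOY_PINYIN = {
    "他": "ta1",
    "她": "ta1",
    "在": "zai4",
    "再": "zai4",
    "家": "jia1",
}

SEP = "#"  # assumed boundary marker appended after each character's encoding


def encode(text):
    """Replace each known character with its transliteration followed by SEP."""
    return "".join(
        TOY_PINYIN[ch] + SEP if ch in TOY_PINYIN else ch for ch in text
    )


def merge_pair(seq, a, b):
    """Fuse every adjacent occurrence of symbols (a, b) into one symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_merges(encoded_corpus, num_merges):
    """Toy BPE-style loop: repeatedly merge the most frequent adjacent pair."""
    seqs = [list(s) for s in encoded_corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        seqs = [merge_pair(seq, a, b) for seq in seqs]
    return merges, seqs


correct = encode("他在家")
typo = encode("她再家")  # homophone typos of the first two characters
merges, segmented = learn_merges([correct], num_merges=3)
```

In this sketch `correct` and `typo` are the same string, since the homophone pairs share one transliteration, so any segmentation learned over the encoded text treats them identically. Each merge shortens the symbol sequence, which is the mechanism behind the shorter tokenized inputs the abstract reports.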

Article at MIT Press
Presented at ACL 2023