Collective Human Opinions in Semantic Textual Similarity

Wang Yuxia; Shimin Tao; Ning Xie; Hao Yang; Karin Verspoor; Timothy Baldwin

Vol. 11 (2023)

TACL approved

Published 2023-08-17

Wang Yuxia
The University of Melbourne

Shimin Tao
Huawei Translation Services Center

Ning Xie
Huawei Translation Services Center

Hao Yang
Huawei Translation Services Center

Karin Verspoor
The University of Melbourne; RMIT University

Timothy Baldwin
The University of Melbourne; Mohamed bin Zayed University of Artificial Intelligence

Abstract

Despite the subjective nature of semantic textual similarity (STS) and pervasive disagreements in STS annotation, existing benchmarks have used averaged human ratings as the gold standard. Averaging masks the true distribution of human opinions on examples of low agreement, and prevents models from capturing the semantic vagueness that the individual ratings represent. In this work, we introduce USTS, the first Uncertainty-aware STS dataset, with ∼15,000 Chinese sentence pairs and 150,000 labels, to study collective human opinions in STS. Analysis reveals that neither a scalar nor a single Gaussian fits a set of observed judgements adequately. We further show that current STS models cannot capture the variance caused by human disagreement on individual instances, but rather reflect the predictive confidence over the aggregate dataset.
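The point about averaging can be illustrated with a minimal sketch. The ratings below are invented for illustration and are not drawn from USTS; the 0-5 scale, the ten-annotator setup, and the helper names `summarize`, `gaussian_mass`, and `empirical_mass` are all assumptions. Two hypothetical sentence pairs receive the same mean rating, yet one reflects consensus and the other a split between annotators, and a single Gaussian fitted to the split ratings places probability where no annotator actually voted.

```python
from statistics import NormalDist, mean, pstdev

# Hypothetical ratings from ten annotators on an assumed 0-5 similarity scale.
# These values are invented for illustration; they are not USTS data.
consensus = [2, 2, 3, 3, 2, 3, 3, 2, 3, 2]  # annotators broadly agree
split = [0, 0, 1, 0, 1, 5, 4, 5, 5, 4]      # annotators fall into two camps


def summarize(ratings):
    """Return the mean and population standard deviation of the ratings."""
    return mean(ratings), pstdev(ratings)


def gaussian_mass(ratings, low, high):
    """Probability a single Gaussian fitted to the ratings assigns to [low, high]."""
    mu, sigma = summarize(ratings)
    fitted = NormalDist(mu, sigma)
    return fitted.cdf(high) - fitted.cdf(low)


def empirical_mass(ratings, low, high):
    """Fraction of observed ratings that actually fall in [low, high]."""
    return sum(low <= r <= high for r in ratings) / len(ratings)


if __name__ == "__main__":
    for name, ratings in (("consensus", consensus), ("split", split)):
        mu, sigma = summarize(ratings)
        print(
            f"{name}: mean={mu:.2f} std={sigma:.2f} "
            f"gaussian[2,3]={gaussian_mass(ratings, 2, 3):.2f} "
            f"empirical[2,3]={empirical_mass(ratings, 2, 3):.2f}"
        )
```

Both pairs collapse to the same scalar label under averaging, so a scalar cannot tell them apart. The standard deviation does separate them, but for the split pair the fitted Gaussian still centres its mass on the mean, a region that none of the hypothetical annotators chose.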

Article at MIT Press

Presented at ACL 2023