
Addressing the Binning Problem in Calibration Assessment through Scalar Annotations

Abstract

Computational linguistics models commonly target the prediction of discrete (categorical) labels. When assessing how well-calibrated these model predictions are, popular evaluation schemes require practitioners to manually determine a binning scheme: grouping labels into bins to approximate the true label posterior. The problem is that these metrics are sensitive to binning decisions. We consider two solutions to the binning problem that apply at the stage of data annotation: collecting either distributed (redundant) labels or direct scalar value assignments.
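To illustrate the binning sensitivity the abstract refers to, below is a minimal sketch (not taken from the paper) of the standard binned Expected Calibration Error. The data and bin counts are purely illustrative; the point is only that the reported score moves when nothing but the binning scheme changes.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins):
    """Binned ECE: frequency-weighted mean |accuracy - confidence| over bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Toy, roughly-calibrated predictions: only the bin count varies below.
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=2000)
correct = rng.random(2000) < conf
for n_bins in (5, 10, 15, 20):
    print(n_bins, round(expected_calibration_error(conf, correct, n_bins), 4))
```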

In this paper, we show that although both approaches address the binning problem by enabling instance-level calibration assessment, direct scalar assignment is significantly more cost-effective. We provide theoretical analysis and empirical evidence to support our proposal that dataset creators adopt scalar annotation protocols, enabling higher-quality assessment of model calibration.
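As a rough sketch of why scalar annotations sidestep binning, one can compare a model's confidence to a per-instance scalar human judgment directly, with no binning scheme to choose. The function name, the absolute-error distance, and the toy data below are assumptions for illustration; the paper's actual estimator may differ.

```python
import numpy as np

def binning_free_calibration_error(model_probs, scalar_annotations):
    """Instance-level gap between model confidence and a scalar human
    plausibility judgment in [0, 1]; no bins are required.
    (Illustrative distance only, not the paper's exact metric.)"""
    model_probs = np.asarray(model_probs, dtype=float)
    scalar_annotations = np.asarray(scalar_annotations, dtype=float)
    return float(np.mean(np.abs(model_probs - scalar_annotations)))

# Hypothetical data: one model confidence and one scalar judgment per instance.
print(binning_free_calibration_error([0.9, 0.6, 0.2], [0.8, 0.7, 0.1]))
```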

Article at MIT Press

Author Biography

Zhengping Jiang

I’m a second-year Ph.D. student at JHU-CLSP advised by Prof. Benjamin Van Durme and Prof. Anqi Liu. Before that, I received my bachelor’s degree in computer science and Chinese language and literature (double major) from Peking University and my master’s degree in computer science from Columbia University. I’m an NLP enthusiast with a primary interest in building better-calibrated NLP models and utilizing model uncertainty estimation.