
A Closer Look at Classification Evaluation Metrics and a Critical Reflection of Common Evaluation Practice in NLP

Abstract

Classification systems are evaluated in countless papers. Yet, although evaluation seems a simple task, we find that evaluation practice is often nebulous. For instance, many papers and shared tasks pick so-called `macro' metrics to rank systems (e.g., `macro F1'), but the choice is often not backed by sound arguments, and the use of the term `macro' tends to be blurry in general. This is problematic, since the choice of metric can affect a paper's findings as well as shared task rankings, and can result in perceived or real unfairness.
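To make the distinction concrete, here is a minimal sketch (not taken from the article) that contrasts `macro' and `micro' averaging of F1 on a hypothetical three-class toy example, using scikit-learn's f1_score; the labels are invented for illustration only.

```python
# Illustrative only: 'macro' vs. 'micro' F1 on a toy 3-class problem.
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]  # hypothetical gold labels
y_pred = [0, 0, 1, 2, 1, 1, 2, 2, 2, 0]  # hypothetical system predictions

# 'macro': unweighted mean of per-class F1 scores (every class counts equally).
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
# 'micro': F1 from pooled counts (dominated by the frequent classes).
print("micro F1:", f1_score(y_true, y_pred, average="micro"))
```

Because the macro variant weights all classes equally while the micro variant is driven by the frequent classes, the two can rank the same systems differently, which is why an unargued choice between them matters.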

Starting from the basic concepts of bias and prevalence, we analyze popular metrics and reveal their hidden properties. Equipped with the understanding gained from this analysis, we then reflect on the metric selection practices of recent NLP shared tasks. Surprisingly, we find that in many cases not a single argument is given for choosing a particular metric as the ranking criterion. This can clearly cast doubt on winner selection and rankings, with possible ramifications for researchers' careers. To alleviate this issue, we crystallize arguments for metric selection and provide guidance for increasing fairness and transparency in classification evaluation.
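As a rough illustration of the two starting concepts (the article's exact definitions may differ in detail), one common reading is: a class's prevalence is its share of the gold labels, while a classifier's bias toward a class is the share of predictions it assigns to that class. Both can be read off a confusion matrix, as in this sketch with an invented matrix C[gold, pred]:

```python
# Hedged illustration with a made-up confusion matrix C[gold, pred];
# prevalence = share of gold labels, bias = share of predicted labels.
import numpy as np

C = np.array([[4, 1, 0],   # rows: gold class, columns: predicted class
              [1, 3, 1],
              [0, 2, 8]])

n = C.sum()
prevalence = C.sum(axis=1) / n   # how often each class occurs in the gold data
bias = C.sum(axis=0) / n         # how often the system predicts each class

for k, (p, b) in enumerate(zip(prevalence, bias)):
    print(f"class {k}: prevalence={p:.2f}, bias={b:.2f}")
```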

Article at MIT Press