State of What Art? A Call for Multi-Prompt LLM Evaluation

Moran Mizrahi; Guy Kaplan; Dan Malkin; Rotem Dror; Dafna Shahaf; Gabriel Stanovsky

Vol. 12 (2024)

TACL approved

Published 2024-08-23

Hebrew University of Jerusalem

Abstract

Recent advances in LLMs have led to an abundance of evaluation benchmarks, which typically rely on a single instruction template per task. We create a large-scale collection of instruction paraphrases and comprehensively analyze the brittleness introduced by single-prompt evaluations across 6.5M instances, involving 20 different LLMs and 39 tasks from 3 benchmarks. We find that different instruction templates lead to very different performance, both absolute and relative. Instead of single-prompt evaluation, we propose a set of diverse metrics computed over multiple instruction paraphrases, each tailored to a different use case (e.g., LLM vs. downstream development), to give a more reliable and meaningful assessment of LLM capabilities. We show that our metrics provide new insights into the strengths and limitations of current LLMs.
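
To make the distinction between absolute and relative performance concrete, the following minimal sketch compares model rankings obtained from individual instruction paraphrases with rankings obtained by aggregating over all paraphrases. Everything in it is an assumption made for illustration: the model names, the accuracy values (invented toy numbers, not results from the paper), the helper functions, and the choice of mean and max as aggregates, which stand in for whatever metrics a given use case calls for.

```python
from statistics import mean

# Hypothetical toy accuracies (invented for illustration only):
# scores[model][i] is the accuracy of `model` on one task when prompted
# with the i-th paraphrase of that task's instruction.
scores = {
    "model_a": [0.62, 0.48, 0.55],
    "model_b": [0.58, 0.57, 0.56],
}


def ranking(per_model_score):
    """Return model names sorted from highest to lowest score."""
    return sorted(per_model_score, key=per_model_score.get, reverse=True)


def single_prompt_ranking(scores, prompt_idx):
    """Rank models using only one instruction paraphrase."""
    return ranking({m: s[prompt_idx] for m, s in scores.items()})


def aggregate_ranking(scores, agg):
    """Rank models by an aggregate (e.g. mean or max) over all paraphrases."""
    return ranking({m: agg(s) for m, s in scores.items()})


if __name__ == "__main__":
    for i in range(3):
        print(f"paraphrase {i}:", single_prompt_ranking(scores, i))
    print("mean over paraphrases:", aggregate_ranking(scores, mean))
    print("max over paraphrases:", aggregate_ranking(scores, max))
```

In this toy data the ranking flips between the first and second paraphrase, and the mean and max aggregates also disagree, which is the kind of sensitivity that reporting a single template would hide.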

Presented at ACL 2024. Article at MIT Press.