Abstract
As language models (LMs) become capable of handling a wide range of tasks, their evaluation is becoming as challenging as their development. Most generation benchmarks currently assess LMs using abstract evaluation criteria (like helpfulness and harmlessness), which often lack the flexibility and granularity of human assessment. Additionally, these benchmarks tend to focus disproportionately on specific capabilities such as instruction following, leading to coverage bias. To overcome these limitations, we introduce the BIGGEN BENCH, a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks. A key feature of the BIGGEN BENCH is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation. We apply this benchmark to assess 103 frontier LMs using five evaluator LMs. Our code, data, and evaluation results are all publicly available.
| Original language | English |
|---|---|
| Title of host publication | Long Papers |
| Editors | Luis Chiruzzo, Alan Ritter, Lu Wang |
| Publisher | Association for Computational Linguistics (ACL) |
| Pages | 5877-5919 |
| Number of pages | 43 |
| ISBN (Electronic) | 9798891761896 |
| DOIs | |
| State | Published - 2025 |
| Event | 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2025 - Hybrid, Albuquerque, United States. Duration: 29 Apr 2025 → 4 May 2025 |
Publication series
| Name | Proceedings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies: Long Papers, NAACL-HLT 2025 |
|---|---|
| Volume | 1 |
Conference
| Conference | 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2025 |
|---|---|
| Country/Territory | United States |
| City | Hybrid, Albuquerque |
| Period | 29/04/25 → 4/05/25 |
Fingerprint
Dive into the research topics of 'The BIGGEN BENCH: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models'. Together they form a unique fingerprint.