Abstract
This paper demonstrates the benefits and fruitfulness of evaluating LLMs with multiple problems at once, a paradigm we call multi-problem evaluation (MPE). Unlike conventional single-problem evaluation, where a prompt presents a single problem and expects one specific answer, MPE places multiple problems together in a single prompt and assesses how well an LLM answers all of them in a single output. Leveraging 6 existing classification benchmarks and 12 existing reasoning benchmarks, we introduce a new benchmark called ZeMPE (Zero-shot Multi-Problem Evaluation), comprising 53,100 zero-shot multi-problem prompts. We experiment with a total of 13 LLMs from 5 model families on ZeMPE to present a comprehensive and systematic MPE. Our results show that LLMs can handle multiple problems from a single data source as well as they handle them separately, but there are conditions under which this multi-problem handling capability falls short. In addition, we perform further in-depth analyses and explore model-level factors that may enable multi-problem handling in LLMs. We release our corpus and code to facilitate future research.
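For concreteness, the sketch below shows one way a single-source multi-problem prompt could be assembled and its output scored. The prompt wording, helper names (`build_multi_problem_prompt`, `parse_answers`), and answer format are illustrative assumptions, not the paper's exact templates.

```python
# Minimal sketch (not the paper's exact templates): build a zero-shot
# multi-problem prompt by numbering several benchmark problems under one
# shared instruction, then parse one answer per problem from the output.
from typing import List

def build_multi_problem_prompt(problems: List[str], instruction: str) -> str:
    """Concatenate problems from one benchmark into a single zero-shot prompt."""
    numbered = "\n".join(f"{i + 1}. {p}" for i, p in enumerate(problems))
    return (
        f"{instruction}\n\n"
        f"{numbered}\n\n"
        "Answer each question in order, one answer per line, "
        "formatted as '<question number>: <answer>'."
    )

def parse_answers(output: str, n_problems: int) -> List[str]:
    """Read back one answer per line; missing answers are left empty (counted wrong)."""
    answers = {}
    for line in output.strip().splitlines():
        if ":" in line:
            idx, ans = line.split(":", 1)
            if idx.strip().isdigit():
                answers[int(idx.strip())] = ans.strip()
    return [answers.get(i + 1, "") for i in range(n_problems)]
```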
Main Results
For classification tasks, LLMs are generally robust when asked to solve many items in one prompt. Aggregated over models and benchmarks, SingleClf reaches 75.5% while BatchClf reaches 72.3% (a 3.2-point drop). However, reformulating the same underlying problems as index-selection tasks causes large, consistent declines: SelectOne trails BatchClf by about 32 points and SelectAll by about 10 points.
Average accuracy of 7 LLMs across four classification-related tasks and task sizes in MPE.
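To illustrate the four task formats compared above, here is a minimal sketch using sentiment classification as a stand-in benchmark. The prompt wording and the precise SelectOne/SelectAll phrasing are illustrative assumptions and may differ from the paper's templates; the point is the contrast between classifying items directly and selecting item indices for a given label.

```python
# Illustrative sketch of the four classification-related task formats.
texts = ["Great movie!", "Terrible plot.", "Loved the acting."]
numbered = "\n".join(f"{i + 1}. {t}" for i, t in enumerate(texts))

# SingleClf: one prompt per item.
single_clf = [f"Classify the sentiment (positive/negative): {t}" for t in texts]

# BatchClf: all items classified in one prompt.
batch_clf = ("Classify the sentiment of each text as positive or negative:\n"
             + numbered)

# SelectOne: index selection, one index expected for the queried label.
select_one = "Which text (give a single index) is negative?\n" + numbered

# SelectAll: index selection, all matching indices expected.
select_all = "List the indices of all texts that are positive:\n" + numbered
```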
Multi-problem prompting can substantially reduce inference costs. At operating points that preserve at least 95% of single-problem accuracy, reported token-cost savings range from 30.7% to 82.0% across model-benchmark pairs. The efficiency gain also shows up as a lower cost/accuracy ratio as task size grows, with only a few long-context outliers.
Cost/accuracy ratio (lower is better) for SingleClf versus BatchClf across six benchmarks.
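To make the cost argument concrete, a rough back-of-the-envelope model of how batching amortizes the shared instruction tokens is sketched below. All token counts are placeholder assumptions, not measurements from the paper.

```python
# Rough sketch of why batching reduces per-problem token cost: the shared
# instruction is paid once per prompt instead of once per problem.
# All token counts below are illustrative placeholders.
def per_item_tokens(n_problems: int, instruction_tokens: int = 60,
                    problem_tokens: int = 40, answer_tokens: int = 5) -> float:
    """Approximate prompt + completion tokens per problem for a batch of size n."""
    total = instruction_tokens + n_problems * (problem_tokens + answer_tokens)
    return total / n_problems

single = per_item_tokens(1)        # 105 tokens per problem
batch_100 = per_item_tokens(100)   # ~45.6 tokens per problem
savings = 1 - batch_100 / single   # roughly 57% fewer tokens per problem
print(f"token savings: {savings:.1%}")
```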
Conclusion
In this study, we present a comprehensive and systematic MPE of LLMs. We evaluate various LLMs from 4 model families on single-source multi-problem prompts constructed from 6 classification and 12 reasoning benchmarks. In line with previous few-shot results, we confirm that LLMs are competent multi-problem solvers for classification and reasoning under zero-shot settings. Moreover, we find multiple pieces of evidence that validate the strong innate multi-problem handling capabilities of LLMs, such as the similar classification errors LLMs make under single-problem prompting (SPP) and multi-problem prompting (MPP), the lack of obvious positional biases, and the transferability of zero-shot CoT under MPP. Leveraging these capabilities, we show that zero-shot MPP can be cost-efficient.
We identify two conditions under which LLMs show consistent performance declines with MPP: (1) reformulating Batch Classification as index-selection tasks; and (2) mixing reasoning problems from different sources in a multi-problem prompt. Notably, these declines occur even when the number of problems included is small (e.g., <= 5), which is not human-like and may indicate a lack of true understanding. In addition, we explore several model-level factors that may enable MPP and find instruction tuning to be an important factor in enhancing it.
Overall, our experiments demonstrate surprisingly consistent observations across different LLMs and across multi-problem prompts constructed from various benchmarks. This consistency indicates the reliability and fruitfulness of MPE as an evaluation paradigm.
As a result of our study, we create ZeMPE (Zero-shot Multi-Problem Evaluation), a new benchmark comprising 53,100 zero-shot multi-problem prompts, which we release to aid future MPE research.
BibTeX
@inproceedings{wang-etal-2025-evaluating,
title = "Evaluating {LLM}s with Multiple Problems at once",
author = "Wang, Zhengxiang and
Kodner, Jordan and
Rambow, Owen",
editor = "Arviv, Ofir and
Clinciu, Miruna and
Dhole, Kaustubh and
Dror, Rotem and
Gehrmann, Sebastian and
Habba, Eliya and
Itzhak, Itay and
Mille, Simon and
Perlitz, Yotam and
Santus, Enrico and
Sedoc, Jo{\~a}o and
Shmueli Scheuer, Michal and
Stanovsky, Gabriel and
Tafjord, Oyvind",
booktitle = "Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM{\texttwosuperior})",
month = jul,
year = "2025",
address = "Vienna, Austria and virtual meeting",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.gem-1.14/",
pages = "178--199",
ISBN = "979-8-89176-261-9"
}