Abstract
This paper demonstrates the benefits and fruitfulness of evaluating LLMs with multiple problems at once, a paradigm we call multi-problem evaluation (MPE). Unlike conventional single-problem evaluation, where a prompt presents a single problem and expects one specific answer, MPE places multiple problems together in a single prompt and assesses how well an LLM answers all of them in a single output. Leveraging 6 existing classification benchmarks and 12 existing reasoning benchmarks, we introduce a new benchmark called ZeMPE (Zero-shot Multi-Problem Evaluation), comprising 53,100 zero-shot multi-problem prompts. We experiment with a total of 13 LLMs from 5 model families on ZeMPE to present a comprehensive and systematic MPE. Our results show that LLMs can handle multiple problems from a single data source as well as they handle them separately, but there are conditions under which this multi-problem handling capability falls short. In addition, we perform further in-depth analyses and explore model-level factors that may enable multi-problem handling in LLMs. We release our corpus and code to facilitate future research.
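For concreteness, the sketch below shows one way a single-source multi-problem prompt could be assembled and its output scored. The prompt wording, helper names (`build_multi_problem_prompt`, `parse_answers`), and answer format are illustrative assumptions, not the paper's exact templates.

```python
# Minimal sketch (not the paper's exact templates): build a zero-shot
# multi-problem prompt by numbering several benchmark problems under one
# shared instruction, then parse one answer per problem from the output.
from typing import List

def build_multi_problem_prompt(problems: List[str], instruction: str) -> str:
    """Concatenate problems from one benchmark into a single zero-shot prompt."""
    numbered = "\n".join(f"{i + 1}. {p}" for i, p in enumerate(problems))
    return (
        f"{instruction}\n\n"
        f"{numbered}\n\n"
        "Answer each question in order, one answer per line, "
        "formatted as '<question number>: <answer>'."
    )

def parse_answers(output: str, n_problems: int) -> List[str]:
    """Read back one answer per line; missing answers are left empty (counted wrong)."""
    answers = {}
    for line in output.strip().splitlines():
        if ":" in line:
            idx, ans = line.split(":", 1)
            if idx.strip().isdigit():
                answers[int(idx.strip())] = ans.strip()
    return [answers.get(i + 1, "") for i in range(n_problems)]
```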
Main Results
For classification tasks, LLMs are generally robust when asked to solve many items in one prompt. Aggregated over models and benchmarks, SingleClf reaches 75.5% while BatchClf reaches 72.3% (a 3.2-point drop). However, reformulating the same underlying problems as index-selection tasks causes large, consistent declines: SelectOne trails BatchClf by about 32 points and SelectAll by about 10 points.
Average accuracy of 7 LLMs across four classification-related tasks and task sizes in MPE.
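To illustrate the four task formats compared above, here is a minimal sketch using sentiment classification as a stand-in benchmark. The prompt wording and the precise SelectOne/SelectAll phrasing are illustrative assumptions and may differ from the paper's templates; the point is the contrast between classifying items directly and selecting item indices for a given label.

```python
# Illustrative sketch of the four classification-related task formats.
texts = ["Great movie!", "Terrible plot.", "Loved the acting."]
numbered = "\n".join(f"{i + 1}. {t}" for i, t in enumerate(texts))

# SingleClf: one prompt per item.
single_clf = [f"Classify the sentiment (positive/negative): {t}" for t in texts]

# BatchClf: all items classified in one prompt.
batch_clf = ("Classify the sentiment of each text as positive or negative:\n"
             + numbered)

# SelectOne: index selection, one index expected for the queried label.
select_one = "Which text (give a single index) is negative?\n" + numbered

# SelectAll: index selection, all matching indices expected.
select_all = "List the indices of all texts that are positive:\n" + numbered
```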
Multi-problem prompting can substantially reduce inference costs. At operating points that preserve at least 95% of single-problem accuracy, reported token-cost savings range from 30.7% to 82.0% across model-benchmark pairs. The efficiency gain also shows up as a lower cost/accuracy ratio as task size grows, with only a few long-context outliers.
Cost/accuracy ratio (lower is better) for SingleClf versus BatchClf across six benchmarks.
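To make the cost argument concrete, a rough back-of-the-envelope model of how batching amortizes the shared instruction tokens is sketched below. All token counts are placeholder assumptions, not measurements from the paper.

```python
# Rough sketch of why batching reduces per-problem token cost: the shared
# instruction is paid once per prompt instead of once per problem.
# All token counts below are illustrative placeholders.
def per_item_tokens(n_problems: int, instruction_tokens: int = 60,
                    problem_tokens: int = 40, answer_tokens: int = 5) -> float:
    """Approximate prompt + completion tokens per problem for a batch of size n."""
    total = instruction_tokens + n_problems * (problem_tokens + answer_tokens)
    return total / n_problems

single = per_item_tokens(1)        # 105 tokens per problem
batch_100 = per_item_tokens(100)   # ~45.6 tokens per problem
savings = 1 - batch_100 / single   # roughly 57% fewer tokens per problem
print(f"token savings: {savings:.1%}")
```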
Conclusion
In this study, we present a comprehensive and systematic MPE of LLMs. We evaluate various LLMs from 4 model families on single-source multi-problem prompts constructed from 6 classification and 12 reasoning benchmarks. In line with previous few-shot results, we confirm that LLMs are competent multi-problem solvers for classification and reasoning under zero-shot settings. Moreover, we find multiple pieces of evidence that validate the strong innate multi-problem handling capabilities of LLMs, such as the similar classification errors LLMs make under single-problem prompting (SPP) and multi-problem prompting (MPP), the lack of obvious positional biases, and the transferability of zero-shot CoT under MPP. Leveraging these capabilities, we show that zero-shot MPP can be cost-efficient.
We identify two conditions under which LLMs show consistent performance declines with MPP: (1) reformulating Batch Classification as index-selection tasks; and (2) mixing reasoning problems from different sources in a multi-problem prompt. Notably, these declines occur even when the number of problems included is small (e.g., <= 5), which is not human-like and may indicate a lack of true understanding. In addition, we explore several model-level factors that may enable MPP and find instruction tuning to be an important factor in enhancing it.
Overall, our experiments demonstrate surprisingly consistent observations across different LLMs and across multi-problem prompts constructed from various benchmarks. This consistency indicates the reliability and fruitfulness of MPE as an evaluation paradigm.
As a result of our study, we create ZeMPE (Zero-shot Multi-Problem Evaluation), a new benchmark comprising 53,100 zero-shot multi-problem prompts, which we release to aid future MPE research.
BibTeX
@inproceedings{wang-etal-2025-evaluating,
title = "Evaluating {LLM}s with Multiple Problems at once",
author = "Wang, Zhengxiang and
Kodner, Jordan and
Rambow, Owen",
editor = "Arviv, Ofir and
Clinciu, Miruna and
Dhole, Kaustubh and
Dror, Rotem and
Gehrmann, Sebastian and
Habba, Eliya and
Itzhak, Itay and
Mille, Simon and
Perlitz, Yotam and
Santus, Enrico and
Sedoc, Jo{\~a}o and
Shmueli Scheuer, Michal and
Stanovsky, Gabriel and
Tafjord, Oyvind",
booktitle = "Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM{\texttwosuperior})",
month = jul,
year = "2025",
address = "Vienna, Austria and virtual meeting",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.gem-1.14/",
pages = "178--199",
ISBN = "979-8-89176-261-9"
}