Evaluating LLMs with Multiple Problems at once

GEM 2025

Abstract

This paper shows the benefits and fruitfulness of evaluating LLMs with multiple problems at once, a paradigm we call multi-problem evaluation (MPE). Unlike conventional single-problem evaluation, where a prompt presents a single problem and expects one specific answer, MPE places multiple problems together in a single prompt and assesses how well an LLM answers all of them in a single output. Leveraging 6 existing classification benchmarks and 12 existing reasoning benchmarks, we introduce a new benchmark called ZeMPE (Zero-shot Multi-Problem Evaluation), comprising 53,100 zero-shot multi-problem prompts. We experiment with a total of 13 LLMs from 5 model families on ZeMPE to present a comprehensive and systematic MPE. Our results show that LLMs can handle multiple problems from a single data source about as well as they handle them separately, but there are conditions under which this multi-problem handling capability falls short. In addition, we perform in-depth further analyses and explore model-level factors that may enable multi-problem handling capabilities in LLMs. We release our corpus and code to facilitate future research.
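To make the contrast with single-problem evaluation concrete, the sketch below shows one way a zero-shot multi-problem prompt could be assembled from individual problems and how per-problem answers could be read back from a single output. The numbered-list template, the helper names, and the answer format are illustrative assumptions for this page, not the exact prompt templates used in the paper.

```python
# Illustrative only: one simple way to turn N single-problem items into one
# multi-problem prompt (MPP) and to recover per-problem answers afterwards.
import re


def build_multi_problem_prompt(instruction: str, problems: list[str]) -> str:
    """Concatenate N problems under one shared instruction (zero-shot MPP)."""
    numbered = "\n".join(f"{i + 1}. {p}" for i, p in enumerate(problems))
    return (
        f"{instruction}\n\n"
        f"{numbered}\n\n"
        "Answer each problem in order, one line per problem, "
        "formatted as '<index>. <answer>'."
    )


def parse_answers(output: str, n_problems: int) -> dict[int, str]:
    """Map each problem index to the answer the model gave it (if any)."""
    answers: dict[int, str] = {}
    for line in output.splitlines():
        match = re.match(r"\s*(\d+)[.)]\s*(.+)", line)
        if match and 1 <= int(match.group(1)) <= n_problems:
            answers[int(match.group(1))] = match.group(2).strip()
    return answers
```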

Main Results

Conclusion

In this study, we present a comprehensive and systematic MPE of LLMs. We evaluate various LLMs from 5 model families on single-source multi-problem prompts constructed from 6 classification and 12 reasoning benchmarks. In line with previous few-shot results, we confirm that LLMs are competent multi-problem solvers for classification and reasoning under zero-shot settings. Moreover, we find multiple pieces of evidence that validate the strong innate multi-problem handling capabilities of LLMs, such as the similar classification errors LLMs make under single-problem prompting (SPP) and multi-problem prompting (MPP), the lack of obvious positional biases, and the transferability of zero-shot-CoT under MPP. Leveraging these strong multi-problem handling capabilities, we show that zero-shot MPP can be cost-efficient.
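One intuition for the cost efficiency, under the simplifying assumption that the savings come mainly from stating the shared task instruction once rather than once per problem (a back-of-the-envelope illustration, not the paper's cost accounting):

```python
# Illustrative input-token comparison between SPP and MPP, assuming a shared
# task instruction of `instruction_toks` tokens and problems of `problem_toks`
# tokens each. Output tokens and API pricing are ignored for simplicity.
def input_tokens_spp(n: int, instruction_toks: int, problem_toks: int) -> int:
    return n * (instruction_toks + problem_toks)  # instruction repeated N times


def input_tokens_mpp(n: int, instruction_toks: int, problem_toks: int) -> int:
    return instruction_toks + n * problem_toks  # instruction stated once


# e.g., 100 problems, a 150-token instruction, 50-token problems:
# SPP: 100 * (150 + 50) = 20,000 input tokens; MPP: 150 + 100 * 50 = 5,150.
```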

We identify two conditions under which LLMs show consistent performance declines with MPP: (1) reformulating Batch Classification as index selection tasks (see the sketch below); and (2) mixing reasoning problems from different sources in a multi-problem prompt. Notably, these performance declines happen even when the number of problems included is rather small (e.g., ≤ 5), behavior that may not be human-like and suggests a lack of true understanding. In addition, we explore several model-level factors that may enable MPP and find instruction tuning to be an important factor that enhances MPP.
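For concreteness, the sketch below contrasts plain batch classification with an index-selection reformulation of the same items. The example reviews and prompt wording are illustrative assumptions rather than the paper's actual templates.

```python
# Illustrative contrast (assumed templates, not the paper's exact wording):
# batch classification asks for a label per item; the index-selection variant
# asks which item indices satisfy a given label.
reviews = ["Great movie!", "Terrible plot.", "Loved every minute."]
numbered = "\n".join(f"{i + 1}. {r}" for i, r in enumerate(reviews))

batch_classification_prompt = (
    "Classify the sentiment of each review as positive or negative.\n"
    + numbered
    + "\nAnswer one label per line as '<index>. <label>'."
)

index_selection_prompt = (
    "List the indices of all reviews below that are positive.\n"
    + numbered
    + "\nAnswer with a comma-separated list of indices."
)
```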

Overall, our experiments demonstrate surprisingly consistent observations across different LLMs and across multi-problem prompts constructed from various benchmarks. This consistency indicates the reliability and fruitfulness of MPE as an evaluation paradigm.

As a result of our study, we create a new benchmark comprising 53,100 zero-shot multi-problem prompts. We call it ZeMPE, which stands for Zero-shot Multi-Problem Evaluation. We release ZeMPE to aid future MPE research.

BibTeX

@inproceedings{wang-etal-2025-evaluating,
    title = "Evaluating {LLM}s with Multiple Problems at once",
    author = "Wang, Zhengxiang  and
      Kodner, Jordan  and
      Rambow, Owen",
    editor = "Arviv, Ofir  and
      Clinciu, Miruna  and
      Dhole, Kaustubh  and
      Dror, Rotem  and
      Gehrmann, Sebastian  and
      Habba, Eliya  and
      Itzhak, Itay  and
      Mille, Simon  and
      Perlitz, Yotam  and
      Santus, Enrico  and
      Sedoc, Jo{\~a}o  and
      Shmueli Scheuer, Michal  and
      Stanovsky, Gabriel  and
      Tafjord, Oyvind",
    booktitle = "Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM{\texttwosuperior})",
    month = jul,
    year = "2025",
    address = "Vienna, Austria and virtual meeting",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.gem-1.14/",
    pages = "178--199",
    ISBN = "979-8-89176-261-9"
}