Abstract
We introduce Time Puzzles, a constraint-based date inference task for evaluating iterative temporal reasoning. Each puzzle combines factual temporal anchors with (cross-cultural) calendar relations, admits one or multiple valid solution dates, and is algorithmically generated for controlled, dynamic, and continual evaluation. Across 13 diverse LLMs, Time Puzzles clearly separates models by their iterative temporal reasoning capabilities and remains challenging without tools: GPT-5 reaches only 49.3% accuracy and all other models stay below 31%, despite the dataset's simplicity. Web search consistently yields substantial gains, while a code interpreter shows mixed effects; yet all models perform much better when constraints are rewritten with explicit dates, revealing a gap in reliable tool use. Overall, Time Puzzles presents a simple, cost-effective diagnostic for tool-augmented iterative temporal reasoning.
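To make the task format concrete, the following is a minimal sketch of a puzzle treated as a set of predicates over candidate dates and solved by enumeration. It is not the authors' generator: the `solve` helper, the search window, and the four constraints are hypothetical and are not drawn from the dataset. The constraints are also written in explicit form; an implicit puzzle would replace something like the year with a factual anchor that must be resolved first.

```python
from datetime import date, timedelta

def solve(constraints, start, end):
    """Return every date in [start, end] that satisfies all constraints."""
    solutions = []
    day = start
    while day <= end:
        if all(check(day) for check in constraints):
            solutions.append(day)
        day += timedelta(days=1)
    return solutions

# Hypothetical constraints, for illustration only (not from the dataset).
constraints = [
    lambda d: d.year == 2024,    # the year is 2024
    lambda d: d.month == 2,      # the month is February
    lambda d: d.weekday() == 3,  # the day is a Thursday (Monday is 0)
    lambda d: d.day > 28,        # the day falls after the 28th
]

# Hypothetical search window; a puzzle may admit one or several dates.
solutions = solve(constraints, date(2020, 1, 1), date(2030, 12, 31))
print(solutions)
```

Because every candidate date can be checked mechanically against the constraints, a proposed answer is cheap to verify, which is consistent with the easy-to-verify property described above.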
Main Results
Time Puzzles remains difficult for current tool-less LLMs despite its low-cost, easy-to-verify construction: GPT-5 reaches 49.3% exact match and all other evaluated models stay below 31%. The figure shows exact-match trends across solution counts: web search consistently improves performance but does not close the gap between implicit constraints and their explicit-date counterparts.
Average exact match accuracy across solution counts, with and without web search (solution counts 1, 3, 5 for web-search conditions).
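As a rough illustration of the metric, the sketch below scores a prediction as correct only when the predicted set of dates equals the gold set. This set-equality reading of exact match is an assumption, and the function names and sample data are hypothetical rather than taken from the authors' evaluation code.

```python
from datetime import date

def exact_match(predicted, gold):
    """Assumed definition: correct only if both date sets are identical."""
    return set(predicted) == set(gold)

def exact_match_accuracy(predictions, golds):
    """Fraction of puzzles whose predicted dates exactly match the gold dates."""
    if not golds:
        return 0.0
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return hits / len(golds)

# Hypothetical data: one single-solution puzzle and one two-solution puzzle.
golds = [
    [date(2024, 2, 29)],
    [date(2021, 1, 1), date(2022, 1, 1)],
]
predictions = [
    [date(2024, 2, 29)],  # matches the gold set
    [date(2021, 1, 1)],   # misses one valid date, so it counts as wrong
]

accuracy = exact_match_accuracy(predictions, golds)
print(accuracy)
```

Under this assumed definition, a partially correct answer to a multi-solution puzzle earns no credit, so puzzles with more valid dates leave more ways to fail.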
The Code Interpreter analysis on single-solution puzzles shows mixed effects: enabling Code Interpreter degrades GPT-5 on implicit constraints (with and without web search), while GPT-4.1 benefits from it, especially with web search. Performance on explicit constraints remains mostly stable, and output-token reductions are more consistent for GPT-5.
Exact-match accuracy and output tokens for GPT-5 and GPT-4.1 under different conditions (+Web, +CI) on single-solution puzzles.
Conclusion
We propose Time Puzzles, a constraint-based date inference task that targets iterative, tool-augmented temporal reasoning. Although the puzzles are synthetically generated and easy to verify, they expose persistent failures of current instruction- and reasoning-tuned LLMs to reliably resolve implicit temporal constraints, even with tool access. Overall, Time Puzzles offers a simple, cost-effective diagnostic for tool-augmented iterative temporal reasoning and can be systematically extended to support more challenging evaluations.
BibTeX
@article{wang2026timepuzzles,
title={Measuring Iterative Temporal Reasoning with Time Puzzles},
author={Wang, Zhengxiang and Dong, Zeyu},
journal={arXiv preprint arXiv:2601.07148},
year={2026},
url={https://arxiv.org/abs/2601.07148}
}