Abstract
This paper explores whether LLMs can perform multi-dimensional analytic writing assessments, that is, provide both scores and comments across multiple criteria. Using a corpus of 141 literature reviews written by L2 graduate students and assessed by human experts on nine analytic criteria, the study prompts several popular LLMs under different interaction settings. To evaluate feedback-comment quality, it applies a problem-focused evaluation framework (ProEval) designed to be interpretable, scalable, and reproducible. Overall, the paper finds that LLMs can generate reasonably good and generally reliable multi-dimensional analytic assessments.
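As a rough illustration of what such a multi-dimensional assessment request can look like, here is a minimal Python sketch that asks a chat-completions model for a score and a short comment per criterion. The criteria subset, score range, prompt wording, and model name are illustrative assumptions, not the paper's actual prompts, rubrics, or interaction settings.

from openai import OpenAI

# Illustrative subset of criteria; the paper assesses nine analytic criteria
# for L2 graduate-level literature reviews (exact names and rubrics are in the paper).
CRITERIA = ["material selection", "citation/integration", "grammar",
            "academic vocabulary", "connector use"]

PROMPT_TEMPLATE = """You are an experienced assessor of academic English writing.
For the literature review below, give a score from 1 to 5 and a brief comment
for each criterion: {criteria}.
Return one line per criterion in the form: <criterion> | <score> | <comment>.

Literature review:
{text}
"""

def assess(text: str, model: str = "gpt-4o") -> str:
    # Single-turn request for scores plus comments across all criteria.
    # Assumes OPENAI_API_KEY is set in the environment.
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(
                       criteria=", ".join(CRITERIA), text=text)}],
        temperature=0,
    )
    return response.choices[0].message.content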
Main Results
Overall score-agreement heatmaps show a clear pattern: humans agree more with other humans, and LLMs agree more with other LLMs. Still, the paper reports that LLMs can score roughly in line with humans, with strong human-LLM adjacent agreement in several settings (best AAR1 between 0.59 and 0.88).
Figure: Overall QWK and AAR1 agreement among human and LLM assessors.
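For readers who want to reproduce this kind of agreement analysis on their own scores, the following Python sketch computes quadratically weighted kappa (QWK) via scikit-learn and an adjacent agreement rate. AAR1 is read here as the share of essays on which two assessors' scores differ by at most one point, the usual sense of the abbreviation; the paper's exact definition should be checked against the original.

import numpy as np
from sklearn.metrics import cohen_kappa_score

def qwk(scores_a, scores_b):
    # Quadratically weighted kappa between two assessors' integer scores.
    return cohen_kappa_score(scores_a, scores_b, weights="quadratic")

def aar1(scores_a, scores_b):
    # Adjacent agreement: share of essays on which the two scores
    # differ by at most one point (assumed reading of "AAR1").
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    return float(np.mean(np.abs(a - b) <= 1))

# Toy example with one human and one LLM assessor over five essays.
human = [4, 3, 5, 2, 4]
llm = [4, 4, 4, 2, 5]
print(round(qwk(human, llm), 2), aar1(human, llm))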
At the criterion level, agreement varies: LLM scores align better with humans on some criteria (for example material selection, citation/integration, grammar, and academic vocabulary) and worse on others such as connector use. The paper also finds that the interaction mode affects comment quality, with IM3 tending to produce more specific and potentially more helpful comments.
Figure: Criterion-level AAR1 between human-average scores and each assessor.
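A per-criterion breakdown like the one in the figure can be sketched with pandas, assuming a hypothetical long-format score table with one row per (essay, criterion, assessor); the column names and scores below are made up for illustration.

import pandas as pd

# Hypothetical long-format score table: one row per (essay, criterion, assessor).
df = pd.DataFrame({
    "essay":     [1, 1, 1, 1, 2, 2, 2, 2],
    "criterion": ["grammar", "grammar", "connector use", "connector use"] * 2,
    "assessor":  ["human_avg", "llm", "human_avg", "llm"] * 2,
    "score":     [4.0, 4, 3.5, 2, 3.0, 3, 4.0, 5],
})

# One column per assessor, indexed by (criterion, essay).
wide = df.pivot_table(index=["criterion", "essay"],
                      columns="assessor", values="score")

# Criterion-level AAR1: share of essays on which the LLM's score lies
# within one point of the human-average score.
aar1_by_criterion = (
    ((wide["llm"] - wide["human_avg"]).abs() <= 1)
    .groupby(level="criterion")
    .mean()
)
print(aar1_by_criterion)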
Conclusion
The study concludes that LLMs can generate reasonably good and generally reliable multi-dimensional analytic assessments for graduate-level academic English writing. It highlights practical pedagogical potential for both L2 learners and instructors, and introduces ProEval as a time- and cost-efficient, interpretable, and reproducible framework for feedback-comment analysis. The released corpus and code support future work on deeper human-versus-LLM feedback characterization and stronger comparative metrics for comment quality.
BibTeX
@inproceedings{wang-etal-2025-llms-perform,
title = "{LLM}s can Perform Multi-Dimensional Analytic Writing Assessments: A Case Study of {L}2 Graduate-Level Academic {E}nglish Writing",
author = "Wang, Zhengxiang and
Makarova, Veronika and
Li, Zhi and
Kodner, Jordan and
Rambow, Owen",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.acl-long.423/",
doi = "10.18653/v1/2025.acl-long.423",
pages = "8637--8663",
ISBN = "979-8-89176-251-0",
}