Abstract
As large language models (LLMs) become increasingly integrated into personal writing tools, a critical question arises: can LLMs faithfully imitate an individual's writing style from just a few examples? Personal style is often subtle and implicit, making it difficult to specify through prompts yet essential for user-aligned generation. This work presents a comprehensive evaluation of state-of-the-art LLMs' ability to mimic personal writing styles via in-context learning from a small number of user-authored samples. We introduce an ensemble of complementary metrics, including authorship attribution, authorship verification, style matching, and AI detection, to robustly assess style imitation. Our evaluation spans over 40,000 generations per model across domains such as news, email, forums, and blogs, covering writing samples from more than 400 real-world authors. Results show that while LLMs can approximate user styles in structured formats like news and email, they struggle with nuanced, informal writing in blogs and forums. Further analysis of prompting strategies, such as the number of demonstrations, reveals key limitations in effective personalization. Our findings highlight a fundamental gap in personalized LLM adaptation and the need for improved techniques to support implicit, style-consistent generation. To aid future research and reproducibility, we open-source our data and code.
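As a concrete illustration of the few-shot in-context setup evaluated here, the sketch below assembles a style-imitation prompt from a handful of user-authored samples. The template wording and the `build_style_prompt` helper are illustrative assumptions, not the exact prompts used in the paper.

```python
# A minimal sketch (hypothetical prompt template) of few-shot style imitation:
# a few user-authored samples serve as implicit style demonstrations before
# the model is asked to write new text on a given topic.
def build_style_prompt(author_samples: list[str], task: str) -> str:
    demos = "\n\n".join(
        f"Example {i + 1}:\n{text}" for i, text in enumerate(author_samples)
    )
    return (
        "Here are writing samples from one author:\n\n"
        f"{demos}\n\n"
        "Write the following in the same author's style, without referring "
        f"to the examples:\n{task}"
    )

# Usage (illustrative):
# prompt = build_style_prompt(user_samples[:5], "Write a short blog post about travel.")
```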
Main Results
Across authors, few-shot prompting consistently improves authorship attribution over zero-shot prompting, but performance still varies by dataset and author, indicating that implicit personal style imitation remains uneven and domain-sensitive.
Distribution of per-author authorship attribution (AA) accuracy averaged across all LLMs. Few-shot prompting achieves higher per-author accuracy than zero-shot.
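For reference, a minimal sketch of how per-author AA accuracy can be tallied; the `per_author_aa_accuracy` helper and its record format are assumptions for illustration, not the paper's released evaluation code. A generation counts as correct when the AA classifier attributes it to the author it was meant to imitate.

```python
# Per-author authorship-attribution (AA) accuracy from (target, predicted) pairs.
from collections import defaultdict

def per_author_aa_accuracy(records):
    """records: iterable of (target_author, predicted_author) pairs."""
    correct, total = defaultdict(int), defaultdict(int)
    for target, predicted in records:
        total[target] += 1
        correct[target] += int(predicted == target)
    return {author: correct[author] / total[author] for author in total}

# Per-author accuracies can then be averaged across LLMs to obtain the
# distribution summarized above.
print(per_author_aa_accuracy([("a1", "a1"), ("a1", "a7"), ("a2", "a2")]))
```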
Style-model analysis complements the AA and authorship verification (AV) results: generated texts move closer to target author style distributions under few-shot prompting, but the remaining distance, especially on informal domains, reflects a persistent personalization gap.
Distribution of average Mahalanobis distances to target author style models on representative datasets (CCAT50 and Blog); lower is better.
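A minimal sketch of the style-model scoring summarized above, assuming stylometric feature vectors have already been extracted per text: each target author's style model is a mean and covariance fit on their human-written texts, and generations are scored by Mahalanobis distance to it. The helper names and the covariance regularization are illustrative choices, not necessarily the paper's exact implementation.

```python
import numpy as np

def fit_author_style_model(feature_matrix: np.ndarray):
    """feature_matrix: (n_texts, n_features) stylometric features of one author."""
    mean = feature_matrix.mean(axis=0)
    cov = np.cov(feature_matrix, rowvar=False)
    # Regularize so the covariance stays invertible with few samples (assumed choice).
    cov += 1e-6 * np.eye(cov.shape[0])
    return mean, np.linalg.inv(cov)

def mahalanobis_distance(x: np.ndarray, mean: np.ndarray, inv_cov: np.ndarray) -> float:
    diff = x - mean
    return float(np.sqrt(diff @ inv_cov @ diff))

def avg_distance(generated_features: np.ndarray, mean: np.ndarray, inv_cov: np.ndarray) -> float:
    """Average distance of an LLM's generations to the target author's style model."""
    return float(np.mean([mahalanobis_distance(g, mean, inv_cov) for g in generated_features]))
```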
Conclusion
This paper presents a comprehensive evaluation of state-of-the-art LLMs on their ability to mimic the implicit writing styles of everyday users through few-shot in-context learning. By combining authorship attribution, verification, stylometric modeling, and AI generation detection across four diverse datasets, we provide strong empirical evidence that despite improvements from exemplar-based prompting, current LLMs still struggle to reproduce nuanced personal styles, especially in informal and stylistically diverse domains. Our analysis further shows that prompt design choices, such as length alignment and content similarity, moderately affect stylistic fidelity but do not close the personalization gap. These findings highlight fundamental limitations in the stylistic adaptability of LLMs and suggest that achieving truly personalized generation remains an open challenge. Future work should explore richer personalization signals and hybrid prompting and/or finetuning strategies to better capture the subtleties of individual writing styles in real-world settings.
BibTeX
@inproceedings{wang-etal-2025-catch,
title = "Catch Me If You Can? Not Yet: {LLM}s Still Struggle to Imitate the Implicit Writing Styles of Everyday Authors",
author = "Wang, Zhengxiang and
Tripto, Nafis Irtiza and
Park, Solha and
Li, Zhenzhen and
Zhou, Jiawei",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-emnlp.532/",
doi = "10.18653/v1/2025.findings-emnlp.532",
pages = "10040--10055",
ISBN = "979-8-89176-335-7"
}