Clustering Document Parts: Detecting and Characterizing Influence Campaigns from Documents

Wang, Zhengxiang; Rambow, Owen

NLP+CSS @ NAACL 2024

Abstract

We propose a novel clustering pipeline to detect and characterize influence campaigns from documents. This approach clusters parts of documents, detects clusters that likely reflect an influence campaign, and then identifies documents linked to an influence campaign via their association with the high-influence clusters. Our approach outperforms both direct document-level classification and direct document-level clustering in predicting whether a document is part of an influence campaign. We propose various novel techniques to enhance our pipeline, including using an existing event factuality prediction system to obtain document parts, and aggregating multiple clustering experiments to improve the performance of both cluster and document classification. Classifying documents after clustering not only accurately extracts the parts of the documents that are relevant to influence campaigns, but also captures influence campaigns as a coordinated and holistic phenomenon. Our approach makes possible more fine-grained and interpretable characterizations of influence campaigns from documents.

Main Results

The pipeline improves document-level campaign detection substantially over the baselines. With XGBoost and sentence-level clustering plus aggregation, the best reported F1 reaches 77.8 (precision 86.5, recall 70.7), compared with 50.7 F1 for the direct document classification baseline and 43.8 F1 for the document-level clustering baseline.
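As a quick consistency check on the numbers above, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the reported precision and recall; the function name `f1_score` is ours and is not taken from the paper's code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (inputs and output share one scale)."""
    if precision + recall == 0:
        return 0.0  # convention: F1 is 0 when both inputs are 0
    return 2 * precision * recall / (precision + recall)


# Reported precision and recall for the best configuration, on a 0-100 scale.
best_f1 = f1_score(86.5, 70.7)
print(round(best_f1, 1))  # 77.8
```

The recomputed value matches the reported best F1 of 77.8.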

Overview of the clustering-based influence-campaign detection pipeline: clustering document parts, classifying clusters, and projecting results to documents.
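The last pipeline step, projecting cluster labels back to documents, can be illustrated with a minimal sketch. It assumes that clustering and cluster classification have already happened, and it uses a simple rule that is our assumption rather than the paper's stated criterion: a document is flagged if at least one of its parts falls into a high-influence cluster. All identifiers and the toy data are hypothetical.

```python
def project_to_documents(part_to_doc, part_to_cluster, high_influence_clusters):
    """Map each document to the list of its parts that sit in high-influence clusters."""
    evidence = {doc_id: [] for doc_id in set(part_to_doc.values())}
    for part_id, cluster_id in part_to_cluster.items():
        if cluster_id in high_influence_clusters:
            evidence[part_to_doc[part_id]].append(part_id)
    return evidence


# Toy data (hypothetical): four document parts drawn from three documents.
part_to_doc = {"p1": "docA", "p2": "docA", "p3": "docB", "p4": "docC"}
# Output of some clustering step: part id -> cluster id.
part_to_cluster = {"p1": 0, "p2": 1, "p3": 1, "p4": 2}
# Stand-in for the cluster classifier's output: clusters judged high-influence.
high_influence_clusters = {0}

evidence = project_to_documents(part_to_doc, part_to_cluster, high_influence_clusters)
# Assumed decision rule: any supporting part is enough to flag the document.
predicted = {doc_id: bool(parts) for doc_id, parts in evidence.items()}
```

Keeping the supporting parts per document, and not only the final label, is what lets a prediction be traced back to the specific document parts behind it.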

Aggregating high-influence clusters across many clustering configurations consistently improves F1 versus averaging single clustering runs. The gain is strongest when combined with XGBoost cluster classification, yielding higher recall while preserving strong precision in the final document predictions.

Aggregation versus no aggregation with XGBoost for high-influence cluster classification (performance curves for both settings).
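One plausible reading of aggregation at prediction time is sketched below: high-influence clusters from several clustering runs are pooled, and a document is flagged if any run places one of its parts in a high-influence cluster. The union rule, the function name `aggregate_runs`, and the toy runs are assumptions for illustration; the source describes aggregation only as using many clustering runs over the same dataset.

```python
def aggregate_runs(runs, part_to_doc):
    """Return the set of documents flagged by at least one clustering run.

    Each run is a pair (part_to_cluster, high_influence_clusters).
    """
    flagged = set()
    for part_to_cluster, high_clusters in runs:
        for part_id, cluster_id in part_to_cluster.items():
            if cluster_id in high_clusters:
                flagged.add(part_to_doc[part_id])
    return flagged


# Toy data (hypothetical): the same three parts clustered under two configurations.
part_to_doc = {"p1": "docA", "p2": "docB", "p3": "docC"}
run_1 = ({"p1": 0, "p2": 1, "p3": 1}, {0})  # only cluster 0 is high-influence
run_2 = ({"p1": 0, "p2": 0, "p3": 1}, {0})  # here p2 lands in cluster 0 as well

single_run_docs = aggregate_runs([run_1], part_to_doc)
aggregated_docs = aggregate_runs([run_1, run_2], part_to_doc)
```

Under this union rule, adding runs can only add flagged documents, which is consistent with the higher recall reported for aggregation; whether precision holds depends on the quality of the cluster classifier.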

Conclusion

We have presented a new approach to finding influence campaigns, which relies on four core features: (1) we cluster parts of documents; (2) we classify clusters of parts of documents using non-lexical features; (3) we relate the classification result back to documents; (4) we use cluster aggregation, that is, many clustering runs over the same dataset, to augment training data for the cluster classifier. The resulting classification not only gives a predicted label for each document (part of an influence campaign or not), but also shows which parts of the document are responsible for this classification. We believe that our general approach can benefit other document classification tasks, including detecting scientific influence in published papers, or themes in literature.

There are several avenues for possible future work, and we list three below. (1) Datasets. Given the increasing importance of detecting influence campaigns, we hope there will be more datasets annotated for influence campaigns at the document-collection level. (2) Incorporating non-textual information. Our current pipeline is a text-only system. Leveraging non-textual information, such as social interactions and the authors' past activities, may help us create a more sophisticated and comprehensive system (e.g., using graph neural networks) that detects influence campaigns more accurately and reliably. However, such work is not possible without good datasets. (3) Automatic characterization of influence campaigns. Our work captures influence campaigns through the high-influence clusters, which may contain a large number of semantically related document parts, possibly with noise. To fully make sense of these clusters, we need automatic ways of characterizing them in a fine-grained and interpretable manner aligned with downstream needs. Our preliminary experiments suggest that LLMs may be a viable option.

BibTeX

@inproceedings{wang-rambow-2024-clustering,
    title = "Clustering Document Parts: Detecting and Characterizing Influence Campaigns from Documents",
    author = "Wang, Zhengxiang  and
      Rambow, Owen",
    editor = "Card, Dallas  and
      Field, Anjalie  and
      Hovy, Dirk  and
      Keith, Katherine",
    booktitle = "Proceedings of the Sixth Workshop on Natural Language Processing and Computational Social Science (NLP+CSS 2024)",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.nlpcss-1.10/",
    doi = "10.18653/v1/2024.nlpcss-1.10",
    pages = "132--143"
}
