Explaining Large Language Models for Passage-Level Political Statement Extraction Using Linguistic Rule-Based Models
Ph.D. thesis in NLP by Juan-Francisco (Paco) Reyes
Download the preview of the thesis (PDF) »
The automatic extraction of coherent political statements at the passage level is a timely task in an era of increasing digitization of political discourse (Atteveldt et al., 2017; Lippi & Torroni, 2016; Won et al., 2019). The complexity of political language presents a challenge due to its ambiguity and nuance. However, the growing volume of digital political content on the World Wide Web (WWW) and the demonstrated effectiveness of large language models (LLMs) in detecting and extracting such statements present an opportunity to explore new approaches. This task has recently gained significant attention in journalism, political science, linguistics, and other research fields.
Given the intricate nature of political language, explainable artificial intelligence (XAI) for LLMs is essential to ensure transparent, interpretable, and reliable extractions. In this context, this thesis explores tools and methods for explaining LLM predictions from a linguistic perspective. The research addresses the problem of accurately extracting "relevant" political statements by operationalizing relevance as the expression of a clear stance toward a political issue. A prototype extraction system was initially developed using traditional information extraction (IE) techniques, relying on rules and symbolic knowledge representations. To address the limitations of these methods, the backbone of the proposed extraction system was then designed: a natural language processing (NLP) pipeline that integrates state-of-the-art LLMs.
What distinguishes this research is its systematic approach to enhancing explainability by combining linguistic rule-based models (LRBMs) with contemporary explainability tools such as SHAP (SHapley Additive exPlanations) (Nohara et al., 2019) and Transformers Interpret (Pierse, 2024). By leveraging corpus linguistics, tailored lexicons, and lexicogrammatical rules, the research adds transparency to LLMs' decision-making processes, bridging traditional linguistic theories and modern neural models. Combining an emphasis on performance in the NLP modeling process with this linguistics-aware approach improved both the accuracy and the explanatory capacity of the developed models.
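To illustrate the kind of token-level attribution such tools provide, the following is a minimal sketch using Transformers Interpret with a generic Hugging Face checkpoint; the model name and the example sentence are placeholders, not the fine-tuned models developed in the thesis.

```python
# Minimal token-attribution sketch with Transformers Interpret.
# The checkpoint below is a generic placeholder, not a thesis model.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers_interpret import SequenceClassificationExplainer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # placeholder classifier
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

explainer = SequenceClassificationExplainer(model, tokenizer)

# Returns (token, attribution score) pairs for the predicted class.
attributions = explainer("We must repeal this disastrous policy immediately.")
for token, score in attributions:
    print(f"{token:>15s}  {score:+.3f}")
```

Positive scores mark tokens that pushed the model toward its prediction; aligning such scores with lexicogrammatical categories is what makes the attributions linguistically interpretable.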
The empirical approach of this research is structured around three specific NLP challenges, each corresponding to a dedicated article: (1) Text Genre Classification, employing a Bidirectional Encoder Representations from Transformers (BERT) model (Devlin et al., 2019) to distinguish between speeches and interviews based on the number of speakers, thereby addressing text genre identification and filtering; (2) Stance Classification, utilizing a few-shot learning model, Sentence Transformer Fine-Tuning (SetFit) (Tunstall et al., 2022), for binary classification of stance expressions, enhancing the detection of relevant political statements; and (3) Topic Continuity Analysis, using Sentence Pair Modeling (SPM) with BERT to analyze topic continuity between sentence pairs, ensuring the coherence of extracted passages.
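As a concrete illustration of challenge (2), the sketch below shows few-shot binary stance classification with SetFit, using the SetFitTrainer API from the library's original release; the base checkpoint, the handful of training examples, and the labels are invented placeholders, not the thesis dataset.

```python
# Few-shot binary stance classification with SetFit (sketch).
# Labels: 1 = clear stance expressed, 0 = no clear stance. Examples are invented.
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer

train_dataset = Dataset.from_dict({
    "text": [
        "We will never support this reckless spending bill.",
        "I stand firmly behind the new immigration reform.",
        "The hearing is scheduled for next Tuesday.",
        "Thank you all for coming this evening.",
    ],
    "label": [1, 1, 0, 0],
})

model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
trainer = SetFitTrainer(model=model, train_dataset=train_dataset, num_iterations=20)
trainer.train()

# Predict stance labels for unseen sentences.
preds = model(["Congress must reject this treaty outright."])
print(preds)
```

SetFit first fine-tunes the sentence embeddings contrastively and then fits a lightweight classification head, which is why a few labeled examples per class can suffice.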
For each of these challenges, ad hoc datasets were created, focused on American political discourse in English. LRBMs played a crucial role in feature engineering and dataset annotation, improving both the performance of the models and the quality of the explanations produced by the explainability tools. The fusion of linguistic insight with neural model outputs allowed for granular token-level analysis of morphosyntactic, semantic, and (sometimes) pragmatic features, offering a novel contribution by aligning these outputs with established linguistic theories. This thesis advances the understanding of LLM explainability by approaching the extraction of relevant political statements from a linguistically aware point of view.
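To give a flavor of how an LRBM rule can feed feature engineering and annotation, the sketch below encodes one hypothetical lexicogrammatical pattern (a first-person subject followed by a stance verb) with spaCy's Matcher; the rule name and the small stance-verb lexicon are illustrative, not the rules developed in the thesis.

```python
# One illustrative lexicogrammatical rule: first-person subject + stance verb.
# The stance-verb lexicon here is a toy placeholder.
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

pattern = [
    {"LOWER": {"IN": ["i", "we"]}},
    {"POS": "VERB", "LEMMA": {"IN": ["support", "oppose", "reject", "endorse"]}},
]
matcher.add("FIRST_PERSON_STANCE", [pattern])

doc = nlp("We oppose this amendment because it weakens oversight.")
for match_id, start, end in matcher(doc):
    span = doc[start:end]
    # A match can become a binary feature or an annotation cue for the dataset.
    print(nlp.vocab.strings[match_id], "->", span.text)
```

Rules of this kind make the annotation criteria explicit and auditable, which is precisely what lets the token-level attributions of the neural models be checked against linguistic expectations.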
Keywords: large language models, information extraction, NLP, political statements, political discourse analysis, XAI.
References
Atteveldt, W., Sheafer, T., Shenhav, S., & Fogel-Dror, Y. (2017). Clause Analysis: Using Syntactic Information to Automatically Extract Source, Subject, and Predicate from Texts with an Application to the 2008–2009 Gaza War. Political Analysis, 25(2), 207–222. https://doi.org/10.1017/pan.2016.12.
Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171–4186). https://doi.org/10.18653/v1/N19-1423.
Lippi, M., & Torroni, P. (2016). Argument Mining from Speech: Detecting Claims in Political Debates. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) (pp. 2979–2985). https://doi.org/10.1609/aaai.v30i1.10384.
Nohara, Y., Matsumoto, K., Soejima, H., & Nakashima, N. (2019). Explanation of Machine Learning Models Using Improved Shapley Additive Explanation. In Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. https://doi.org/10.1145/3307339.3343255.
Pierse, C. D. (2024). Transformers Interpret. GitHub repository. https://github.com/cdpierse/transformers-interpret.
Tunstall, L., Reimers, N., Jo, U. E. S., Bates, L., Korat, D., Wasserblat, M., & Pereg, O. (2022). Efficient Few-Shot Learning Without Prompts. arXiv. https://arxiv.org/abs/2209.11055.
Won, M., Martins, B., & Raimundo, F. (2019). Automatic Extraction of Relevant Keyphrases for the Study of Issue Competition, 648–669. https://doi.org/10.29007/MMK4.