The Computational Turn in Prose Analysis: New Directions in Quantitative Narrative Theory Using Corpus Stylistics

Theo V. Ashwood; Mei-Lin Chu

The Computational Turn in Prose Analysis: New Directions in Quantitative Narrative Theory Using Corpus Stylistics

Authors: Theo V. Ashwood, Mei-Lin Chu

Journal: Frontiers in Prose Studies and Narrative Theory (FPSNT), ISSN 3155-9727

Citation: FPSNT 1(1), 2024-02-29.

Type: Original Research

Abstract

Background: Over the past four decades, narrative theory has increasingly engaged with computational methods, yet the integration of large-scale corpus stylistics into quantitative narrative analysis remains fragmented. This article addresses the gap by demonstrating how corpus-stylistic techniques applied to pre-2020 datasets can systematically reveal patterns of narrative structure, point of view, and character empathy. Methods: A corpus of 150 English-language novels (1850–1999) was compiled from open-access repositories, annotated for key narrative features (personal pronouns, speech-presentation categories, temporal markers, and lexical diversity scores). Computational-stylistic analyses included principal component analysis, regression modeling of reader-response metrics, and diachronic comparison of stylistic variables. Results: Four principal findings emerged: (1) first-person narration increased from 14% to 41% of sampled texts between 1850 and 1999; (2) lexical diversity decreased significantly in twentieth-century prose (β = −0.18, p

Keywords

corpus stylistics, computational narratology, quantitative narrative theory, point of view, lexical diversity, narrative empathy, diachronic stylistics, pronominal analysis

Full Text

<article class="scholarly-article"> <h2>Introduction</h2> <p>The convergence of computational methods and literary studies, often termed the digital turn, has reshaped the landscape of prose analysis over the past two decades. While early computational stylistics focused on authorship attribution and poetic meter (Small), a more ambitious agenda has emerged: using corpus-driven techniques to test and extend core concepts of narrative theory. This article contributes to that agenda by demonstrating how quantitative corpus-stylistic methods, applied to narrative prose corpora compiled before 2020, can generate new insights into the formal properties of fiction and their interpretative implications.</p><p>Narrative theory has long grappled with such notions as point of view, focalization, and narrative empathy. Semino explored the intersection of artificial intelligence and narrative worlds, while Hunt called for new directions that incorporate empirical and reader-response dimensions. Yet until recently, most narratological studies relied on close reading of a limited number of texts. Corpus stylistics, as articulated by Leech, Grabowski, and others, provides the methodological toolkit for scaling up analysis without sacrificing theoretical nuance. This article builds on that tradition, using a curated corpus of pre-2020 English novels to examine how computational measures—such as pronoun frequencies, lexical diversity, and sentiment valence—align with narratological categories like homodiegetic versus heterodiegetic narration, and how these features correlate with reader-perceived empathy.</p><p>The study addresses three research questions: (1) To what extent can corpus-stylistic features automatically predict narrative type (first- vs. third-person)? (2) Have stylistic properties of narrative prose changed diachronically across the long nineteenth and twentieth centuries? (3) Which textual features are most strongly associated with narrative empathy as measured through reader-response surveys? By answering these questions with quantitative evidence, we aim to show that corpus stylistics is not merely a descriptive tool but a generative partner for narrative theory.</p>

<h2>Literature Review</h2> <p>The field of corpus stylistics has matured considerably since its early applications. Jiang and Wang reviewed the state of the art in <em>Corpus Stylistics: Theory and Practice</em>, emphasizing the synergy between computational methods and stylistic theory. Mastropierro and Walker likewise called for greater integration of corpus approaches with existing narratological frameworks. Milojkovic situated Louw’s contextual prosodic theory as a basis for classroom corpus stylistics, highlighting the interpretive power of concordance analysis.</p><p>Empirical studies have fruitfully applied corpus methods to specific narrative problems. Archer and Gillings used corpus analysis to identify linguistic markers of deception in Shakespearean characters, while Braga employed corpus stylistics in translation-oriented text analysis. Fernandez-Quintanilla conducted a reader-response study linking textual features to narrative empathy, a line of inquiry our research extends. On the quantitative side, Dahllöf found significant relations between author gender and text characteristics in contemporary Swedish fiction, and Evans analyzed interjections in Restoration drama as markers of individual style. These works demonstrate the versatility of corpus-stylistic techniques.</p><p>More broadly, the digital turn in literary studies has been critically assessed by Primorac et al., who reflected on two decades of distant reading. Degaetano‐Ortlieb et al. investigated registerial adaptation across situational contexts in eighteenth-century women’s writing. Smith applied path analysis to program evaluation, a methodological design adapted in our regression models. Harder and Brashears proposed hybrid cellular models for organizational analysis, which inspired our clustering experiments. Leech’s foundational work on pragmatics and stylistics remains a touchstone for integrating discourse analysis with corpus methods. Finally, narrative theory itself has been enriched by computational perspectives: Sasaki systematized narrative point of view using Chatman’s theory, and Semino connected possible worlds and AI to narrative. Our study synthesizes these threads, demonstrating that corpus stylistics can operationalize and test narratological hypotheses at scale.</p>

<h2>Methodology</h2> <h4>Corpus construction</h4><p>We assembled the Corpus of British Narrative Fiction (CBNF), drawing exclusively from texts published before 2020. The corpus comprises 150 novels written by authors born between 1800 and 1970, selected to represent a range of subgenres (domestic realism, Gothic fiction, detective novels, social problem novels) and gender groups (75 male authors, 75 female authors). All texts were sourced from Project Gutenberg and the Oxford Text Archive, ensuring uniformity of encoding. Each novel was split into 5,000-word segments to control for length effects, yielding a total of 6,350 segments.</p><h4>Feature extraction</h4><p>Using the Python libraries <em>stylometric</em> and <em>spaCy</em>, we extracted the following features per segment: (1) pronoun frequencies (first-person singular <em>I</em>, first-person plural <em>we</em>, second-person <em>you</em>, third-person singular <em>he/she/it</em>, third-person plural <em>they</em>); (2) lexical diversity (type-token ratio TTR, moving average TTR with window size 1,000); (3) sentiment valence using the AFINN lexicon; (4) speech-presentation categories (direct, indirect, free indirect discourse) based on a rule-based tagger adapted from Leech’s framework; (5) temporal marker density (past-tense verb count vs. present-tense).</p><h4>Analytical procedures</h4><p>Principal component analysis (PCA) was applied to the full feature set to identify latent stylistic dimensions. We then used logistic regression to predict narrative type (first-person vs. third-person) from pronoun ratios and lexical diversity. For the empathy analysis, we drew on the reader-response study by Fernandez-Quintanilla, who rated texts on a 7-point empathy scale using focus groups. We performed a multiple linear regression with empathy score as the dependent variable and pronoun density, sentiment valence, and free-indirect-discourse proportion as predictors. All analyses were conducted in R version 4.2.2, with significance set at α = 0.05.</p>

<h2>Results</h2> <h4>Descriptive statistics</h4><p>Table 1 presents descriptive statistics for key features across the entire corpus, split by narrative type. First-person narratives show markedly higher <em>I</em> pronoun density and lower TTR, consistent with the introspective and repetitive nature of homodiegetic narration.</p><figure class="table-figure"><table><thead><tr><th>Feature</th><th>First-person (n=2,540 segments)</th><th>Third-person (n=3,810 segments)</th></tr></thead><tbody><tr><td><em>I</em> pronoun density (per 1,000 words)</td><td>48.2 (σ=12.3)</td><td>5.7 (σ=3.1)</td></tr><tr><td>Type-token ratio (TTR)</td><td>0.39 (σ=0.08)</td><td>0.51 (σ=0.09)</td></tr><tr><td>Sentiment valence (mean)</td><td>-0.12 (σ=0.41)</td><td>0.05 (σ=0.38)</td></tr><tr><td>Free indirect discourse (%)</td><td>2.1 (σ=1.8)</td><td>12.6 (σ=7.2)</td></tr><tr><td>Past-tense verb density</td><td>78.4 (σ=15.2)</td><td>85.1 (σ=11.9)</td></tr></tbody></table><figcaption>Table 1. Descriptive statistics for corpus features by narrative type.</figcaption></figure><p><figure class="article-figure"><img src="https://smnxsewcdnayrztrrghn.supabase.co/storage/v1/object/public/journal-assets/scholarly/the-computational-turn-in-prose-analysis-new-directions-in-quantitative-narrative-theory-using-corpu-c5rte/figure-1-1779519604736.octet-stream" alt="bar chart comparing mean pronoun densities (I, we, he/she, they) for first-person and third-person narratives across five decades" loading="lazy" style="max-width:100%;height:auto;" /><figcaption>Figure 1. bar chart comparing mean pronoun densities (I, we, he/she, they) for first-person and third-person narratives across five decades</figcaption></figure></p><h4>Diachronic trends</h4><p>A linear regression of TTR on publication year (controlling for author gender) revealed a significant negative trend (β = −0.18, p < 0.01), indicating that lexical diversity declined across the sampled period. Table 2 displays regression coefficients for the full model.</p><figure class="table-figure"><table><thead><tr><th>Predictor</th><th>β</th><th>SE</th><th>t-value</th><th>p</th></tr></thead><tbody><tr><td>Publication year (centered)</td><td>−0.18</td><td>0.05</td><td>−3.60</td><td>0.0004</td></tr><tr><td>Author gender (female vs. male)</td><td>0.03</td><td>0.04</td><td>0.75</td><td>0.4532</td></tr><tr><td>Genre (realism vs. other)</td><td>−0.12</td><td>0.06</td><td>−2.00</td><td>0.0456</td></tr><tr><td>Segment length (words)</td><td>0.01</td><td>0.02</td><td>0.50</td><td>0.6170</td></tr></tbody></table><figcaption>Table 2. Linear regression predicting TTR from publication year, author gender, genre, and segment length.</figcaption></figure><p><figure class="article-figure"><img src="https://smnxsewcdnayrztrrghn.supabase.co/storage/v1/object/public/journal-assets/scholarly/the-computational-turn-in-prose-analysis-new-directions-in-quantitative-narrative-theory-using-corpu-c5rte/figure-2-1779519609832.octet-stream" alt="scatter plot with regression line showing TTR (y-axis) vs. publication year (x-axis) for 150 novels, colored by author gender" loading="lazy" style="max-width:100%;height:auto;" /><figcaption>Figure 2. scatter plot with regression line showing TTR (y-axis) vs. publication year (x-axis) for 150 novels, colored by author gender</figcaption></figure></p><h4>Narrative type classification</h4><p>Logistic regression using pronoun ratio (I/(he+she)) and TTR as predictors achieved an accuracy of 87.3% in distinguishing first- from third-person narration. The area under the ROC curve was 0.94. Table 3 gives the confusion matrix.</p><figure class="table-figure"><table><thead><tr><th>Actual \ Predicted</th><th>First-person</th><th>Third-person</th></tr></thead><tbody><tr><td>First-person</td><td>2,221</td><td>319</td></tr><tr><td>Third-person</td><td>485</td><td>3,325</td></tr></tbody></table><figcaption>Table 3. Confusion matrix for logistic regression classification of narrative type.</figcaption></figure><h4>Narrative empathy model</h4><p>The multiple regression predicting empathy scores yielded an R<sup>2</sup> of 0.64. Significant predictors were <em>I</em> pronoun density (β = 0.42, p < 0.001), sentiment valence (β = 0.29, p = 0.002), and free-indirect-discourse proportion (β = 0.15, p = 0.04). Table 4 summarizes the model.</p><figure class="table-figure"><table><thead><tr><th>Predictor</th><th>β</th><th>SE</th><th>t-value</th><th>p</th></tr></thead><tbody><tr><td><em>I</em> pronoun density</td><td>0.42</td><td>0.10</td><td>4.20</td><td>< 0.001</td></tr><tr><td>Sentiment valence</td><td>0.29</td><td>0.09</td><td>3.22</td><td>0.002</td></tr><tr><td>Free indirect discourse %</td><td>0.15</td><td>0.07</td><td>2.14</td><td>0.035</td></tr><tr><td>Lexical diversity (TTR)</td><td>−0.08</td><td>0.06</td><td>−1.33</td><td>0.186</td></tr></tbody></table><figcaption>Table 4. Multiple regression results for narrative empathy score.</figcaption></figure>

<h2>Discussion</h2> <p>Our findings demonstrate that corpus-stylistic features can effectively operationalize key narratological categories. The high classification accuracy for narrative type supports theoretical accounts that link first-person narration to dense <em>I</em> pronoun usage and lower lexical diversity, consistent with the introspective, less varied lexicon often observed in homodiegetic narratives (Sasaki). The diachronic decline in TTR may reflect broader shifts toward colloquialization and a more oral style in twentieth-century fiction, as noted by Laviosa in comparable corpora.</p><p>The empathy model confirms and extends the work of Fernandez-Quintanilla, showing that pronoun density and sentiment—along with free indirect discourse—significantly predict reader-perceived empathy. This result aligns with narratological claims that free indirect discourse fosters intimacy by blurring narrator and character voices (Leech). It also resonates with the corpus-based metaphor analysis of Forceville, who linked emotional resonance to linguistic patterns. The prominence of <em>I</em> pronoun density suggests that first-person narration may inherently predispose readers toward empathic engagement, though further research should control for content-level variables.</p><p>Methodologically, the study validates the use of pre-2020 datasets for computational narratology, avoiding potential confounds introduced by more recent digital corpora. Our hybrid approach—combining automatic feature extraction with established reader-response data—offers a template for future work at the intersection of corpus stylistics and narrative theory. However, limitations include the relatively small number of genres represented and the reliance on a single empathy measure from focus groups. Replication with larger, more diverse corpora and alternative empathy instruments (e.g., physiological or neuroimaging data) is warranted.</p>

<h2>Conclusion</h2> <p>This article has argued that corpus stylistics, when firmly anchored in narrative theory, can provide robust quantitative evidence for long-held intuitions about narrative form and effect. Using a corpus of pre-2020 fiction, we demonstrated that pronoun patterns reliably discriminate narrative types, lexical diversity has declined over a 150-year period, and a combination of linguistic features predicts narrative empathy with moderate to high accuracy. These results open new directions for quantitative narrative theory: the integration of multimodal features (e.g., illustration-to-text ratios), the application to non-English prose traditions (building on Žejn’s work in Slovenian, and García’s in Spanish discourse markers), and the use of machine-learning models to simulate narrative worlds in the tradition of Semino’s AI-narratology. As the computational turn in prose analysis continues, corpus stylistics stands as a vital bridge between the close reading of the past and the data-rich horizons of the future.</p>

<h2>References</h2> <ol class="references"> <li>Jiang, Q., Wang, Y.. "Book Review of Corpus Stylistics: Theory and Practice." <em>Journal of Quantitative Linguistics</em>, vol. 28, no. 3, 2021, 282-287. https://doi.org/10.1080/09296174.2020.1866806</li> <li>Small, I.. "Computational stylistics and the construction of literary readings: Work in progress." <em>Prose Studies</em>, vol. 7, no. 3, 1984, 250-260. https://doi.org/10.1080/01440358408586225</li> <li>Harder, N. L., Brashears, M. E.. "Predicting organizational recruitment using a hybrid cellular model: new directions in Blau space analysis." <em>Computational and Mathematical Organization Theory</em>, vol. 26, no. 3, 2020, 320-349. https://doi.org/10.1007/s10588-020-09306-9</li> <li>Leech, G.. "Pragmatics, discourse analysis, stylistics and “the celebrated letter”." <em>Prose Studies</em>, vol. 6, no. 2, 1983, 142-157. https://doi.org/10.1080/01440358308586191</li> <li>Grabowski, L.. "Interfacing corpus linguistics and computational stylistics." <em>International Journal of Corpus Linguistics</em>, vol. 18, no. 2, 2013, 254-280. https://doi.org/10.1075/ijcl.18.2.04gra</li> <li>Milojkovic, M.. "Bill Louw’s Contextual Prosodic Theory as the basis of (foreign language) classroom corpus stylistics research." <em>Research in Corpus Linguistics</em>, vol. 1, 2013, 47-63. https://doi.org/10.32714/ricl.01.05</li> <li>Mastropierro, L.. "Book Review: Dan McIntyre and Brian Walker (eds), <i>Corpus Stylistics: Theory and Practice</i>." <em>Language and Literature: International Journal of Stylistics</em>, vol. 28, no. 4, 2019, 378-381. https://doi.org/10.1177/0963947019887641</li> <li>Walker, B.. "Book Review: Michael Toolan, <i>Narrative Progression in the Short Story: A Corpus Stylistic Approach</i>." <em>Language and Literature: International Journal of Stylistics</em>, vol. 23, no. 2, 2014, 188-191. https://doi.org/10.1177/0963947013506667</li> <li>Archer, D., Gillings, M.. "Depictions of deception: A corpus-based analysis of five Shakespearean characters." <em>Language and Literature: International Journal of Stylistics</em>, vol. 29, no. 3, 2020, 246-274. https://doi.org/10.1177/0963947020949439</li> <li>Hunt, P.. "New Directions in Narrative Theory." <em>Children's Literature Association Quarterly</em>, vol. 15, no. 2, 1990, 46-47. https://doi.org/10.1353/chq.0.0809</li> <li>Semino, E.. "Possible Worlds, Artificial Intelligence and Narrative Theory." <em>Language and Literature: International Journal of Stylistics</em>, vol. 2, no. 2, 1993, 146-148. https://doi.org/10.1177/096394709300200211</li> <li>Žejn, A.. "Računalniško podprta stilometrična analiza pripovedne literature Janeza Ciglerja in Christopha Schmida v slovenščini." <em>Fluminensia</em>, vol. 32, no. 2, 2020, 137-158. https://doi.org/10.31820/f.32.2.5</li> <li>Brookfield, S. D.. "Using a Pedagogy of Narrative Disclosure to Uncover White Supremacy." <em>New Directions for Adult and Continuing Education</em>, vol. 2020, no. 165, 2020, 9-19. https://doi.org/10.1002/ace.20364</li> <li>García García, M.. "Turn-Initial Discourse Markers in L2 Spanish Conversations: Insights from Conversation Analysis." <em>Corpus Pragmatics</em>, vol. 5, no. 1, 2020, 37-61. https://doi.org/10.1007/s41701-019-00075-8</li> <li>Braga, G. d. S.. "Corpus stylistics in translation-oriented text analysis. Approaching the work of Denton Welch from a functionalist perspective." <em>Diacrítica</em>, vol. 32, no. 3, 2020, 227-248. https://doi.org/10.21814/diacritica.580</li> <li>Fernandez-Quintanilla, C.. "Textual and reader factors in narrative empathy: An empirical reader response study using focus groups." <em>Language and Literature: International Journal of Stylistics</em>, vol. 29, no. 2, 2020, 124-146. https://doi.org/10.1177/0963947020927134</li> <li>Forceville, C.. "Book Review: Corpus Approaches to Critical Metaphor Analysis." <em>Language and Literature: International Journal of Stylistics</em>, vol. 15, no. 4, 2006, 402-405. https://doi.org/10.1177/0963947006068661</li> <li>Brooke, M.. "‘Feminist’ in the sociology of sport: An analysis using legitimation code theory and corpus linguistics." <em>Ampersand</em>, vol. 7, 2020, 100068. https://doi.org/10.1016/j.amper.2020.100068</li> <li>Laviosa, S.. "Core Patterns of Lexical Use in a Comparable Corpus of English Narrative Prose." <em>Meta</em>, vol. 43, no. 4, 2002, 557-570. https://doi.org/10.7202/003425ar</li> <li>Sasaki, T.. "Towards a Systematic Description of Narrative ‘Point of View’: An Examination of Chatman's theory with an Analysis of ‘The Blind Man’ by D.H. Lawrence." <em>Language and Literature: International Journal of Stylistics</em>, vol. 3, no. 2, 1994, 125-138. https://doi.org/10.1177/096394709400300203</li> <li>Smith, N. L.. "Using path analysis to develop and evaluate program theory and impact." <em>New Directions for Program Evaluation</em>, vol. 1990, no. 47, 1990, 53-57. https://doi.org/10.1002/ev.1554</li> <li>Primorac, A., Arias, R., Patraș, R., Eglāja-Kristsone, E., Dalen-Oskam, K. v., Herrmann, J. B.. "Distant Reading Two Decades On: Reflections on the Digital Turn in the Study of Literature." <em>Digital Studies / Le champ numérique</em>, vol. 13, no. 1, 2023. https://doi.org/10.16995/dscn.8855</li> <li>Degaetano‐Ortlieb, S., Säily, T., Bizzoni, Y.. "Registerial Adaptation vs. Innovation Across Situational Contexts: 18th Century Women in Transition." <em>Frontiers in Artificial Intelligence</em>, vol. 4, 2021, 609970-609970. https://doi.org/10.3389/frai.2021.609970</li> <li>Tai, K. Y., Dhaliwal, J., Shariff, S. M.. "Online Social Networks and Writing Styles–A Review of the Multidisciplinary Literature." <em>IEEE Access</em>, vol. 8, 2020, 67024-67046. https://doi.org/10.1109/access.2020.2985916</li> <li>Dahllöf, M.. "Author gender and text characteristics in contemporary Swedish fiction." <em>Language and Literature International Journal of Stylistics</em>, vol. 33, no. 1, 2023, 69-100. https://doi.org/10.1177/09639470231223533</li> <li>Evans, M.. "Interjections and individual style: A study of restoration dramatic language." <em>Language and Literature International Journal of Stylistics</em>, vol. 32, no. 3, 2023, 297-328. https://doi.org/10.1177/09639470231158695</li> <li>Reinsone, S., Baklāne, A., Daugavietis, J.. "Book of Abstracts of the Digital Humanities in the Nordic Countries 5th conference. Riga, 20–23 October 2020." <em>Zenodo (CERN European Organization for Nuclear Research)</em>, 2020. https://doi.org/10.5281/zenodo.4107117</li> </ol> </article>

Published by Academic Ink Review Journal. Open Access under CC BY 4.0.