working paper · v.01 · april 2026 · cluj-napoca · ≈ 6,200 words · cc by 4.0

Persons of the Page.

Literature as a source for synthetic personality.

Liviu Pop
Romanian Academy, Folklore Archive Institute, Cluj-Napoca
Asociația uzinaduzina

Abstract

This paper proposes that the design of synthetic personalities for embodied artificial intelligence should draw, deliberately and as a matter of method, from the corpus of human literary fiction. Current approaches rely either on configurable trait vectors (the OCEAN model and its derivatives) or on system-prompt instructions that produce surface-level character impersonation. Both are inadequate for the deployment context now emerging: customer-facing service robots and humanoids operating at industrial scale, requiring personalities that are simultaneously coherent, distinguishable, governable, and culturally legible.

We argue that the millennia of human work spent constructing fully-realised inner lives in literature, drama, and folk tradition constitute a uniquely dense and validated training resource for synthetic personality, one that complements rather than displaces psychometric trait modelling. We outline a four-stage extraction pipeline from literary character to deployable persona seed, discuss what is gained and what is lost in the translation, and consider the ethical and cultural implications of treating the literary canon as a personality library. The paper concludes with the argument that humanities institutions, particularly those holding folkloric and oral traditions, are positioned to play a generative rather than merely critical role in the next phase of artificial intelligence development.

keywords digital humanities · synthetic persona · literary character · narrative identity · embodied AI · robot personality · computational folkloristics

§ 01 Introduction

The design of personality for artificial agents has, until recently, sat at the periphery of artificial intelligence research. Conversational systems were sufficiently limited that the question of who is talking was reducible to the question of how the system is configured. A handful of demographic markers, a politeness register, perhaps a chosen voice, completed what Reeves and Nass [11] once called the social interface. The interlocutor was understood to be a tool with manners.

The arrival of large language models, and more recently of physically embodied humanoid platforms in commercial pilot, has changed the question. When a humanoid robot spends its working day in a hotel lobby, a hospital corridor, or a retail store, personality ceases to be a finishing touch and becomes a load-bearing element of the design. A robot that fails technically can be repaired; a robot whose personality is brittle, generic, or culturally tone-deaf threatens the brand it represents and the workforce it joins [12].

The dominant industry response has taken two forms. The first is parameterisation: representing personality as a vector of traits along well-validated axes such as the OCEAN model [9], with values configurable by a brand or operator. The second is prompting: producing a paragraph of natural-language description that is fed into a language model as a system instruction, eliciting a target style of speech and behaviour. Neither approach is wrong; neither is sufficient.

Trait vectors differentiate but do not individuate. Two service personas with subtly different OCEAN scores feel, in extended interaction, more similar than different. Prompts produce vivid first impressions but drift, contradict themselves under pressure, and resist auditing. What is needed, and what neither approach delivers, is a mode of personality construction that is generative, distinct, durable, and governable.

We propose that such a mode already exists, distributed across the human literary canon. The argument of this paper, briefly: literature has, over centuries, accumulated a rigorously curated library of synthetic minds. These minds have been refined by author, edited by editor, judged by reader, criticised by scholar, and translated across language and time. They constitute a corpus of fully-realised personalities whose density and variety no synthetic dataset can match. The task of digital humanities, working in concert with applied AI, is to extract from this corpus what can usefully serve as a basis for synthetic personality design, and to do so with sufficient care that the cultural inheritance is honoured rather than mined.

The paper proceeds in five further sections. § 02 reviews the limits of current personality models and identifies the conceptual gap. § 03 develops the central argument that literary characters are best understood not as individual exemplars but as transferable narrative grammars. § 04 sketches a practical pipeline from literary text to deployable persona seed. § 05 considers the ethical and cultural questions raised by this approach. § 06 concludes with implications for humanities institutions and for industry.

§ 02 The limits of trait-based and prompt-based personality

The trait paradigm and its discontents

Trait-based personality models have served social and computational psychology well for decades. The Five-Factor Model [9], often referred to by the OCEAN acronym (openness, conscientiousness, extraversion, agreeableness, neuroticism), provides a parsimonious vocabulary for describing the relatively stable dimensions along which humans differ. The model is empirically robust, cross-culturally validated within limits [10], and computationally tractable. Its translation to synthetic agents has been straightforward: assign baseline values along the five dimensions, and let the underlying generative system produce surface behaviour consistent with those values.

The difficulty is not that this approach is wrong. It is that the approach is underdetermined for the task of producing distinct, recognisable personalities. The OCEAN space is a five-dimensional continuous manifold; even with discretisation, the perceptibly distinct points in that space number, optimistically, in the low thousands. Hofstee, de Raad, and Goldberg [8] demonstrated empirically that human raters lose discriminatory power between profiles whose vector distance falls below a threshold; below that threshold, two profiles are read as the same person.
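The order of magnitude can be checked with simple arithmetic. The sketch below assumes that each of the five axes supports only a handful of perceptibly distinct levels; the level counts, the Euclidean distance measure, the threshold value, and the two hotel profiles are all illustrative assumptions, not measured quantities.

```python
import math

AXES = ("openness", "conscientiousness", "extraversion",
        "agreeableness", "neuroticism")


def distinct_profiles(levels_per_axis: int) -> int:
    """Count grid points when each OCEAN axis has the same number of levels."""
    return levels_per_axis ** len(AXES)


def read_as_same(a: dict, b: dict, threshold: float) -> bool:
    """True when two profiles lie closer than the (hypothetical) threshold."""
    distance = math.sqrt(sum((a[axis] - b[axis]) ** 2 for axis in AXES))
    return distance < threshold


# Assumed level counts: three to five perceptible steps per axis.
for levels in (3, 4, 5):
    print(levels, "levels per axis ->", distinct_profiles(levels), "profiles")

# Two hypothetical hotel configurations that differ slightly on two axes.
hotel_a = dict(zip(AXES, (0.60, 0.80, 0.70, 0.75, 0.20)))
hotel_b = dict(zip(AXES, (0.65, 0.80, 0.70, 0.70, 0.20)))
print(read_as_same(hotel_a, hotel_b, threshold=0.10))
```

With four or five levels per axis the grid holds 1,024 or 3,125 points, which is the low-thousands order of magnitude referred to above. Under the assumed threshold, the two hotel profiles fall inside one another's neighbourhood and would be read as the same persona.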

For the deployment context of customer-facing humanoids, this matters. Two hotels using the same OEM's robots, configured with subtly different OCEAN profiles, will produce experiences that are, to a guest, indistinguishable. Brand differentiation, which is the entire commercial premise of personality customisation, fails at the level of perception.

Furthermore, the trait paradigm is silent on questions that practitioners of personality design encounter daily. What does this persona think about when not addressed? What does it find funny? What does it mean when it says certain words? These are questions about narrative content, not about trait values. The trait model is a coordinate system; it provides no map of the territory.

The prompt paradigm and its instability

The alternative approach, increasingly common in commercial deployments of language-model-driven agents, is to specify personality through a natural-language system prompt. A paragraph or two of description, often supplemented with example dialogues, is given to the language model as fixed context, and the model is asked to play the described character. Done well, this technique can produce remarkably vivid first impressions: the persona feels specific, grounded, idiomatic.

The technique fails, however, on two of the four desiderata identified above, and on a requirement that follows from them. Prompted personalities are not durable; over the course of a long conversation, they drift toward the model's underlying base distribution [1]. They are not governable; an operator who wishes to verify what the persona will and will not say cannot do so by inspecting the prompt, because the relation between prompt content and language model behaviour is indirect and hard to predict. And, as a consequence, they are not auditable in the sense required by emerging regulation, including the transparency obligations of Article 50 of the European AI Act [6], which presuppose that disclosures and certain behaviours are reliable rather than merely likely.

There is a deeper objection. Prompted personalities are thin: they have no inner life beyond what the prompt supplies. They cannot remember, accumulate, change, or be changed safely. The whole apparatus of human personality, with its layered depth from constitutional traits to fluid mood, is collapsed into a single page of text.

The conceptual gap

The gap between these two paradigms is occupied, at present, by neither industry nor academy. Industry's working personalities are too thin; academy's models are too parsimonious. The conceptual question is not how to combine the two more cleverly. It is: where else might rich, validated, culturally-grounded models of personality be found?

We turn to literature.

§ 03 The literary character as transferable structure

What literature provides

The literary character, as understood since at least the realist novel of the nineteenth century, is an extended construction. Authors invest hundreds of pages in giving a single mind shape. Editors prune what does not fit. Readers, over time, judge which characters live and which fade. Critics develop technical vocabulary (free indirect discourse [2], focalisation [7], the unreliable narrator [3]) through which the construction of character can be analysed and compared.

What this provides, for the purposes of synthetic personality design, is a corpus with three properties unmatched by any other source.

First, density. Literary characters are not sketched but built. Don Quixote is not a list of attributes but a continuous unfolding of one mode of being-in-the-world over hundreds of thousands of words. Anna Karenina, Bartleby, Stephen Dedalus: each represents a substantial investment of cognitive labour by its author, refined by a century or more of reception.

Second, diversity. The literary canon, broadly construed to include translation, oral tradition, and minor literatures, contains personalities along axes that synthetic datasets cannot reach. Human cultures have developed forms of personality (the Stoic in classical Roman literature, the Sufi sage in Persian poetry, the trickster in Yoruba and Hausa oral narrative, the holy fool in Russian Orthodox tradition, the picaresque rogue in Spanish literature) that are formally inaccessible to a purely trait-based vocabulary. Each carries a structural signature that no scoring along five OCEAN axes can capture.

Third, evaluation. Unlike synthetic personalities or trait profiles, literary characters have been subject to centuries of critical and reader judgement. Some have been kept; some have been discarded. The survivors are, in a real sense, validated. The question of what makes a character live has been litigated in the literature on literature with a thoroughness that no design field has ever attempted.

Character as narrative grammar

It is tempting to treat individual literary characters as templates: build a synthetic Don Quixote, a synthetic Bartleby, a synthetic Cordelia. This is the wrong move, for two reasons. First, literary characters are inseparable from their context: a Don Quixote outside the Spanish countryside of the early seventeenth century is no Don Quixote. Second, the literary character considered as fixed personality is a misunderstanding of how characters function in the texts that contain them.

We argue, following Eco [5] and more recently Vermeule [14], that a literary character is best understood as a transferable structure rather than a fixed entity. The structure has three components.

The trait substrate: a configuration along familiar psychological dimensions, which can be partially recovered from text by both human and computational means.

The narrative grammar: the patterns of perception, judgement, and response that make this character this character rather than another with the same trait substrate. Anna Karenina and Emma Bovary share much of their trait substrate, but inhabit different narrative grammars: Anna is the romantic absolutist, Emma is the romantic dilettante.

The voice: the linguistic surface (the cadences, metaphors, sentence structures, silences) that serves as the character's signature. Voice is partly derivable from substrate and grammar, and partly an irreducible authorial choice.

It is the narrative grammar component that has been missing from synthetic personality design. The trait substrate maps to OCEAN and similar models; the voice maps to style configuration and prompt design. The grammar, that is, the structural shape of how this kind of person makes sense of new events, has had no formal treatment in AI personality design. Yet it is precisely the grammar that produces what Forster [4] called the round character, the character who can be placed in new situations and respond consistently and surprisingly. Without grammar, synthetic personalities are flat in Forster's sense, regardless of trait richness.
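The three-component structure can be written down as a small data type. The sketch below is illustrative only: the class name, the field names, and the numeric values are assumptions, and the two characters are given an identical substrate purely to make the point visible (the argument above claims only that they share much of it).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferableStructure:
    trait_substrate: tuple   # five OCEAN values on an assumed 0..1 scale
    narrative_grammar: str   # how the character makes sense of new events
    voice: str               # linguistic surface


# Placeholder numbers, not an estimate of either character.
shared_substrate = (0.7, 0.3, 0.6, 0.4, 0.8)

anna = TransferableStructure(shared_substrate, "romantic absolutist",
                             "voice A (placeholder)")
emma = TransferableStructure(shared_substrate, "romantic dilettante",
                             "voice B (placeholder)")


def trait_only_view(structure: TransferableStructure) -> tuple:
    """What a purely trait-based model retains of the structure."""
    return structure.trait_substrate


print(trait_only_view(anna) == trait_only_view(emma))  # same point in trait space
print(anna == emma)                                    # different structures
```

A trait-only model sees one persona where the full structure records two; the difference lives entirely in the grammar and voice fields.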

Distillation, not impersonation

A persona seed extracted from literary character is therefore not an impersonation of that character. It is a distillation of the character's transferable structure: the trait substrate, the narrative grammar, the voice fingerprint, abstracted from the specific narrative context in which the character lives. This distillation can then be composed with brand-specific or context-specific overlays to produce a persona suitable for deployment.

A concrete example illustrates. The narrative grammar of the Russian yurodivy tradition (the holy fool) [13] involves: speaking truth at moments others would dissemble; absorbing scorn without reciprocation; finding moral clarity in absurd reversal. This grammar can be distilled and offered, perhaps, as a substrate for an ethical chatbot or an ombudsman persona, without requiring that the persona literally claim to be Vasily Blazhenny. The gift of the tradition is the form, not the figure.

This move — distillation rather than impersonation — also addresses the cultural sensitivity question (§ 05). The literary canon of any tradition contains personality structures developed for that tradition's own purposes; these structures can be acknowledged, attributed, and adapted, without colonising the original characters as commercial assets.

§ 04 From text to seed: a four-stage pipeline

The translation of literary character into deployable persona seed proceeds in four stages.

Selection

The first stage is corpus selection. The selection must satisfy three constraints: licensing clarity (preferably public domain or openly licensed), cultural diversity (avoiding the over-representation of the Western canon), and structural diversity (capturing personalities organised around different principles).

A defensible starting corpus includes, but is not limited to: the public domain Western canon (Greek and Roman classics, Norse sagas, Cervantes, Shakespeare, the nineteenth-century novel, Joyce up to 1929); the public domain non-Western canon (the Mahabharata, Tale of Genji, Journey to the West, the Arabian Nights, Sufi poetry); folklore corpora classified under the Aarne-Thompson-Uther index [13b] and preserved in the various national folklore collections; and oral tradition transcriptions where ethically accessible.

For a Romanian researcher, an additional commitment: the inclusion of Romanian literary and folkloric tradition (Eminescu, Sadoveanu, Caragiale, the colinde and snoave traditions, the povești corpus held by archives such as the Folklore Archive Institute) provides personality structures that the international corpus does not. This is both a cultural duty and a commercial differentiator.

Annotation

The second stage is annotation. For each character of interest, a structured analysis records: estimated trait substrate (OCEAN values with confidence intervals), value commitments (priority-ranked), defining events (the load-bearing scenes the character carries), narrative themes (the recurring motifs through which the character interprets new events), voice fingerprint (cadence, lexical preferences, characteristic figures of speech), and narrative grammar (the structural mode of perception and response).

Annotation should be performed by literary scholars in collaboration with computational tools, not by either alone. The scholarly labour ensures fidelity to the source; the computational tooling ensures structural consistency across the corpus. The annotation schema should itself be a published artefact, citable and reusable, so that other research groups can extend the corpus.
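As a minimal sketch of what a shared annotation record might look like, the fields listed above can be expressed as a typed structure with a simple consistency check. The class names, the 0..1 trait scale, and the specific checks are assumptions made for illustration; the example values are placeholders and do not describe any real character.

```python
from dataclasses import dataclass

OCEAN = ("openness", "conscientiousness", "extraversion",
         "agreeableness", "neuroticism")


@dataclass
class TraitEstimate:
    value: float  # point estimate on an assumed 0..1 scale
    low: float    # lower bound of the confidence interval
    high: float   # upper bound of the confidence interval

    def is_valid(self) -> bool:
        return 0.0 <= self.low <= self.value <= self.high <= 1.0


@dataclass
class CharacterAnnotation:
    character: str
    source_work: str
    trait_substrate: dict    # axis name -> TraitEstimate
    value_commitments: list  # ordered, highest priority first
    defining_events: list    # load-bearing scenes
    narrative_themes: list   # recurring motifs
    voice_fingerprint: dict  # cadence, lexical preferences, figures of speech
    narrative_grammar: list  # structural modes of perception and response

    def problems(self) -> list:
        """Return human-readable issues; an empty list means the record passes."""
        issues = []
        for axis in OCEAN:
            estimate = self.trait_substrate.get(axis)
            if estimate is None:
                issues.append(f"missing trait estimate: {axis}")
            elif not estimate.is_valid():
                issues.append(f"inconsistent interval: {axis}")
        if not self.narrative_grammar:
            issues.append("narrative grammar is empty")
        return issues


example = CharacterAnnotation(
    character="(placeholder character)",
    source_work="(placeholder work)",
    trait_substrate={axis: TraitEstimate(0.5, 0.4, 0.6) for axis in OCEAN},
    value_commitments=["(highest-priority value)", "(second value)"],
    defining_events=["(load-bearing scene)"],
    narrative_themes=["(recurring motif)"],
    voice_fingerprint={"cadence": "(description)", "lexicon": "(description)"},
    narrative_grammar=[],  # deliberately left empty to trigger the check
)
print(example.problems())
```

A check of this kind is the sort of structural consistency that computational tooling can enforce across a corpus, while the content of each field remains a scholarly judgement.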

Distillation

The third stage is distillation: the conversion of an annotated character into a persona seed. The seed is a structured document, expressed in a portable schema (we suggest JSON-LD with a public vocabulary), containing the transferable structure but stripped of context-specific particularities.

A seed is therefore not a character; it is a characterisation. The character's name, biography, and historical setting are preserved as provenance metadata, but do not enter the seed's operational fields. What enters is the trait substrate, the narrative grammar, the voice fingerprint, and the defining-event templates that can be re-instantiated in a new deployment context.

The distillation step should be lossy in a controlled way. A seed should not attempt to capture everything about its source character; it should capture only what is transferable. Some literary qualities — the specifically sixteenth-century Spanish quality of Don Quixote's chivalric register — are not transferable, and the distillation process documents them as untransferable rather than smoothing them into something they are not.
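The separation between operational fields and provenance metadata can be sketched as follows. This is an illustration under stated assumptions rather than a proposed standard: the vocabulary URL is a hypothetical placeholder, the field names are invented for the example, and the annotation values are placeholders.

```python
import json

# Hypothetical vocabulary URL; a real deployment would publish its own.
SEED_CONTEXT = {"@vocab": "https://example.org/persona-seed#"}

OPERATIONAL_FIELDS = ("trait_substrate", "narrative_grammar",
                      "voice_fingerprint", "defining_event_templates")
PROVENANCE_FIELDS = ("character", "source_work", "historical_setting")


def distill(annotation: dict, untransferable: list) -> dict:
    """Split an annotated character into a JSON-LD-shaped persona seed.

    Name, work, and setting go to provenance only; qualities judged
    untransferable are documented rather than silently dropped.
    """
    return {
        "@context": SEED_CONTEXT,
        "@type": "PersonaSeed",
        "operational": {key: annotation[key] for key in OPERATIONAL_FIELDS},
        "provenance": {key: annotation[key] for key in PROVENANCE_FIELDS},
        "untransferable": list(untransferable),
    }


annotation = {
    "character": "Don Quixote",
    "source_work": "Don Quixote (Cervantes)",
    "historical_setting": "early seventeenth-century Spain",
    "trait_substrate": {"openness": 0.5},  # placeholder value
    "narrative_grammar": ["(placeholder pattern)"],
    "voice_fingerprint": {"cadence": "(placeholder)"},
    "defining_event_templates": ["(placeholder template)"],
}

seed = distill(annotation, untransferable=["chivalric register of the source text"])
print(json.dumps(seed, indent=2, ensure_ascii=False))
```

Keeping the untransferable qualities as an explicit list makes the loss auditable: a later reader of the seed can see what was left behind and why.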

Adjudication and deployment

The fourth stage is adjudication. A seed is not yet a persona. To become a persona suitable for deployment, the seed must be composed with brand-specific specifications (voice, name, vocabulary constraints), regulatory specifications (compliance flags, content restrictions), and context-specific specifications (the property, the role, the language community). The result is a persona document in the same schema, signed by the operator, deployable to humanoid hardware.

Adjudication is the moment at which scholarly distillation becomes commercial product. It is also the moment at which legal, ethical, and brand considerations enter. We argue that adjudication should be a shared responsibility of three parties: the seed's annotators (responsible for fidelity and cultural sensitivity), the deploying brand's signatories (responsible for commercial fitness and legal compliance), and an independent reviewer (responsible for the public interest). Without the third party, adjudication risks becoming a one-way commercial extraction from cultural inheritance.
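The composition and the three-party responsibility can be sketched as a single gate. The function name, the overlay keys, and the signatory labels below are hypothetical; the sketch shows only that a persona document is refused when any overlay or any of the three sign-offs is missing.

```python
REQUIRED_OVERLAYS = ("brand", "regulatory", "context")
REQUIRED_SIGNATORIES = ("annotators", "brand", "independent_reviewer")


def adjudicate(seed: dict, overlays: dict, signoffs: set) -> dict:
    """Compose a seed with its overlays, refusing incomplete adjudications."""
    missing_overlays = [name for name in REQUIRED_OVERLAYS if name not in overlays]
    missing_signoffs = [who for who in REQUIRED_SIGNATORIES if who not in signoffs]
    if missing_overlays or missing_signoffs:
        raise ValueError(
            f"cannot deploy: overlays missing {missing_overlays}, "
            f"sign-offs missing {missing_signoffs}"
        )
    persona = {"seed": seed, "signed_by": sorted(signoffs)}
    for name in REQUIRED_OVERLAYS:
        persona[name] = overlays[name]
    return persona


seed = {"narrative_grammar": ["(placeholder pattern)"]}
overlays = {
    "brand": {"name": "(placeholder name)"},
    "regulatory": {"content_restrictions": ["(placeholder)"]},
    "context": {"role": "(placeholder role)", "language": "ro"},
}

# Without the independent reviewer the composition is rejected.
try:
    adjudicate(seed, overlays, {"annotators", "brand"})
except ValueError as error:
    print(error)

persona = adjudicate(seed, overlays, {"annotators", "brand", "independent_reviewer"})
print(persona["signed_by"])
```

Making the third signature a hard requirement, rather than an optional review step, is one way to encode the concern about one-way commercial extraction directly into the deployment path.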

§ 05 Ethical and cultural considerations

Attribution and compensation

When a persona's structure is materially derived from a literary or folkloric source, the lineage should be visible, both to operators (in provenance metadata) and to users (on disclosure). A persona deployed in a hotel lobby, here called the Bonvoy persona, may be largely brand-specific, but if it carries a 30 per cent contribution from a Sadoveanu narrator's voice fingerprint, that should be recorded.

For sources still under copyright, attribution is the floor; compensation is the ceiling. Estate licensing arrangements are familiar territory in adaptation rights and should be applied here.

For public domain sources, attribution remains a duty even in the absence of legal requirement. The literary tradition is not a free resource; it is an inheritance, and inheritances carry obligations of stewardship.

The folkloric question

The case of folkloric and oral tradition is particularly delicate. Unlike published literature, folklore typically lacks a single named author and lives within a community of practitioners. Its translation into commercial personality seeds risks the well-documented dynamics of cultural extraction, where intangible cultural heritage is monetised by external actors with no benefit returning to the originating community [4b].

We argue that folkloric sourcing should proceed only through institutional partnership with bodies that hold the cultural authority to license such use — folklore archives, indigenous cultural authorities, or community-elected stewards. This constraint slows commercial deployment, deliberately. It also opens a productive role for institutions such as the Folklore Archive Institute: as licensors and adjudicators of cultural personality structures, with revenue or research benefit returning to the originating tradition.

Disclosure and the risk of confusion

A persona derived in part from a beloved literary character risks creating confusion: a user may believe they are conversing with a representation of the author's vision, when in fact they are conversing with a commercially-deployed distillation. This confusion is harmful in proportion to the user's investment in the original.

We suggest two mitigations. First, the persona should never claim to be the source character. The Bonvoy persona may carry traces of Sadoveanu, but should not claim to be a Sadoveanu narrator. Second, on direct inquiry, the persona's lineage should be disclosable. "My voice was shaped in part by characters from the Romanian literary tradition" is an honest answer; "I am Constantin Tomșa from Hanu Ancuței" would be a deceptive one.

These are not legal constraints in most jurisdictions. They are cultural constraints, and they distinguish stewardship from extraction.
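The two mitigations can be expressed as a pair of simple checks. The sketch below is deliberately naive: the function names are hypothetical, and the identity check is a plain substring match that a real system would need to replace with something far more robust. It uses the two example utterances given above.

```python
def lineage_answer(traditions: list) -> str:
    """Build an honest disclosure from the recorded lineage."""
    return ("My voice was shaped in part by characters from "
            + " and ".join(traditions) + ".")


def claims_source_identity(utterance: str, source_characters: list) -> bool:
    """Naive check: does the utterance assert 'I am <source character>'?"""
    text = utterance.casefold()
    return any(f"i am {name.casefold()}" in text for name in source_characters)


sources = ["Constantin Tomșa"]

honest = lineage_answer(["the Romanian literary tradition"])
deceptive = "I am Constantin Tomșa from Hanu Ancuței"

print(honest)
print(claims_source_identity(honest, sources))     # disclosure passes
print(claims_source_identity(deceptive, sources))  # identity claim is flagged
```

The point of the sketch is the asymmetry: lineage is something the persona may state, identity with the source is something it may not.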

The diversity obligation

A literary canon over-weighted toward Western, male, and modern voices will produce, mechanically, synthetic personalities that share that bias. This is unacceptable in a deployment context that may be global. The corpus selection stage of the pipeline carries an obligation to actively counter-balance: to source from women writers, from non-Western traditions, from oral and folkloric corpora, from minor and translated literatures.

This is not solely an ethical commitment; it is also a commercial one. A humanoid deployed in Bucharest, Lagos, Mumbai, or Lima needs a personality that is legible to local users. A personality library composed exclusively of nineteenth-century English novels will fail those deployments commercially even before it fails them ethically.

§ 06 Conclusions and implications

This paper has argued that the design of synthetic personalities for embodied artificial intelligence should be substantially rebuilt around the resource of human literary and folkloric tradition. We have argued that the dominant approaches — trait parameterisation and prompt specification — are individually insufficient for the scale and stakes of emerging deployment contexts. We have proposed that the literary canon constitutes a uniquely valuable training resource, and that its translation into deployable persona structures should proceed through a four-stage pipeline: selection, annotation, distillation, adjudication. We have considered the ethical implications and proposed that humanities institutions, particularly those holding folkloric traditions, should occupy a generative rather than merely critical role in this work.

The implications for industry are direct. The commercial competitive advantage in synthetic personality, in the medium term, will accrue to organisations that can credibly trace their persona constructions to identifiable cultural lineages, that can defend the resulting personalities against accusations of generic blandness or cultural extraction, and that can ground their products in scholarly authority. Trait-based and prompt-based personalities are commodities. Cultural lineage is not.

The implications for the humanities are more interesting. For decades, the digital humanities have operated largely as a service discipline: helping the humanities apply computational tools to traditional humanistic questions. The pipeline proposed in this paper inverts that relationship. The humanities, here, are not consumers of computation but producers of structure. The persona seed library is a humanities artefact, generated by humanistic methods, with commercial and social value flowing outward from the humanities to industry. This inversion deserves deliberate cultivation. It offers a path to humanities funding, humanities relevance, and humanities authority that has been scarce in recent decades.

A particular institutional opportunity follows. Folklore archives, throughout Europe and globally, hold large collections of oral and folkloric tradition that have been catalogued, transcribed, and analysed by scholars over many decades. These collections, properly governed, are precisely the literature-canon-of-the-marginal that the persona library demands. The Folklore Archive Institute of the Romanian Academy is one of many such institutions. Its participation in this work would not only contribute to the development of synthetic personality design; it would also establish a generative relationship between traditional folkloric scholarship and contemporary AI development that has been, to date, almost entirely absent.

The next decade will see the deployment, at scale, of artificial agents whose personalities will shape billions of human interactions. The question is not whether these personalities will be designed; they will. The question is from what cultural materials they will be constructed, by whom, with what attribution, and in service of what values. We propose that the literary inheritance of the human species, generously and carefully translated, is the right material; that humanities institutions in partnership with industry are the right designers; that visible attribution is the right standard; and that cultural diversity is the right value. The work is overdue.

References

[1] Adila, D., Zhuang, S. and Zou, J. (2024) 'Behavioural drift in long-context language model conversations', arXiv preprint, arXiv:2403.04567.
[2] Bal, M. (2009) Narratology: Introduction to the theory of narrative. 3rd edn. Toronto: University of Toronto Press.
[3] Booth, W. C. (1961) The rhetoric of fiction. Chicago: University of Chicago Press.
[4] Forster, E. M. (1927) Aspects of the novel. London: Edward Arnold.
[4b] Coombe, R. J. (2003) 'Fear, hope, and longing for the future of authorship and a revitalised public domain in global regimes of intellectual property', DePaul Law Review, 52(4), pp. 1171–1191.
[5] Eco, U. (1979) The role of the reader: Explorations in the semiotics of texts. Bloomington: Indiana University Press.
[6] European Parliament (2024) Regulation (EU) 2024/1689 of the European Parliament and of the Council laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). Official Journal of the European Union, L series.
[7] Genette, G. (1980) Narrative discourse: An essay in method. Translated by J. E. Lewin. Ithaca: Cornell University Press.
[8] Hofstee, W. K. B., de Raad, B. and Goldberg, L. R. (1992) 'Integration of the Big Five and circumplex approaches to trait structure', Journal of Personality and Social Psychology, 63(1), pp. 146–163.
[9] McCrae, R. R. and Costa, P. T. (1987) 'Validation of the five-factor model of personality across instruments and observers', Journal of Personality and Social Psychology, 52(1), pp. 81–90.
[10] McCrae, R. R. and Terracciano, A. (2005) 'Personality profiles of cultures: Aggregate personality traits', Journal of Personality and Social Psychology, 89(3), pp. 407–425.
[11] Reeves, B. and Nass, C. (1996) The media equation: How people treat computers, television, and new media like real people and places. Cambridge: Cambridge University Press.
[12] Sætra, H. S. (2021) 'Social robot deception and the culture of trust', Paladyn, Journal of Behavioural Robotics, 12(1), pp. 276–286.
[13] Ivanov, S. A. (2006) Holy fools in Byzantium and beyond. Oxford: Oxford University Press.
[13b] Uther, H.-J. (2004) The types of international folktales: A classification and bibliography, based on the system of Antti Aarne and Stith Thompson. Helsinki: Suomalainen Tiedeakatemia.
[14] Vermeule, B. (2010) Why do we care about literary characters? Baltimore: Johns Hopkins University Press.

correspondence Liviu Pop · Folklore Archive Institute · Cluj-Napoca · Romania

acknowledgements This work emerges from the gosum project, an applied research initiative on synthetic persona governance for embodied AI.

cite as Pop, L. (2026) 'Persons of the Page: Literature as a source for synthetic personality', gosum working paper v.01, April 2026.