Six Papers, One Thesis: Preserve, Don't Discard
Week 12 · EpiDoc · Computational Analysis and Large-Scale Reuse
The argument that runs through all six papers
Every editorial decision an epigraphist makes (every restoration, every expansion, every dating, every place identification, every editorial hand) is data. Printed publication discards almost all of it. Structured XML markup preserves it. Preservation is the precondition for computational analysis and large-scale reuse.
10,249
EpiDoc source files
(I.Sicily + iAph + IRCyr + IRT)
Every file preserves the full editorial judgement
59,641
hand-attested lemmas
across 6,285 distinct types
Every word is linked to its context and its editor
85.5% ± 2.14pp
cross-corpus lemma consistency
1,042 multi-attested Greek surfaces
Evidence that structured preservation aggregates at scale
CC-BY-4.0
openly licensed, openly downloadable
aggregation-compatible
All FAIR principles satisfied

The figures above are all computed directly from the mounted federation data at /Users/chingyuanwu/Documents/epidoc/isicily/m0-demonstrator/data/json/. Every slide that follows refers back to real data, not hypothetical examples.

Paper 1 · DHQ 3(1) · 2009 slide 1 / 5

Epigraphy in 2017

Hugh Cayless, Charlotte Roueché, Tom Elliott, Gabriel Bodard
Digital Humanities Quarterly 3 (1), 2009. "Changing the Center of Gravity: Transforming Classical Studies Through Cyberinfrastructure"

"It is our aim to ensure that such publication is not just driven by considerations of economy or space, but is developed to meet the academic requirements... The history of epigraphy makes it quite clear that such transitions are natural to the discipline." — written in 2009, projecting where the field needs to be by 2017.

§1.1 · The problem with flat text · slide 2 / 5

The Leiden conventions are typographic, not semantic

Leiden (1931) describes how an inscription should look when printed. Square brackets mean letters were lost from the stone. Subscript dots mean a letter is unclear. Underlining means a previous editor read text the current editor cannot see.

When you put a Leiden text into a database as a text string, the information collapses. Cayless gives the example of underlined text in a flat database field: the underline carries semantic information ("a previous editor read this") but cannot be stored in a plain-text column, so the convention has to be hacked — often with an underscore character, which then conflicts with any other use of underscores.

Worked example · the same inscription, three ways · ALA 81 (late-antique Aphrodisias, honouring Justinian)
Leiden print form
τ̣ὸν εὐσεβέστ̣[α]τον καὶ καλλί νικον ἡμῶν [δεσπό]τ̣ην Φλ(άουιον) [Ἰουστινια]νόν
EpiDoc XML · structured markup
<unclear>τ</unclear>ὸν εὐσεβέσ<unclear reason="damage">τ</unclear><supplied reason="lost">α</supplied>τον καὶ καλλίνικον ἡμῶν <supplied reason="lost">δεσπό</supplied><unclear>τ</unclear>ην <expan>Φλ<supplied reason="abbreviation">άουιον</supplied></expan> <supplied reason="lost" cert="low">Ἰουστινια</supplied>νόν

The XML is more verbose; that is the cost. The benefit is that every editorial decision is now machine-distinguishable: reason="lost" versus reason="abbreviation" versus reason="damage" are three separate categories that a flat [...] collapses into one.
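The three categories can be counted directly. The sketch below is illustrative only: it uses Python's standard xml.etree.ElementTree, takes an excerpt of the ALA 81 encoding, wraps it in a hypothetical <ab> root so that it parses, and leaves out the TEI namespace that real EpiDoc files declare.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Excerpt of the ALA 81 encoding, wrapped in a hypothetical <ab> root.
FRAGMENT = (
    '<ab>εὐσεβέσ<unclear reason="damage">τ</unclear>'
    '<supplied reason="lost">α</supplied>τον καὶ καλλίνικον ἡμῶν '
    '<supplied reason="lost">δεσπό</supplied><unclear>τ</unclear>ην '
    '<expan>Φλ<supplied reason="abbreviation">άουιον</supplied></expan> '
    '<supplied reason="lost" cert="low">Ἰουστινια</supplied>νόν</ab>'
)


def count_interventions(xml_text):
    """Tally <supplied> and <unclear> elements by their @reason value."""
    counts = Counter()
    for el in ET.fromstring(xml_text).iter():
        if el.tag in ("supplied", "unclear"):
            counts[(el.tag, el.get("reason", "unspecified"))] += 1
    return counts


def flat_text(xml_text):
    """What a plain-text column keeps: the letters, without the categories."""
    return "".join(ET.fromstring(xml_text).itertext())


counts = count_interventions(FRAGMENT)
for (tag, reason), n in sorted(counts.items()):
    print(f"{tag:9s} reason={reason:13s} {n}")
print(flat_text(FRAGMENT))
```

The flat string printed on the last line is still readable Greek, but none of the counts above can be recovered from it.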

Cayless's point: "Some conventions are represented differently from others, some may be inconsistently rendered when printed to the screen, some are not widely available. Moreover... their treatment in print, which is perfectly clear and unambiguous, does not work well in a digital environment."
§1.2 · Click a siglum, see what XML preserves · slide 3 / 5

What the Leiden sigla mean, and how XML captures the meaning

Each siglum below is paired with the editorial judgement it encodes, which is exactly what flat text loses:

[ abc ] · Lost text, restored
α̣ · Unclear letter
co(n)s(ul) · Expanded abbreviation
prior (underlined in print) · Read by previous editor
⟦Getae⟧ · Erased in antiquity
<abc> · Editor adds omitted letters
{abc} · Editor deletes superfluous letters
∙ vacat ∙ · Uninscribed space on stone


Each editorial siglum is a categorical distinction that flat text discards but XML preserves as a machine-queryable attribute.
Why this matters for queries · Cayless: "A corpus of inscriptions should be able to be queried for the full list of abbreviations used within it, or for the number of occurrences of a word in its full form, neither abbreviated nor supplemented." A flat-text column cannot answer these questions, because the data needed to answer them was thrown away at the moment of digitisation.
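Both of Cayless's queries can be sketched in a few lines once the markup is present. This is a minimal illustration, not the project's tooling: the Latin sample is hypothetical, it uses the <expan>/<abbr>/<ex> pattern that appears later in this deck, and it omits the TEI namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one abbreviated, one plain, one restored "consul".
SAMPLE = (
    "<ab>"
    "<w><expan><abbr>co</abbr><ex>n</ex><abbr>s</abbr><ex>ul</ex></expan></w> "
    "<w>consul</w> "
    '<w><supplied reason="lost">con</supplied>sul</w> '
    "<w><expan><abbr>lib</abbr><ex>ertus</ex></expan></w>"
    "</ab>"
)


def abbreviations(root):
    """Return (abbreviation as cut, editor's expansion) for every <expan>."""
    pairs = []
    for expan in root.iter("expan"):
        short = "".join(a.text or "" for a in expan.iter("abbr"))
        full = "".join(expan.itertext())
        pairs.append((short, full))
    return pairs


def full_form_count(root, word):
    """Count <w> tokens equal to `word` that are neither abbreviated nor supplied."""
    hits = 0
    for w in root.iter("w"):
        untouched = w.find(".//expan") is None and w.find(".//supplied") is None
        if untouched and "".join(w.itertext()) == word:
            hits += 1
    return hits


root = ET.fromstring(SAMPLE)
print(abbreviations(root))            # [('cos', 'consul'), ('lib', 'libertus')]
print(full_form_count(root, "consul"))  # 1
```

A plain-text search for "consul" over the same sample would report three hits, because the abbreviated and the restored token become indistinguishable from the one actually cut in full.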

Real data · what Sicily's two libraries actually look like

Use the mounted folder /inscription_databases/EDCS_ETL-master/ (the SDAM ETL of EDCS) and the mounted I.Sicily corpus. Sicily's two "libraries" are both there. Compare what each preserves about the same province:

EDCS Sicilia · 6,110 rows · "what aggregation looks like"
Pattern emerges, individuality disappears. From the SDAM ETL aggregation of EDCS Sicilia:
1,417 · tituli sepulcrales (funerary)
1,327 · viri (records mentioning men)
730 · tituli fabricationis (maker's marks)
607 · sigilla impressa (stamps)
482 · tituli possessionis (ownership)
365 · mulieres (records mentioning women)
327 · tria nomina (three-part Roman names)
281 · tituli sacri (sacred)
What you cannot ask: show me what ISic000118 actually says, or who edited it, or which letter has a dot under it because Bivona could read it but the modern editor cannot.
I.Sicily ISic000118 · 1 record · "what encoding preserves"
Individuality lives, pattern requires aggregation. From /m0-demonstrator/data/xml/isicily/ISic000118.xml:
17,073 · characters of EpiDoc XML
275 · lines of source
4 · line-of-text <lb> elements
6 · named editors in <respStmt>
11 · change entries in <revisionDesc>
1 · conglomerate-block material with EAGLE URI
1 · Pleiades place reference (Thermae Himeraeae)
1 · geo-coordinate (37.98365, 13.69555)
5 · identifier crosswalks (TM, EDR, EDCS, URI, DOI)
What you can now ask: who edited this? when did Crellin add the lemmas? which previous editor saw text the current editor cannot see? what stone is this carved on? what other inscriptions are on the same petrographic substrate?

Both are useful. EDCS Sicilia is the only place to learn that funerary inscriptions are 23% of the province; ISic000118 is the only place to learn what one specific funerary inscription says, in what state, on what stone, edited by whom, on what dates. Neither replaces the other; both have to be preserved. EDCS preserves pattern at the cost of individuality. EpiDoc preserves individuality at the cost of pattern-readability. Cayless's argument is that the two registers are differently usable, and that the encoded register has been systematically undervalued because its preserved structure is less visible at first glance.
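The 23% figure follows from the EDCS Sicilia counts listed above. A short sketch of the arithmetic, using only those counts; note that the categories appear to overlap (a funerary record can also mention a man), so the shares are not expected to sum to 100%.

```python
# Counts as listed for EDCS Sicilia (6,110 rows in the SDAM ETL).
EDCS_SICILIA_ROWS = 6110
CATEGORY_COUNTS = {
    "tituli sepulcrales": 1417,
    "viri": 1327,
    "tituli fabricationis": 730,
    "sigilla impressa": 607,
    "tituli possessionis": 482,
    "mulieres": 365,
    "tria nomina": 327,
    "tituli sacri": 281,
}


def shares(counts, total):
    """Share of all rows carrying each category label."""
    return {label: n / total for label, n in counts.items()}


SHARES = shares(CATEGORY_COUNTS, EDCS_SICILIA_ROWS)
for label, share in SHARES.items():
    print(f"{label:22s} {share:6.1%}")
```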

§1.3 · Scholarly primitives · slide 4 / 5

What XML enables that databases cannot

Cayless frames the discussion around John Unsworth's seven "scholarly primitives": discovery, annotation, comparing, referring, sampling, illustrating, representing. Database-driven sites do discovery well — full-text search — but the other six suffer when sigla and structure are flattened away.

Scholarly primitive | What flat text supports | What XML adds
Discovery · search | full-text matching | type-aware queries: "find liberti only when abbreviated", "every text dated after 200 CE", "every restoration marked cert="low""
Annotation | marginal notes lost on copy | standoff annotation persists with the data, citable to line and word
Comparing | side-by-side reading | computed diff: lemma-overlap, formula-overlap, prosopographical overlap, all across thousands of texts
Referring · citation | page number, corpus abbreviation | stable URI / DOI / Trismegistos number resolving to the specific edition
Sampling | flip through | filtered subsets: "all Christian-period funerary stones from Tripolitania", "all stelai whose dating chain mentions an emperor"
Illustrating | plates and figures | IIIF image regions linked to text segments, queryable
Representing | the edition as a printed page | the edition as a structured object: multiple parallel views (diplomatic, normalised, translated, lemmatised) from one source
Preserve, don't discard. Every primitive after "discovery" depends on the structure that XML keeps and flat text loses. The information has to be there for the computation to be possible.
§1.4 · The 2017 vision · slide 5 / 5

Inscriptions as complex digital packages, not spreadsheet rows

Cayless's closing argument: an inscription is "a text situated in a complex environment". It has history, support, find-context, palaeography, language, scholarly genealogy. Treating it as a row in a spreadsheet substitutes an impoverished proxy for the rich object.

The 2017 vision: an epigrapher will compose a local corpus by drawing from multiple online repositories, load it into research tools, analyse it, then publish the dataset alongside the article. "None of this will be possible unless information is published in such a way that it is not concealed behind an interface, but is in addition retrievable in bulk."

A prediction now testable · 2009 vision: by 2017, corpora would be downloadable in bulk, citable to the level of the digital object, distributed in multiple copies, openly licensed.
2025 reality: I.Sicily, IRT2021, IRCyr2020 ship XML on GitHub under CC-BY-4.0; Trismegistos provides cross-corpus identifiers; EpiDoc is the de facto standard; the federation lemma resource of the very demonstrator that hosts this slide deck aggregates 59,641 hand-attested lemmas across four corpora. The vision was largely achieved — slightly later than 2017.

Take-away for Week 12

The student writing inscription data in EpiDoc is not encoding for output's sake. They are preserving editorial judgement as machine-actionable data. Every <supplied reason="lost"> they write is one more record in a dataset that future students, future tools, and future LLMs can use to ask questions that have not yet been asked.

Paper 2 · in Latin on Stone (Rowman & Littlefield) · 2010 slide 1 / 5

EpiDoc: Epigraphic Documents in XML for Publication and Interchange

Gabriel Bodard
In F. Feraudi-Gruénais (ed.), Latin on Stone: Epigraphic Research and Electronic Archives (Lanham, MD: Rowman & Littlefield, 2010), pp. 101–118.

The canonical introduction to EpiDoc from one of its architects. "EpiDoc specifies the use of XML, Extensible Markup Language, an industry standard maintained and documented by the World Wide Web Consortium for communication and storage of structured data."

§2.1 · What EpiDoc is · slide 2 / 5

EpiDoc = TEI subset for inscriptions, optimised for interchange

EpiDoc is not a new schema. It is a specialised set of guidelines for using the Text Encoding Initiative (TEI) XML — the same XML standard used in the literary, linguistic, and manuscript communities. Building on TEI means maximum compatibility with the wider digital-humanities ecosystem.

Origin
1999, Tom Elliott (UNC Chapel Hill), responding to the Rome 1999 Panciera round-table on Epigraphy and Information Technology
Base standard
TEI P5 XML (maintained by the W3C-style TEI Consortium)
First pilot
Inscriptions of Aphrodisias (Roueché et al.) — the workflows and tools that grew from this pilot now serve dozens of projects
Conformance levels
Leiden-conformant EpiDoc: minimum bracketed text marked unambiguously.
EAGLE-conformant EpiDoc: enough metadata to convert into EDH / EDR / EDB.
Tools shipped
Web-Application (Cocoon), XSL stylesheets, CHET-C (Leiden→XML), Crosswalker (XML↔DB)
Bodard's central claim: "XML, unlike many mark-up and publishing systems... does not merely encode the appearance of a text, but can also embed information about its structure and semantics. Appearance in any given form... will be handled by a set of stylesheets."

The architectural separation matters: the data (what the inscription is, what was lost, what was restored, who edited it, when, on what authority) is separate from the view (HTML for the web, PDF for print, audio for the visually impaired, a database row for aggregation). One XML source supports many views; flat-text publication couples the two and discards everything that wasn't needed for the printed page.

§2.2 · ALA 2 worked example · slide 3 / 5

One inscription, three ways to read it

Bodard's worked example is the Aphrodisias text ALA 2 — a fragmentary statue base honouring Salonina (Julia Cornelia Salonina Augusta). The text below is the one published in Aphrodisias in Late Antiquity (Roueché 1989, ALA 2).

Worked example · ALA 2 (Aphrodisias, 3rd cent. CE) honouring Julia Cornelia Salonina Augusta
Leiden transcription
[Ἰουλίαν Κορνη] λί̣αν Σαλω̣ν̣[εῖ] ναν Σεβαστὴ̣[ν]    vacat ἡ λαμπροτάτη Ἀ φροδει̣σ̣[ι]έων πό scroll [λις] scroll
EpiDoc XML · Leiden-conformant encoding
<lb/><supplied reason="lost">Ἰουλίαν Κορνη</supplied><lb type="worddiv"/>λ<unclear>ί</unclear>αν Σαλ<unclear>ων</unclear><supplied reason="lost">εῖ</supplied><lb type="worddiv"/>ναν Σεβαστ<unclear>ὴ</unclear><supplied reason="lost">ν</supplied> <lb/><space extent="1" unit="line"/> <lb/>ἡ λαμπροτάτη Ἀ<lb type="worddiv"/>φροδε<unclear>ισ</unclear><supplied reason="lost">ι</supplied>έων πό<lb type="worddiv"/><g type="scroll"/><supplied reason="lost">λις</supplied><g type="scroll"/>

The XML is verbose, but every decision now has a name. <lb type="worddiv"/> says a word is broken across the line-end. <unclear> says this letter can be read but with some doubt. <supplied reason="lost"> says I, the editor, supplied this restoration because the surface was lost. <g type="scroll"/> says there is a non-textual glyph (a scroll mark) at this location on the stone. The flat [...] said only "there is something here you cannot see."

The same data, additional layers

Once a word is in the XML, additional information can be attached non-destructively. Two examples from Bodard's discussion:

<!-- Word marked, lemma (= dictionary form) attached as attribute -->
<w lemma="λαμπρός">λαμπροτάτη</w>
<!-- Name marked, regularised form pointed to by @ref (e.g. an onomastic authority list of Greek names), spans a line-break without breaking -->
<name ref="#Σαλωνῖνα">Σαλωνεῖ<lb type="worddiv"/>ναν</name>
The point is layered preservation. A flat Leiden text could not say "this is the proper name Salonina, broken across a line-end, attested elsewhere under the same canonical form." The XML can — without losing the original line break, the original spelling, or the original reading.
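A short sketch of how such layers can be read back out. It assumes Python's standard xml.etree.ElementTree and wraps the two elements from Bodard's discussion in a hypothetical <ab> root; the record field names are illustrative, not part of EpiDoc.

```python
import xml.etree.ElementTree as ET

# The two elements discussed above, wrapped in a hypothetical <ab> root.
SNIPPET = (
    '<ab><w lemma="λαμπρός">λαμπροτάτη</w> '
    '<name ref="#Σαλωνῖνα">Σαλωνεῖ<lb type="worddiv"/>ναν</name></ab>'
)


def describe(el):
    """Collect every layer attached to one token without altering the source."""
    return {
        "element": el.tag,
        # itertext() joins the text on both sides of the <lb/>, so the
        # word reads continuously while the line break stays in the XML.
        "surface": "".join(el.itertext()),
        "lemma": el.get("lemma"),
        "ref": el.get("ref"),
        "broken_across_line": el.find(".//lb") is not None,
    }


root = ET.fromstring(SNIPPET)
records = [describe(el) for el in root.iter() if el.tag in ("w", "name")]
for record in records:
    print(record)
```

The name comes back as one continuous surface form together with the fact that it is split across a line-end, so neither reading displaces the other.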

Real editorial chain · ISic000118 over 9 years

Bodard argues that EpiDoc captures information no print edition could carry: the full editorial history of a record, with named contributors and date stamps. Here is the actual <revisionDesc> from ISic000118.xml in the mounted demonstrator, a single funerary epitaph whose digital edition records 11 changes by 5 contributors over almost 9 years (the file's <respStmt> names six people, including the petrographer):

<revisionDesc status="draft">
  <listChange>
    <change when="2016-12-03" who="#JCu">James Cummings autogenerated EpiDoc output from database</change>
    <change when="2019-01-18" who="#TA">Tuuli Ahlholm cleaned up the autogenerated text and added an apparatus and translation</change>
    <change when="2020-10-05" who="#SS">Simona Stoyanova normalised Unicode</change>
    <change when="2020-10-08" who="#SS">Simona Stoyanova updated list of languages</change>
    <change when="2020-11-20" who="#SS">Simona Stoyanova added EDCS numbers</change>
    <change when="2020-11-26" who="#SS">Simona Stoyanova restructured bibliography</change>
    <change when="2020-12-17" who="#JP">Updated Zenodo DOI</change>
    <change when="2021-01-19" who="#SS">renumbered files, uris and references</change>
    <change when="2024-04-10" who="#JP">Jonathan Prag revised from publications, added image</change>
    <change when="2025-06-16" who="#RC">Robert Crellin provided lemmatizations</change>
    <change when="2025-08-13" who="#RC">Robert Crellin moved global IDs to inner w elements</change>
  </listChange>
</revisionDesc>

Each who="#JCu" reference resolves to a <respStmt> earlier in the file. Some of those entries carry ORCID identifiers — Prag's http://orcid.org/0000-0003-3819-8537, Stoyanova's 0000-0003-3914-9569, Cummings's 0000-0002-6686-3728, Ahlholm's 0000-0001-8417-7089, Crellin's 0000-0002-0100-7437, the petrographer Coccato's 0000-0002-6641-2820 — making the editorial history not just citable but globally addressable to real persistent researcher identities.

What this preserves that print discards: the sequence in which the digital edition came into being, the named scholar responsible for each step, the date stamp, the typology of the change ("normalised Unicode", "added EDCS numbers", "provided lemmatizations" are categorically different operations). A reader citing this inscription in 2030 can know which version they are citing — because versions exist as recorded data.
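Because the history is data, it can be summarised mechanically. The sketch below copies the eleven <change> entries quoted above into a string (without the TEI namespace of the real file) and tallies them with the standard library; it is an illustration, not part of the demonstrator.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from datetime import date

REVISION_DESC = """<revisionDesc status="draft"><listChange>
<change when="2016-12-03" who="#JCu">James Cummings autogenerated EpiDoc output from database</change>
<change when="2019-01-18" who="#TA">Tuuli Ahlholm cleaned up the autogenerated text and added an apparatus and translation</change>
<change when="2020-10-05" who="#SS">Simona Stoyanova normalised Unicode</change>
<change when="2020-10-08" who="#SS">Simona Stoyanova updated list of languages</change>
<change when="2020-11-20" who="#SS">Simona Stoyanova added EDCS numbers</change>
<change when="2020-11-26" who="#SS">Simona Stoyanova restructured bibliography</change>
<change when="2020-12-17" who="#JP">Updated Zenodo DOI</change>
<change when="2021-01-19" who="#SS">renumbered files, uris and references</change>
<change when="2024-04-10" who="#JP">Jonathan Prag revised from publications, added image</change>
<change when="2025-06-16" who="#RC">Robert Crellin provided lemmatizations</change>
<change when="2025-08-13" who="#RC">Robert Crellin moved global IDs to inner w elements</change>
</listChange></revisionDesc>"""


def load_changes(xml_text):
    """Return (date, contributor reference, description) for each <change>."""
    changes = []
    for change in ET.fromstring(xml_text).iter("change"):
        when = date.fromisoformat(change.get("when"))
        changes.append((when, change.get("who"), (change.text or "").strip()))
    return changes


changes = load_changes(REVISION_DESC)
per_contributor = Counter(who for _, who, _ in changes)
first = min(when for when, _, _ in changes)
last = max(when for when, _, _ in changes)

print(len(changes), "changes by", len(per_contributor), "contributors")
print(per_contributor.most_common())
print("from", first, "to", last)
```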
§2.3 · Transformations · slide 4 / 5

One source, many outputs, via XSLT

The capability that makes structured preservation worth the cost: multiple deliverables from one source, generated automatically.

EpiDoc XML (one source file) → XSLT (transformation script) → many outputs (HTML · PDF · DB · audio · index)

XSLT stylesheets applied to that one source can produce:
  • HTML for the web — Leiden-style transcription with interactive footnotes
  • PDF for typeset print — print-on-demand volume from the same data
  • Diplomatic version — strip out editorial restorations to show what is actually on the stone
  • Indexed onomasticon — automatically generated from <name> elements across the corpus
  • Concordance — every attestation of every <w lemma="..."> grouped by lemma
  • Audio — for visually impaired readers, with editorial conventions read aloud
  • Database row — for aggregation into EDH, EDR, EDCS or Wikidata
  • Translation alignment — line-by-line parallel text in any other language
Bodard's principle: "The epigraphist need only do the intellectual work of compiling her publication once, and the content will always reflect this master version of the work."

The corollary: every output is disposable and can be regenerated. If a new question requires a different output, a new XSLT is written, the same XML is re-transformed, and the new output appears. The data has not been touched, and does not need to be, because it already preserves the structure the new question depends on.
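The one-source, many-views idea can be sketched without an XSLT processor. In the example below Python stands in for XSLT, which is an assumption made only to keep the example self-contained; the Latin source line is hypothetical and the render function is an illustration, not the EpiDoc stylesheets.

```python
import xml.etree.ElementTree as ET

# Hypothetical source: "Dis Manibus" with two restored letters and an abbreviation.
SOURCE = (
    '<ab>D<supplied reason="lost">is</supplied> '
    "<expan><abbr>M</abbr><ex>anibus</ex></expan></ab>"
)


def render(el, drop=frozenset(), wrap=None):
    """Serialise text, dropping some elements and bracketing others."""
    wrap = wrap or {}
    if el.tag in drop:
        return ""
    inner = el.text or ""
    for child in el:
        # A child's tail belongs to the parent's text flow, so it is kept
        # even when the child itself is dropped.
        inner += render(child, drop, wrap) + (child.tail or "")
    before, after = wrap.get(el.tag, ("", ""))
    return before + inner + after


root = ET.fromstring(SOURCE)
edited = render(root)
leiden = render(root, wrap={"supplied": ("[", "]"), "ex": ("(", ")")})
diplomatic = render(root, drop={"supplied", "ex"})

print(edited)      # Dis Manibus
print(leiden)      # D[is] M(anibus)
print(diplomatic)  # D M
```

All three views come from the same unmodified source; adding a fourth view means writing another rendering rule, not re-editing the inscription.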

§2.4 · Skills and standing · slide 5 / 5

What a project needs · what a scholar needs to learn

Bodard is honest about the cost. EpiDoc requires more skill than typing Leiden into Word. But the skills are not technical novelties — they are editorial decisions made explicit.

  • Scholar: learns to read EpiDoc XML well enough to review work; understands what each tag means as an editorial distinction; engages with the EpiDoc community via lists and fora.
  • Programmer / RA: customises XSLT for the project; adapts CHET-C or similar Leiden→XML converter; ties the output into a web design.
  • Web designer: wraps the XSLT-generated HTML into a coherent site.
  • Not needed: a proprietary database system that locks the data away; reinventing tag conventions; software that depends on a single vendor.

Take-away for Week 12

Bodard's paper is the practical companion to Cayless's vision. Cayless asks why preserve; Bodard answers how. When the student in Week 12 writes their first <supplied reason="lost">, they are doing exactly what Bodard's paper describes — turning editorial judgement into machine-readable structure. The verbosity is the visible cost; the reusability is the invisible benefit, paid back many times over as the same data is rendered, indexed, aggregated, translated, and queried.

One source, many uses. The data is preserved once. Every future tool that wants to query inscriptions for restorations marked cert="low", or count abbreviations expanded by type, or list every editor who has touched a given file via <revisionDesc> — every such tool can do so without re-editing the inscriptions. Because the data is there.
Paper 3 · ARQUEOLÓGICA 2.0, abstract · 2021 slide 1 / 4

Integrating Palaeographic Research into the Digital Epigraphy of Multilingual Sicily

Simona Stoyanova & Jonathan Prag (University of Oxford)
Abstract for the ARQUEOLÓGICA 2.0 / Trinacria conference, May 2021. Tied to the ERC Advanced Grant project Crossreads at Oxford.

Beyond text: what happens when palaeography — the shape of the letters, the surface they were cut into, the choices the cutter made — also becomes structured, queryable, computable data.

§3.1 · Crossreads' five modules · slide 2 / 4

An inscription has more than text

Sicily across 1,500 years carries Greek, Latin, Punic, Elymian, Sikel, and Oscan writing on stone, ceramic, brick, metal, and wax. The Crossreads project asks what happens when every dimension of these inscriptions — not just the text — is preserved as linked structured data.

The research questions enabled: Are letterforms different depending on material? Does the stone-type determine the choice of forms? Are letterforms different in private versus public inscriptions? Do letterforms cross over between languages?

None of these questions is answerable from a flat-text corpus. They require palaeographic annotation, material identification, geographic data, dating, and language classification to be linked simultaneously. The EpiDoc structure is the substrate that makes the linkage possible.
§3.2 · How preservation enables the question · slide 3 / 4

Connecting letterforms to text, material, date, place

Stoyanova & Prag describe what they call the "main challenge" — designing an effective mechanism to integrate information from multiple datasets to enable rich and complex queries. The integration is achievable only because each layer was, individually, preserved as structured data with stable identifiers and linked-open-data anchors.

An example from a real I.Sicily file (ISic000118, a funerary epitaph from Thermae Himeraeae) shows what structured preservation of just one dimension — material — looks like:

<material ana="#material.inorganic.stone" type="stone.unspecified" subtype="unspecified" resp="#Coccato" ref="http://www.eagle-network.eu/voc/material/lod/74.html">conglomerate</material>

This single element preserves: the material name in English (conglomerate); a taxonomy anchor (#material.inorganic.stone); a typological classification (type, subtype); the petrographic specialist who identified it (Coccato, with ORCID elsewhere in the file); and a URI to the EAGLE vocabulary entry that aligns this term with every other epigraphic project's controlled vocabulary for materials.

The flat-data alternative is a column "material" holding the string "conglomerate". You cannot ask "which specialist identified the material?", "show me everything Coccato has identified as conglomerate", or "show me all inscriptions on the same EAGLE material URI", because the structure that would answer those questions was never recorded.
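Those three questions become attribute filters once the element is structured. In the sketch below only the ISic000118 <material> element comes from the file quoted above; the <corpus>/<record> wrapper, the two HYPOTHETICAL records, and the urn:example URI are invented for illustration.

```python
import xml.etree.ElementTree as ET

EAGLE_CONGLOMERATE = "http://www.eagle-network.eu/voc/material/lod/74.html"

# First record: the real <material> element. The other two are hypothetical.
CORPUS = f"""<corpus>
  <record id="ISic000118">
    <material ana="#material.inorganic.stone" type="stone.unspecified"
              subtype="unspecified" resp="#Coccato"
              ref="{EAGLE_CONGLOMERATE}">conglomerate</material>
  </record>
  <record id="HYPOTHETICAL-1">
    <material resp="#Coccato" ref="{EAGLE_CONGLOMERATE}">conglomerate</material>
  </record>
  <record id="HYPOTHETICAL-2">
    <material resp="#SomeoneElse" ref="urn:example:material:limestone">limestone</material>
  </record>
</corpus>"""


def records_where(root, **attrs):
    """IDs of records whose <material> carries all the given attribute values."""
    hits = []
    for record in root.iter("record"):
        material = record.find("material")
        if material is not None and all(
            material.get(name) == value for name, value in attrs.items()
        ):
            hits.append(record.get("id"))
    return hits


root = ET.fromstring(CORPUS)
print(records_where(root, resp="#Coccato"))        # identified by Coccato
print(records_where(root, ref=EAGLE_CONGLOMERATE))  # same EAGLE material URI
```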

Crossreads extends this pattern to letterforms

Each letterform observation — alpha with broken bar, alpha with straight bar, sigma four-bar versus lunate — is recorded against the IIIF image region, attributed to the palaeographer who made the observation, anchored to the EpiDoc text-position via stand-off markup, queryable against the material, date, place, and language of the inscription. The palaeographer's judgement becomes data the way the editor's restoration became data in Bodard's example.

§3.3 · The pattern generalises · slide 4 / 4

Every research dimension is a candidate for structured preservation

The Crossreads project demonstrates a pattern that runs through all six papers in this lecture: any aspect of an inscription that a scholar observes carefully enough to publish, is also an aspect that — if preserved as structured data — supports computational analysis at scale.

Observation | Print form | Structured form (XML/LOD)
The text | Leiden brackets | <supplied reason="lost">, <unclear>, <expan>...
The material | "marble" in the description | <material> with EAGLE URI + responsible specialist
The findspot | "Termini Imerese (anc. Thermae Himeraeae)" | <origPlace ref="pleiades:462513"> + geo-coordinate
The date | "Imperial" in a footnote | <origDate notBefore="0001" notAfter="0250" cert="low">
The hand | "the cutter was inexpert" | <handNote> with measurements + IIIF region links to letter samples
The letterform | "alpha with broken bar" in a plate caption | Archetype annotation, image-region-addressable, queryable by date and material
The editorial chain | bibl entries in tiny print | <respStmt> with ORCID + <revisionDesc> with timestamped changes

Real evidence · different corpora preserve different things

The rubric scored at population scale (rubric_full.json, 10,249 source files across all four corpora) shows a striking pattern: encoding traditions specialise. Each corpus is strong on different rubric axes, exactly as Stoyanova & Prag predict for a multilingual, multi-period, multi-material region like Sicily — where no single encoding decision is universally optimal.

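Rubric axis (median per corpus) | I.Sicily (n=4,782) | IRCyr (n=2,360) | IRT (n=1,618) | iAph (n=1,489)
① Semantic layers (distinct content layers) | 10 🟢 | 9 | 9 | 7
② URI density (outbound LOD links) | 12 🟢 | 5 | 6 | 0 ⚠
③ respStmt depth (editorial responsibility chain) | 4 | 7 🟢 | 4 | 6
④ Citability granularity | 2 | 3 | 3 | 3
⑤ Question coverage (question types answerable) | 6 | 9 | 10 🟢 | 8
⑥ Provenance (traceability) | 3 | 3 | 3 | 1
⑦ Bibl genealogy (depth of scholarly lineage) | 4 | 2 | 4 | 4
Total median | 42 | 38 | 38 | 28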
What the numbers reveal: I.Sicily's 2020s encoding tops semantic layers and URI density, thanks to a modern build pipeline with full EAGLE / Pleiades / Getty AAT URIs. IRCyr tops respStmt depth: the King's College London editorial chain is long and explicit. IRT tops question coverage, with comprehensive metadata for every record. iAph scores 0 on outbound URIs (2007-era encoding), but its bibl genealogy chains run back to Sherard 1716; the rubric does not capture this fully, but the scholarly value is there.

No single corpus dominates every axis. Stoyanova & Prag's argument for multi-modular preservation — text + linguistic annotation + petrography + IIIF + palaeography — extends this insight: different research dimensions require different preservation moves, and the encoding format has to support all of them simultaneously without forcing trade-offs.
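The "no single corpus dominates" claim can be checked against the per-axis medians in the rubric table. The sketch below hard-codes those medians as read from the table on this slide (it does not load rubric_full.json) and treats ties as shared leadership, which is an assumption of this example.

```python
CORPORA = ("I.Sicily", "IRCyr", "IRT", "iAph")

# Median score per axis, in the column order of CORPORA.
RUBRIC_MEDIANS = {
    "semantic layers": (10, 9, 9, 7),
    "URI density": (12, 5, 6, 0),
    "respStmt depth": (4, 7, 4, 6),
    "citability granularity": (2, 3, 3, 3),
    "question coverage": (6, 9, 10, 8),
    "provenance": (3, 3, 3, 1),
    "bibl genealogy": (4, 2, 4, 4),
}


def leaders(scores):
    """Corpora that reach the top median on one axis (ties share the lead)."""
    top = max(scores)
    return [corpus for corpus, score in zip(CORPORA, scores) if score == top]


LEADERS = {axis: leaders(scores) for axis, scores in RUBRIC_MEDIANS.items()}
for axis, names in LEADERS.items():
    print(f"{axis:24s} {', '.join(names)}")

# A corpus dominates only if it is among the leaders on every axis.
dominant = set(CORPORA).intersection(*map(set, LEADERS.values()))
print("dominant on every axis:", sorted(dominant))
```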

Take-away for Week 12

EpiDoc is not only for the inscription's text. It is for everything the scholar observes about the inscription — including aspects (letterforms, petrography, editorial hands) that traditional print could not even document well. The XML structure is what makes those observations linkable to each other, and therefore queryable, and therefore actually usable for the kinds of large-scale historical and linguistic questions the field wants to ask.

Connected to the demonstrator: the I.Sicily file ISic000118 referenced on this slide ships in this very workshop bundle at /data/xml/isicily/ISic000118.xml. Open it: you will find the full material, palaeographic, geographic, and editorial-chain structure preserved exactly as Stoyanova & Prag's framework describes.
Paper 4 · ACM JOCCH 16(3) · 2023 slide 1 / 5

Describing Inscriptions of Ancient Italy: The ItAnt Project and Its Information Encoding Process

Francesca Murano · Valeria Quochi · Angelo Mario Del Grosso · Luca Rigobianco · Mariarosaria Zinzi
ACM Journal on Computing and Cultural Heritage 16, 3 (August 2023), pp. 1–14. DOI: 10.1145/3606703

The detailed case-study of customising EpiDoc for fragmentary, non-canonical languages (Oscan, Faliscan, Venetic, Cisalpine Celtic) where the standard tagset, designed for Greek and Latin, requires deliberate extension — without breaking interoperability.

§4.1 · The challenge · slide 2 / 5

What if your inscription is in Faliscan, not Latin?

The ItAnt project (Italian PRIN, CNR-ILC Pisa + Ca' Foscari Venice + University of Florence) focuses on the languages of pre-Roman Italy attested only in epigraphic form — Restsprachen, "remnant languages." Their epigraphic record is fragmentary, often uncertain in reading and segmentation, and presents challenges the standard EpiDoc tagset (built first for Greek and Latin) does not directly handle.

Language | Period | Script(s) | Particular challenges
Oscan | 5th c. BCE – 1st c. CE | Oscan, Latin, Greek | Same language in three different scripts → cannot collapse language and script into one attribute
Faliscan | 7th – 2nd c. BCE | Faliscan, Latin | Scriptio continua (no word division on stone) → word-boundary decisions are editorial
Venetic | 6th c. BCE – 1st c. BCE | Venetic (North Italic) | Syllabic punctuation marks syllables, not words → a notation system with no EpiDoc precedent
Cisalpine Celtic | 6th c. BCE – 1st c. CE | Lepontic, Latin | Linguistic identification of fragments is itself contested → uncertainty has to be encoded

Murano et al.'s response is not to invent a new schema. It is to extend EpiDoc with carefully chosen additions that record exactly the distinctions these materials require — preserving them as structured data, while keeping interoperability with the international EpiDoc community.

§4.2 · Five customisations · slide 3 / 5

The ItAnt EpiDoc extensions

Principle: extend the standard rather than replace it. Every ItAnt customisation is achievable with controlled vocabularies inside the EpiDoc / TEI tagset (mostly <rs> elements with typed @ana or @type attributes). The output XML is still EpiDoc-conformant. Standard EpiDoc tooling can read it. New ItAnt-aware tooling can extract the customised information.
§4.3 · Why these customisations matter · slide 4 / 5

What flat-text would discard

Consider an Oscan inscription written in Greek script. In a flat-text edition you would see the Greek letters and a footnote saying "Oscan." In EpiDoc with ItAnt's customisations:

<scriptNote>
  <rs type="script" ana="#script.greek.oscan-adapted">Greek script (Oscan-adapted form)</rs>
  <rs type="worddiv">punctuation (single dot)</rs>
</scriptNote>
<language ident="osc">Oscan</language>

That XML says, machine-readably:

  • The language is Oscan (ident="osc") — searchable, aggregateable, joinable to Wikidata and linguistic resources.
  • The script is Greek (in its Oscan-adapted form) — independently queryable. "Show me all Oscan inscriptions written in Greek script" is one query; "show me all Oscan inscriptions written in Oscan script" is another. Both are answerable.
  • Words are separated by punctuation, not by spaces — so any text-processing tool knows how to tokenise this particular inscription correctly.
If you collapse language and script into one attribute — the standard EpiDoc behaviour — you cannot ask "which Oscan inscriptions were written in adapted Greek versus native Oscan letters?" That question is part of the historical-linguistic argument about Romanisation, contact, and bilingualism. ItAnt's customisation preserves the data that the argument needs. The argument is now computable, not just narratively assertible.
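A sketch of the two independent queries. The record wrapper, the record IDs, and the vocabulary values "#script.oscan.native" and "#script.latin" are hypothetical; only ident="osc" and "#script.greek.oscan-adapted" come from the example above, and real ItAnt files may nest these elements differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical three-record corpus keeping language and script separate.
RECORDS = """<corpus>
  <record id="HYP-1">
    <language ident="osc">Oscan</language>
    <rs type="script" ana="#script.greek.oscan-adapted">Greek script</rs>
  </record>
  <record id="HYP-2">
    <language ident="osc">Oscan</language>
    <rs type="script" ana="#script.oscan.native">Oscan script</rs>
  </record>
  <record id="HYP-3">
    <language ident="la">Latin</language>
    <rs type="script" ana="#script.latin">Latin script</rs>
  </record>
</corpus>"""


def select(root, language, script_prefix):
    """IDs of records in `language` whose script code starts with `script_prefix`."""
    hits = []
    for record in root.iter("record"):
        lang = record.find("language")
        script = record.find("rs[@type='script']")
        if lang is None or script is None:
            continue
        same_language = lang.get("ident") == language
        same_script = (script.get("ana") or "").startswith(script_prefix)
        if same_language and same_script:
            hits.append(record.get("id"))
    return hits


root = ET.fromstring(RECORDS)
print(select(root, "osc", "#script.greek"))  # Oscan in Greek script
print(select(root, "osc", "#script.oscan"))  # Oscan in Oscan script
```

Because language and script live in separate attributes, either can be varied without touching the other.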

Real example · how fully-marked-up Greek looks at scale

The principle Murano et al. apply to fragmentary Italian languages applies equally to canonical Greek and Latin. Here is a fragment from A.30.xml (Anastasius I's imperial edict at Apollonia, c. 491–518 CE) in the mounted IRCyr corpus — every word marked with <w lemma="...">, every name marked with <persName> + <name> + nymRef, every abbreviation marked with <expan> + <abbr> + <ex>, every lost letter marked with <supplied reason="lost">:

<lb n="1"/><persName type="emperor" key="anastasius">
  <supplied reason="lost">
    <w lemma="Αὐτοκράτωρ">Αὐτοκράτωρ</w>
    <name nymRef="Καῖσαρ">Καῖϲαρ</name>
    <name nymRef="Φλάβιος"><expan><abbr>Φλ</abbr><ex>άβιοϲ</ex></expan></name>
    <name nymRef="Ἀναστάσιος">Ἀναϲτάϲιοϲ</name>
    <w lemma="νικητής">νικητὴϲ</w>
    <w lemma="εὐσεβής">Εὐϲεβὴϲ</w>
    <w lemma="εὐτυχής">Εὐτυχὴϲ</w>
  </supplied>
</persName>

Counting only Greek-script <w lemma=> elements across the four EpiDoc inscription corpora, the demonstrator's federation lemma resource federation_lemmas_full.json records 41,411 attestations. That is the linguistic substrate ItAnt-style customisations enable at federation scale. Each token-lemma pair preserves: surface form, normalised surface, lemma, normalised lemma, line number, file, corpus, language, provenance, licence. No print edition could carry this density.
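
The kind of aggregation this enables can be sketched in a few lines of Python. The record keys used here (`corpus`, `language`, `surface`, `lemma`) are assumptions about the JSON layout, not a documented schema, and the three sample records are invented; inspect federation_lemmas_full.json before adapting the key names.

```python
from collections import Counter

# Invented sample records; the key names are hypothetical.
attestations = [
    {"corpus": "ircyr", "language": "grc", "surface": "νικητὴϲ", "lemma": "νικητής"},
    {"corpus": "ircyr", "language": "grc", "surface": "Εὐϲεβὴϲ", "lemma": "εὐσεβής"},
    {"corpus": "iaph", "language": "grc", "surface": "Εὐϲεβὴϲ", "lemma": "εὐσεβής"},
]

def count_by(records, key):
    """Count attestations grouped by one field."""
    return Counter(r[key] for r in records)

by_corpus = count_by(attestations, "corpus")
by_lemma = count_by(attestations, "lemma")
print(by_corpus.most_common())
print(by_lemma.most_common(1))
```

With the real file, the list would come from `json.load`, adjusted to whatever top-level structure the file actually has.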

Connection to the demonstrator The file A.30.xml referenced here lives at /Users/chingyuanwu/Documents/epidoc/isicily/m0-demonstrator/data/xml/ircyr/A.30.xml (mounted) and at /Users/chingyuanwu/Documents/epidoc/kcl_tei/cyrenaica/A.30.xml (also mounted). Try opening either — count the <w lemma=> elements. There are 177 of them in this one edict, every Greek word individually preserved with its dictionary form.
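
One way to do that count without reading the file by eye, sketched with Python's standard library. It assumes the file declares the usual TEI namespace (`http://www.tei-c.org/ns/1.0`); the sample fragment is a stand-in, not the real A.30.xml.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

def count_lemmatised_words(xml_text):
    """Count <w> elements in the TEI namespace that carry a lemma attribute."""
    root = ET.fromstring(xml_text)
    return sum(1 for w in root.iter(TEI_NS + "w") if w.get("lemma") is not None)

# Stand-in fragment: two lemmatised words and one without a lemma.
SAMPLE = """<ab xmlns="http://www.tei-c.org/ns/1.0">
  <w lemma="νικητής">νικητὴϲ</w>
  <w lemma="εὐσεβής">Εὐϲεβὴϲ</w>
  <w>unlemmatised</w>
</ab>"""

sample_count = count_lemmatised_words(SAMPLE)
print(sample_count)
```

For a file on disk, `ET.parse(path).getroot()` can replace `ET.fromstring`, and the same loop applies.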
§4.4 · CIDOC CRM integration · slide 5 / 5

Beyond EpiDoc: connecting to the cultural-heritage semantic web 超越 EpiDoc:接入文化遗产语义网

ItAnt does not stop at extending EpiDoc. Murano et al. also encode the inscriptions against the CIDOC CRM conceptual reference model (the ISO 21127 ontology used across European cultural-heritage institutions) and its extensions CRMtex (for ancient texts) and CRMinf (for scholarly arguments and inferences). Bibliographic metadata is encoded in FRBRoo / LRMoo.

  • EpiDoc XML: text + editorial
  • CIDOC CRM: cultural-heritage ontology
  • CRMtex / CRMinf: text + argumentation
  • LRMoo: bibliographic

The result is a record that is simultaneously: an EpiDoc inscription, a CIDOC-CRM-typed cultural object, a CRMtex-typed text-bearing artefact, a CRMinf-typed scholarly argument, and an LRMoo-typed bibliographic citation. Each lens supports different queries. The data exists once; the queries — and the interoperability with non-epigraphic cultural-heritage databases — multiply.

Take-away for Week 12 · 本周要点

Murano et al.'s paper shows that EpiDoc is not a monolithic standard you accept or reject. It is a foundation that extends to your specific research needs. The lesson for the student is twofold: first, every customisation must be principled (use <rs> with controlled vocabularies, not invent new elements); second, every customisation should serve preservation — adding categorical distinctions the standard does not yet carry, never collapsing distinctions the standard already records.

Even non-canonical materials become first-class digital citizens. Faliscan and Oscan inscriptions, attested in fragments and read with deep uncertainty, can be encoded with the same rigour as a Greek imperial decree from Aphrodisias. The preservation framework is the same. What changes is which categorical distinctions matter for which corpus.
Paper 5 · CLARIN Annual Conference 2023 · pub. 2024 slide 1 / 4

Domain-Specific Languages for Epigraphy:
The Case of ItAnt · 面向铭文的领域专用语言:以 ItAnt 为例

Valeria Quochi · Luca Rigobianco · Federico Boschetti
In Selected papers from the CLARIN Annual Conference 2023, Linköping Electronic Conference Proceedings 210, ed. K. Lindén, T. Kontino & J. Niemi, pp. 191–202. Linköping University Electronic Press, 2024. DOI: 10.3384/ecp210007.

The companion paper to Murano et al. Once you have committed to preserving all the data in EpiDoc XML, the question becomes: how do you make it humane to write? Answer: a Domain-Specific Language that compiles down to EpiDoc.

§5.1 · The ergonomics problem · slide 2 / 4

EpiDoc XML preserves everything — and is painful to write EpiDoc XML 保留一切,也极难手写

The cost of preservation is verbosity. Each <supplied reason="lost"> is twenty-four characters of opening tag for what may be one character of editorial decision. Every opening tag must be matched by a closing tag. Every attribute name and value must be written in full. The ratio of information content to structural scaffolding is low, and human readability decreases rapidly as complexity increases.
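
The arithmetic is easy to check. A small Python sketch (the restored letter is an arbitrary example):

```python
# One restored letter, wrapped the way EpiDoc marks a restoration.
opening_tag = '<supplied reason="lost">'
closing_tag = "</supplied>"
content = "α"  # a single restored letter (arbitrary example)

markup_chars = len(opening_tag) + len(closing_tag)
ratio = markup_chars / len(content)
print(len(opening_tag), markup_chars, ratio)
```

Counting the closing tag as well, the scaffolding for this one letter comes to 35 characters.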

In ItAnt, the data density is high: linguistic, philological, and prosopographical information overlaps. A lacuna spans the end of token 3 and the start of token 4. A named entity (praenomen partially conjectured, gentilicium, patronymic) spans tokens 4 through 6. These overlapping hierarchies are the well-known weakness of XML, addressed by stand-off markup or alternative representations — but both add complexity for the human encoder.
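
The overlap problem can be made concrete with a stand-off sketch in Python: annotations are stored as character ranges over the text rather than as nested tags, so a lacuna can cross a token boundary without breaking well-formedness. All offsets below are hypothetical, chosen only to mirror the situation just described.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    label: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset

def crosses(a, b):
    """True when two spans overlap without one containing the other."""
    overlap = a.start < b.end and b.start < a.end
    nested = (a.start <= b.start and b.end <= a.end) or (
        b.start <= a.start and a.end <= b.end
    )
    return overlap and not nested

# Hypothetical offsets: token 3 = [10, 15), token 4 = [16, 21), tokens 4-6 = [16, 34).
token_3 = Span("token 3", 10, 15)
token_4 = Span("token 4", 16, 21)
lacuna = Span("lacuna", 13, 18)        # end of token 3 into start of token 4
person = Span("named entity", 16, 34)  # tokens 4 through 6

pairs = [(lacuna, token_3), (lacuna, token_4), (lacuna, person), (person, token_4)]
conflicts = [(a.label, b.label) for a, b in pairs if crosses(a, b)]
print(conflicts)
```

Every pair reported in `conflicts` is one that a single XML tree cannot nest directly. The pair (named entity, token 4) is absent because the token sits wholly inside the name.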

Side-by-side · DSL vs raw EpiDoc · illustrative excerpt of a Faliscan inscription
ItAntDSL · 紧凑 · ~3× shorter
title: Faliscan funerary inscription
id: ItAnt_Fal_6
language: xfa
script: faliscan
date: -300 / -250
edition:
  l_1: [titoi]·tutei
  l_2: ka.lleiui
word l_1_w_1: [titoi]
  name: praenomen
word l_1_w_2: tutei
  name: gentilicium
  ref: #person_tut_001
EpiDoc XML · same content · much longer
<TEI><teiHeader>
  <fileDesc><titleStmt><title>Faliscan funerary inscription</title></titleStmt> ...
  <langUsage><language ident="xfa">Faliscan</language></langUsage> ...
</teiHeader><text><body>
  <div type="edition" xml:lang="xfa"><ab>
    <lb n="1"/>
    <name type="praenomen">
      <supplied reason="lost">titoi</supplied>
    </name>·<name type="gentilicium" ref="#person_tut_001">tutei</name>
    <lb n="2"/>ka.lleiui
  </ab></div>
</body></text></TEI>

Boschetti et al. test this with a real philologist (linguistics graduate, epigraphic competence, basic DH skills only). She encodes five Faliscan inscriptions twice — once in raw EpiDoc, once in ItAntDSL. The results are qualitative but consistent: training time on the DSL is significantly shorter; encoding time is markedly lower; DSL documents are approximately three times more compact than the EpiDoc counterparts.
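
To give a feel for the compact-surface-to-XML idea, here is a deliberately tiny Python sketch. It is not ItAntDSL's grammar and does not use ANTLR; it is a regex stand-in that handles a single convention (square brackets for lost text), and the function name `line_to_epidoc` is invented for this example.

```python
import re
from xml.sax.saxutils import escape

def line_to_epidoc(n, text):
    """Turn one compact line into an EpiDoc fragment.

    Only one convention is handled:
    [abc] becomes <supplied reason="lost">abc</supplied>.
    """
    body = re.sub(
        r"\[([^\[\]]+)\]",
        lambda m: '<supplied reason="lost">' + m.group(1) + "</supplied>",
        escape(text),  # escape &, <, > before adding markup
    )
    return f'<lb n="{n}"/>{body}'

source = "[titoi]·tutei"
fragment = line_to_epidoc(1, source)
print(fragment)
print(len(source), len(fragment))
```

Even in this toy, the encoder types 13 characters and the tool writes the tags. A real grammar has to cover names, abbreviations, uncertainty, and metadata as well.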

Concrete · how big is one EpiDoc file, really?

Look at the actual ISic000118.xml in the mounted demonstrator. This is one funerary inscription. Four lines of Latin on the stone. What is preserved in XML, by the numbers:

Measurement | Value | What it tells us
The inscription itself | 4 lines of Latin (~50 words) | What goes on the printed page
EpiDoc XML file size | 17,073 characters | ~400× the size of the bare text
Lines of XML source | 275 lines | ~70× the line count of the inscription
Named editors in respStmt | 6 | Provenance preserved per contribution
Change entries in revisionDesc | 11 | Editorial history preserved
External LOD URIs (Pleiades, EAGLE, EDR, TM, DOI, ORCID) | 12+ | Linked into the international graph

This is the cost of preservation. 50 words of inscribed Latin become 17,000 characters of structured XML. Boschetti et al.'s argument is not that this cost is wrong — it is that the cost should be paid by a tool (the DSL parser) and not by the scholar (writing every tag by hand). The XML stays. The friction goes.
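
Measurements of this kind can be reproduced with a short Python sketch. It assumes the TEI namespace and counts `<respStmt>` elements as a rough proxy for named editors; the sample document is a toy, not ISic000118.xml, and is well-formed rather than schema-valid.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

def measure(xml_text):
    """Return simple size and provenance counts for one EpiDoc file."""
    root = ET.fromstring(xml_text)
    return {
        "characters": len(xml_text),
        "lines": len(xml_text.splitlines()),
        "change_entries": sum(1 for _ in root.iter(TEI + "change")),
        "resp_statements": sum(1 for _ in root.iter(TEI + "respStmt")),
    }

# Toy document for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader>
<fileDesc><titleStmt><title>Toy record</title>
<respStmt><name>Editor A</name><resp>edition</resp></respStmt>
</titleStmt></fileDesc>
<revisionDesc>
<change when="2020-01-01">created</change>
<change when="2021-01-01">revised</change>
</revisionDesc>
</teiHeader>
</TEI>"""

stats = measure(SAMPLE)
print(stats)
```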
§5.2 · The pipeline · slide 3 / 4

How the DSL becomes EpiDoc XML DSL 如何转译为 EpiDoc XML

ItAntDSL text (.dsl file, human-written)
→ ANTLR parser (context-free grammar)
→ XML-ItAnt (intermediate, proprietary schema)
→ XQuery + XSLT (expansion + merging with YAML)
→ EpiDoc XML (conformant output)

Key design choices that make this work:

  • ANTLR for parsing. The DSL has a formal context-free grammar (available on GitHub at CoPhi/itantdsl). The parser produces an Abstract Syntax Tree from which an intermediate XML representation is generated.
  • Six design dimensions optimised (Zenzaro et al. 2022): familiarity, transparency, completeness, compactness, consistency, actionability. The DSL wins on familiarity, transparency, and compactness; XML wins on actionability; both target completeness and consistency.
  • Shared metadata in YAML. Information repeated across every inscription (language definitions, script definitions, vocabulary aliases) lives once in a YAML look-up file. The DSL references it by key. The XSLT expansion merges it in. Each inscription stays compact.
  • EpiDoc is the canonical artefact. The DSL is the editorial surface. The XML is the data that ships. ItAnt does not invent an alternative standard — the international ecosystem still receives standard EpiDoc.
Important caveat The DSL is a workflow tool, not a data format. The data that ships to repositories, gets cited, gets re-used by other tools, is the EpiDoc XML. The DSL exists to make writing that XML humane. If a researcher in 2050 wants to read an ItAnt inscription, they will find it in EpiDoc XML — the same format as I.Sicily, IRT, IRCyr, iAph — interoperable with everything else in the field.
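
The shared-metadata idea from the design choices above can be sketched as a look-up-and-expand step. This is a Python stand-in with plain dictionaries, not the project's XQuery/XSLT code; the table contents and the `expand` function are invented for illustration, and in the ItAnt pipeline the look-up role is played by a YAML file.

```python
# Shared look-up table (illustrative keys and values).
SHARED = {
    "languages": {"xfa": "Faliscan", "osc": "Oscan"},
    "scripts": {"faliscan": "Faliscan alphabet"},
}

def expand(record, shared):
    """Return a copy of the compact record with look-up keys resolved to full entries."""
    full = dict(record)
    full["language"] = {
        "ident": record["language"],
        "label": shared["languages"][record["language"]],
    }
    full["script"] = {
        "key": record["script"],
        "label": shared["scripts"][record["script"]],
    }
    return full

compact = {"id": "ItAnt_Fal_6", "language": "xfa", "script": "faliscan"}
expanded = expand(compact, SHARED)
print(expanded)
```

The compact record stays short because it carries only keys. The full labels are written once in the shared table and merged in at build time.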
§5.3 · The lesson · slide 4 / 4

Ergonomics serves preservation — does not bypass it 人机工效服务于保留,而非绕开保留

The temptation, when faced with EpiDoc's verbosity, is to say "I'll just use a flat text and skip the XML." That is the move Boschetti et al. explicitly reject. The DSL is faster and more compact and easier to learn, but it produces the same full EpiDoc XML the field expects. The structure is preserved end-to-end.

Three layers of value, only the bottom layer ships. Editorial surface (DSL) for the writing scholar. Intermediate XML for the build pipeline. EpiDoc XML for the world. The first two are local conveniences; the third is the durable, citable, machine-readable record that the field, the LLMs, and the future inherit.

Connection to your Week 12 hands-on · 与本周实践的关系

When you first try to write inscription data in EpiDoc, you will feel the friction Boschetti et al. document. That is normal. The friction is a sign that you are being asked to articulate distinctions that flat-text writing simply did not require you to make. The friction is the cost of preservation.

For your first attempts, write the EpiDoc by hand. Feel where it pinches. Notice which decisions feel laborious (line breaks? abbreviations? names?). Then — and only then — would it be sensible to look for or build a DSL or template editor for your specific corpus. The DSL is a useful tool, but the EpiDoc literacy is the foundational skill, and the friction of writing it raw is part of how you build that literacy.

Tools to know about
  • EpiDoc Front-End Services (EFES): ready-made stylesheets to render your EpiDoc as a publication.
  • Papyrological Editor (papyri.info): the canonical community-curation workflow for EpiDoc, with versioning and peer review.
  • ItAntDSL: domain-specific editor for fragmentary Italian languages, deposited at hdl.handle.net/20.500.11752/ILC-1003.
  • Leiden ⇄ EpiDoc playground: shipped in this workshop bundle as a sibling file to this slide deck; a workshop-grade DSL surface for trying out the conversion live.
Paper 6 · XVIe Congrès International d'Épigraphie · 2024 slide 1 / 5

Open Scholarship:
Epigraphic Corpora in the Digital Age · 开放学术:数字时代的铭文语料库

John Bodel · Jonathan Prag · Charlotte Roueché
In P. Fröhlich & M. Navarro Caballero (eds.), L'épigraphie au XXIe siècle. Actes du XVIe Congrès International d'Épigraphie Grecque et Latine (Bordeaux, 29 Aug.–2 Sep. 2022). Bordeaux: Ausonius, 2024, pp. 91–117.

The synthesis paper. Three of the field's most senior figures, looking back across the corpora tradition from Boeckh and Mommsen to FAIR Epigraphy, argue that the largest obstacle facing digital epigraphy is no longer technical: it is cultural.

§6.1 · The 200-year arc · slide 2 / 5

From CIG (1815) to FAIR Epigraphy (2022): why corpora exist 从 CIG (1815) 到 FAIR 铭文学 (2022):语料库为何存在

Boeckh's 1815 proposal to the Berlin Academy already had the core elements: comprehensive geographic coverage, autopsy-based editions, indices of names. What he could not have anticipated: that the 19th-century print model would, two centuries later, encounter limits that digital methods now have to address.

Boeckh's three lessons
(i) long-term funding and collaboration are essential; (ii) the work always takes longer than anticipated; (iii) any corpus, however professionally executed, will inevitably require revision and updating.
Klaffenbach 1953
"Users of a corpus are looking for: information on the inscribed object, findspot and current location, a reliable text, a date, and the minimum information necessary to understand the inscription." None of this has changed.
What has changed
The medium. Klaffenbach assumed folio volumes housed in major research libraries. We can no longer assume that the medium has to constrain what an edition is.

Bodel/Prag/Roueché's claim: digital methods, in particular EpiDoc TEI for encoding and Linked Open Data for connection, enable epigraphic corpora to capture far more than Leiden or its codex-bound print conventions could ever carry. The question is no longer whether digital is possible — it is whether the community is willing to adapt its practice.

Bodel/Prag/Roueché's central claim "The largest obstacle still facing us as a discipline today is not technical but cultural: our academic culture has not caught up to advancements in technology that make possible the dissemination and sharing of information with unprecedented ease."
§6.2 · The citation problem · slide 3 / 5

Why "CIL I² 1221 = VI 9499 = ILS 7472 = CLE 959 = ILLRP 793" is wrong 为何"等号串联多版编号"的引用方式是错误的

One of the paper's sharpest critiques targets a practice almost every epigraphist has engaged in: citing an inscription by stringing together several corpus numbers separated by equals signs, as if they were identifiers for the same text rather than separate editions by separate scholars at separate times.

Louis Robert's principle (1954) "Quand l'éditeur n'est pas l'auteur de la copie, n'a pas vu la pierre ou l'estampage... il faut indiquer avec soin de qui est la copie; c'est capital pour la critique."
("When the editor is not the author of the transcription, has not seen the stone or the squeeze... one must indicate with care whose copy it is; this is essential for criticism.")

The citation chain conceals authorship. It treats edition as if it were identifier. And in the digital age this practice is, if anything, more dangerous. The EDCS entry might be a copy of an existing edition, or it might be modified — without a date, without a named editor, without a documented method. The PHI text might or might not reflect a particular published edition. The user has no way to know.

What structured preservation enables instead

Resource | Citation level | What it tells you
CIL I² 1221 | print corpus, inscription no. 1221, vol. I, 2nd edition (1918) | Mommsen autopsy, recorded "descripsi"
EDR167214 (Butini 2022) | specific digital edition, attributed, dated | Author, revision date, source-edition explicit; checked against image
TM574526 | abstract identifier, not an edition | Inscription-in-the-abstract pointer; resolves to a list of all known editions
EDCS-19200211 | opaque | Some text, some metadata, no attribution, no date, no method

The contrast is stark. EDR's edition is structured, attributed, and citable to a particular scholar at a particular time, because the EpiDoc XML preserves those things. EDCS's entry — even though it exists in the same digital medium — is opaque because the underlying data structure did not preserve provenance. The opacity is not because the medium is digital; it is because the data was not structured to preserve authorship and method.
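
The difference between an attributable edition and an opaque row can be expressed as a data structure. This Python sketch uses the identifiers from the table above; the `Citation` class and its `is_attributable` rule are invented for illustration and are not part of any database's API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Citation:
    identifier: str
    editor: Optional[str] = None
    year: Optional[int] = None
    is_edition: bool = True  # False for abstract identifiers such as TM numbers

    def is_attributable(self):
        """An edition is attributable only when both editor and date are recorded."""
        return self.is_edition and self.editor is not None and self.year is not None

citations = [
    Citation("EDR167214", editor="Butini", year=2022),
    Citation("TM574526", is_edition=False),
    Citation("EDCS-19200211"),
]
attributable = [c.identifier for c in citations if c.is_attributable()]
print(attributable)
```

The check only works because the structure has slots for editor and date. A flat row with no such fields cannot fail the test in an informative way; it simply has nothing to report.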

§6.3 · The IRT case study · slide 4 / 5

From IRT (1952) → IRT2009 → IRT2021 — preservation enabled iteration 从 IRT 印本 (1952) 到数字版 IRT2021,保留促成迭代

Joyce Reynolds and J.B. Ward-Perkins published The Inscriptions of Roman Tripolitania in 1952, post-war, with limited photography (38 plates for 1000 texts). Louis Robert promptly noted the limitations. The volume served the field for half a century.

In 2009 the British School at Rome, ISAW (NYU), and KCL collaborated to publish IRT2009 online — the 1952 text re-encoded in EpiDoc XML with images, translations, and structured metadata. Because the encoding preserved structure, when scholars wanted updates, those updates could be applied surgically. Joyce Reynolds provided English translations. Place identifications got Pleiades URIs. Editorial decisions stayed traceable.

In 2021, the same data was enriched: every Greek or Latin inscription from Tripolitania published since 1952 was added. The result, IRT2021, took less than 12 months to produce. Standard EpiDoc encoding meant the existing data did not have to be re-keyed. New inscriptions slotted in. Indices auto-regenerated. Wikidata links to LGPN (Lexicon of Greek Personal Names) and PIR (Prosopographia Imperii Romani) propagated.

IRT (1952): print volume, 1000 texts, 38 plates
→ IRT2009: EpiDoc XML + images + translations
→ IRT2021: + post-1952 finds + Wikidata links + LGPN / PIR
→ FAIR-Epigraphy: RDF serialisation, aggregator pilot
The principle illustrated Because the 2009 edition preserved the data — material, place, date, hand, editorial chain, line-and-word position, every editorial restoration as a categorical attribute — the 2021 enrichment was cheap. Adding Pleiades URIs did not require re-editing the inscriptions; it required querying which place names had Pleiades equivalents and injecting the URIs into the existing structure. Adding LGPN identifiers worked the same way. The data was already structured to receive them.
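
A minimal sketch of that kind of injection, using Python's standard library. It assumes place names sit in TEI `<placeName>` elements; the look-up table is hypothetical and the URI in it is a placeholder, not a real Pleiades identifier.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)  # keep TEI as the default namespace on output

# Hypothetical gazetteer look-up; the URI is a placeholder.
PLEIADES = {"Lepcis Magna": "https://pleiades.stoa.org/places/000000"}

def add_place_refs(xml_text, lookup):
    """Add a ref attribute to every <placeName> whose text has a look-up match."""
    root = ET.fromstring(xml_text)
    added = 0
    for place in root.iter(f"{{{TEI_NS}}}placeName"):
        name = (place.text or "").strip()
        if name in lookup and place.get("ref") is None:
            place.set("ref", lookup[name])
            added += 1
    return ET.tostring(root, encoding="unicode"), added

SAMPLE = (
    '<origPlace xmlns="http://www.tei-c.org/ns/1.0">'
    "<placeName>Lepcis Magna</placeName></origPlace>"
)
enriched, n_added = add_place_refs(SAMPLE, PLEIADES)
print(n_added)
print(enriched)
```

The edition text is untouched; only an attribute is added to an element that was already there.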

This is what Cayless's 2009 vision of "true datasets able to be queried, mined, and transformed" looks like in practice, fifteen years later. The corpus is no longer a static publication. It is a living, citable, versionable dataset, and the cost of updating it is bounded by the cost of writing new editions — not by the cost of resetting the entire print.

§6.4 · FAIR + LOD + community · slide 5 / 5

The future of the corpus: dynamic, federated, FAIR 语料库的未来:动态的、联邦的、FAIR 的

Bodel/Prag/Roueché's closing proposal is an inversion of the existing corpus model: instead of a single monumental corpus (CIG, CIL, IG) covering everything, build directly on the proliferation of local and thematic corpora (IRT, IRCyr, I.Sicily, USEP, PETRAE, etc.) and connect them via standards.

The principles · FAIR data

  • Findable: unique persistent identifier (DOI, TM, URI), rich metadata in a searchable resource
  • Accessible: retrievable by its identifier, using open/free protocols, with metadata accessible even when the data is restricted
  • Interoperable: shared formal vocabularies (EAGLE, FAIR Epigraphic Vocabularies, Getty AAT), language-aware, machine-actionable
  • Reusable: clear licensing, accurate provenance, community standards, rich documentation

The pilot · inscriptiones.org

The FAIR Epigraphy project (AHRC + DFG, dir. Horster & Prag) has demonstrated the pattern. A software engineer with no epigraphic background, working for a few hours, transformed XML files from Roman Inscriptions of Britain (RIB), Inscriptions of Greek Cyrenaica (IGCyr), and I.Sicily into RDF using a community-built ontology. The output is queryable at https://inscriptiones.org/.

Three corpora, one query surface. Because each corpus preserves its data as structured EpiDoc XML aligned with shared vocabularies, federation is cheap. New corpora that follow the standards can be added without re-editing. This is what "preserve, don't discard" enables at the field level — not just per-record.

What this looks like in the mounted demonstrator

The same demonstrator that hosts these slides ships its own Pelagios TTL serialisation at /data/json/federation_pelagios.ttl, exporting all 56 records as oa:Annotation objects linked to pleiades: places. This is the actual output format Bodel/Prag/Roueché describe — a real RDF file, openly downloadable, sitting at federation scale:

@prefix oa: <http://www.w3.org/ns/oa#> .
@prefix pelagios: <http://pelagios.github.io/vocab/terms#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix lawd: <http://lawd.info/ontology/> .

# Example annotation from the federation TTL
<https://papyri.info/ddbdp/p.abinn.1> a pelagios:AnnotatedThing ;
    dcterms:title "p.abinn.1" ;
    dcterms:language "grc" ;
    dcterms:temporal "0340/0350" ;
    dcterms:rights <https://creativecommons.org/licenses/by/3.0/> ;
    dcterms:subject "ddbdp" ;
    dcterms:identifier "10014" .

<https://papyri.info/ddbdp/p.abinn.1/annotation/place> a oa:Annotation ;
    oa:hasTarget <https://papyri.info/ddbdp/p.abinn.1> ;
    oa:hasBody <https://pleiades.stoa.org/places/737040> ;
    oa:motivatedBy oa:linking .

Each record becomes a small set of triples in the Pelagios graph. Each place link makes the record visible to Peripleo and any other Pelagios consumer. The full federation TTL is 336 triples across 56 records: a small file, but a working contribution to the international LOD graph that aggregators like inscriptiones.org and the Trismegistos cross-reference graph can ingest.
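
How one record turns into Turtle of this shape can be sketched with plain string formatting in Python. This is an illustration, not the demonstrator's exporter: the function name is invented, only a subset of the properties is emitted, the `@prefix` header is left out, and no escaping of quotes inside literals is attempted.

```python
def to_pelagios_ttl(uri, title, language, pleiades_id):
    """Serialise one record as Turtle, following the shape of the excerpt above."""
    return "\n".join([
        f"<{uri}> a pelagios:AnnotatedThing ;",
        f'    dcterms:title "{title}" ;',
        f'    dcterms:language "{language}" .',
        "",
        f"<{uri}/annotation/place> a oa:Annotation ;",
        f"    oa:hasTarget <{uri}> ;",
        f"    oa:hasBody <https://pleiades.stoa.org/places/{pleiades_id}> ;",
        "    oa:motivatedBy oa:linking .",
    ])

ttl = to_pelagios_ttl(
    "https://papyri.info/ddbdp/p.abinn.1", "p.abinn.1", "grc", "737040"
)
print(ttl)
```

For production use, an RDF library that handles escaping and prefix management is the safer route; the sketch only shows how little is needed once the place identifier is already in the structured record.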

Real proof · the international encoding tradition is internally consistent

One of Bodel/Prag/Roueché's underlying assumptions — that structured preservation across corpora is meaningful enough to aggregate — is testable from the same demonstrator data. The cross-encoder consistency analysis (cross_encoder_full.json) measures: across 41,411 hand-attested Greek lemma instances from four EpiDoc inscription corpora, how often do different editorial teams give the same Greek surface form the same lemma?

Result: across 1,042 Greek surface forms attested in two or more corpora by independent editorial teams, 85.5% receive the same normalised lemma (95% CI ± 2.14pp). The 14.5% disagreement is mostly interpretive — same form, two plausible roots — not mechanical.

What this means: the field's editorial-practice cohesion is high enough that pooled benchmarks of Greek lemmatisation are defensible. Aggregation across corpora is methodologically possible. The preserved structured data is consistent enough to support cross-corpus computation — exactly the move Bodel/Prag/Roueché argue should be possible in principle.
85.5% cross-corpus Greek lemma consistency · ±2.14pp (95% CI) · n = 1,042 multi-attested surfaces
Source: cross_encoder_full.json in mounted demonstrator
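
A sketch of the two computations behind that figure, in Python. The agreement function runs on invented toy rows, and its input layout (corpus, normalised surface, normalised lemma) is an assumption rather than the structure of cross_encoder_full.json. The deck does not state which interval method was used; the normal-approximation (Wald) interval shown here is one common choice, and for p = 0.855 and n = 1,042 it gives a half-width that rounds to the reported 2.14 percentage points.

```python
import math
from collections import defaultdict

def consistency(rows):
    """Count multi-corpus surface forms and how many get one lemma everywhere.

    rows: iterable of (corpus, normalised_surface, normalised_lemma) tuples.
    Returns (agreeing_surfaces, multi_corpus_surfaces).
    """
    corpora, lemmas = defaultdict(set), defaultdict(set)
    for corpus, surface, lemma in rows:
        corpora[surface].add(corpus)
        lemmas[surface].add(lemma)
    multi = [s for s in corpora if len(corpora[s]) >= 2]
    agree = sum(1 for s in multi if len(lemmas[s]) == 1)
    return agree, len(multi)

def wald_half_width(p, n, z=1.96):
    """Half-width of the normal-approximation (Wald) confidence interval."""
    return z * math.sqrt(p * (1 - p) / n)

# Invented toy rows: one agreeing form, one disagreeing form, one single-corpus form.
toy = [
    ("ircyr", "ευσεβης", "ευσεβης"), ("iaph", "ευσεβης", "ευσεβης"),
    ("ircyr", "αρχων", "αρχων"), ("iaph", "αρχων", "αρχω"),
    ("irt", "θεος", "θεος"),
]
agree, total = consistency(toy)
half_width_pp = 100 * wald_half_width(0.855, 1042)
print(agree, total, round(half_width_pp, 2))
```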

What the community has to do · 学术共同体应做的事

  • Cite digital editions properly: author, date, version, URL
  • Treat each edition as a scholar's editorial work — not as an opaque database row
  • Agree on community standards for categories (inscription type, material, support) — leaving how the categories are represented free
  • Document the construction of editions: who recorded the metadata, who took the photograph, who compiled the bibliography
  • Use ORCIDs to track contributions in collaborative environments
  • Adopt versioning (Zenodo deposits, GitHub history, semantic versioning of corpus editions)
  • License data openly (CC-BY-4.0 is the field's emerging default)
  • Stop citing inscriptions by equals-sign chains that obscure authorship
  • Stop publishing data in HTML-only interfaces with no bulk export

Take-away for Week 12 · 本周要点

When you write inscription data in EpiDoc for the first time this week, you are participating in the system Bodel/Prag/Roueché describe. Your editorial decisions become part of a federation that — if the community standards hold — can be cited, versioned, aggregated, queried, and re-used by anyone in the field, forever. The cost of preservation is real. The dividend is the world the three authors are arguing for: a research culture in which our data is as open, citable, and re-usable as our arguments have always claimed to be.

Synthesis · all six papers · closing

What the six papers say together 六篇论文的合鸣

Paper | Year | Core argument · 核心论点
Cayless | 2009 | Flat-text encoding of Leiden discards semantic information; XML preserves it as a parse tree that supports the scholarly primitives beyond simple search.
Bodard | 2010 | EpiDoc is the operational form of Cayless's argument: a TEI subset that preserves editorial decisions as structured data, supports multiple outputs from one source, and connects to the wider digital-humanities ecosystem.
Stoyanova & Prag | 2021 | The preservation principle extends beyond text: palaeographic observation, petrographic identification, IIIF image regions, and linguistic annotation all become first-class structured data in the Crossreads framework.
Murano et al. | 2023 | EpiDoc is not Greek-and-Latin-only; with principled customisation (typed <rs> elements, controlled vocabularies, transparent IDs), it serves the fragmentary languages of pre-Roman Italy without breaking interoperability.
Boschetti et al. | 2024 | The ergonomic cost of writing EpiDoc by hand is real; a Domain-Specific Language can serve as a friendlier editorial surface, but the canonical output remains EpiDoc XML: preservation is never bypassed, only made more humane.
Bodel/Prag/Roueché | 2024 | The technical infrastructure for federated, FAIR, open epigraphic corpora exists. What remains is cultural: rigorous citation, community-agreed standards, named authorship, versioning, open licensing.

The thread running through all six · 贯穿六篇的共同线索
Preserve, don't discard. Every editorial judgement an epigraphist makes is data. Structured XML keeps it. Linked Open Data lets it travel. FAIR principles ensure it is found, accessed, integrated, and reused. The print volume's brilliance was in its compression; the digital edition's brilliance is in its expansion. We can have the brilliance of compression in any output we choose to render — but we can only have the brilliance of expansion if the underlying data is preserved. 每一项编辑判断都是数据。结构化的 XML 保留它,关联开放数据让它流通,FAIR 原则确保它可被发现、获取、互通、复用。印本之妙在于其压缩;数字版之妙在于其拓展。压缩的妙处可以在任意输出中重现,而拓展的妙处仅在底层数据被完整保留的前提下才存在。

Now — your hands-on · 现在,动手实践

Try writing inscription data in EpiDoc. Pick a short inscription from the mounted /Users/chingyuanwu/Documents/epidoc/isicily/m0-demonstrator/data/xml/ corpus (perhaps ISic000118, a Latin epitaph, or A.30, the Greek imperial edict from Apollonia in Cyrenaica), open its existing EpiDoc XML, and read it line by line. Then try writing your own short EpiDoc record from scratch for a Greek or Latin text of your choosing. Notice what the standard tagset gives you. Notice what your specific corpus might require beyond the standard. Notice which editorial decisions you are now recording as data that you might previously have left in your head or in a footnote.

The data you write this week is preservation infrastructure. Future you, and future others, will be able to ask it questions you have not yet thought of.

Mounted folders for hands-on work: /Users/chingyuanwu/Documents/epidoc/isicily/ · /Users/chingyuanwu/Documents/epidoc/kcl_tei/ (Aphrodisias, Cyrenaica, Tripolitania) · /Users/chingyuanwu/Documents/epidoc/inscription_databases/. The companion Leiden ⇄ EpiDoc playground is shipped at /Users/chingyuanwu/Documents/epidoc/isicily/m0-demonstrator/leiden-playground.html.

Hands-on roadmap · 实践路线图 · supplement

Concrete artefacts to open today 今日可亲手打开的具体文件

Every claim in this deck was anchored in real files. The mounted /Users/chingyuanwu/Documents/epidoc/ folder hosts all of them. Here is the suggested walkthrough order:

For paper | File · 文件路径 | What to look at
§1 Cayless · §2 Bodard | isicily/m0-demonstrator/data/xml/isicily/ISic000118.xml | Open in any text editor. Find the <revisionDesc> at the top. Count the <change> elements. Find the <material> element. See how it carries an EAGLE URI. This is the level of preservation EpiDoc operationalises.
§2 Bodard · §4 Murano | isicily/m0-demonstrator/data/xml/ircyr/A.30.xml | Find the Anastasius edict text. Count <w lemma=> elements (177 of them). See how each Greek word is preserved with its dictionary form. See <persName type="emperor" key="anastasius"> linking to a person authority.
§2 Bodard · §6 Bodel/Prag/Roueché | isicily/m0-demonstrator/data/xml/iaph/iAph110305.xml | The Aphrodisian prize-list whose bibl chain reaches back to Sherard 1716 via Boeckh's CIG. See <bibl n="CIGII">Boeckh from Sherard, CIG 2758 A-G</bibl>. Three centuries of editorial scholarship preserved in machine-readable form.
§3 Stoyanova & Prag | isicily/m0-demonstrator/data/json/rubric_full.json | Population-scale rubric scoring of 10,249 source files. Look at per_corpus_aggregate → see the seven-axis profiles for each corpus. The "encoding traditions specialise" claim is right there in the numbers.
§3 Stoyanova & Prag · §5 Boschetti | isicily/m0-demonstrator/data/json/federation_lemmas_full.json | 12 MB · 59,641 attestations · 6,285 distinct lemmas across 4 corpora. CC-BY-4.0. This is the federation-scale linguistic substrate. Load it in Python; aggregate by corpus; aggregate by language; aggregate by lemma. Every entry preserves its file, line, surface, normalised surface, and provenance.
§4 Murano · §6 Bodel/Prag/Roueché | isicily/m0-demonstrator/data/json/federation_pelagios.ttl | 336 RDF triples: the Pelagios serialisation of the 56-record federation, every record linked to its Pleiades place URI. Open in a text editor; or load into an RDF tool. The actual LOD output format Bodel/Prag/Roueché advocate.
§5 Boschetti | isicily/m0-demonstrator/leiden-playground.html | The companion editor for trying Leiden+ syntax and seeing the EpiDoc output live. Open in a browser. Try typing the first line of any ISic file. See the XML appear on the right.
§6 Bodel/Prag/Roueché | isicily/m0-demonstrator/data/llm_ready/dataset_card.md | The HuggingFace-style dataset card for the federation lemma resource. FAIR principles operationalised: license: cc-by-4.0, citations, methods, openly redistributable. This is what a FAIR-compliant epigraphic data deposit looks like.
EDCS comparison (Cayless's "flat data") | inscription_databases/EDCS_ETL-master/data/2022_09_allProvinces/ | The 537,286-row SDAM ETL of the Epigraphic Database Clauss-Slaby (Latin Mediterranean). CC-BY-NC-SA-4.0. Look at EDCS_2022_dataset_metadata_SDAM.csv in the parent folder for what the flat-data side of the equation actually contains.
The KCL source corpora directly | kcl_tei/ (aphrodisias / cyrenaica / tripolitania) | The three KCL-edited corpora at their canonical locations. kcl_tei/cyrenaica/A.30.xml is the same Anastasius edict; kcl_tei/aphrodisias/iAph110305.xml is the same Sherard→Boeckh→Roueché chain. Open from either path; the EpiDoc XML is byte-identical.

Suggested 90-minute exercise Open ISic000118.xml. Identify ten elements you have never seen before. For each, open the EpiDoc Guidelines at https://epidoc.stoa.org/gl/latest/ and look up what the element preserves. Then write a 10-line EpiDoc record for a Greek or Latin inscription of your choosing — use the same patterns. See what you can preserve at line, word, and character level. This is the moment the abstraction becomes a skill.
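
A small helper for the exercise, sketched in Python: it confirms that a hand-written record is well-formed XML and lists which elements it uses. It checks well-formedness only, not validity against the EpiDoc schema, and the draft text ("Dis Manibus sacrum" with an invented restoration) is made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def element_inventory(xml_text):
    """Parse the record (raises ParseError if not well-formed) and count element names."""
    root = ET.fromstring(xml_text)
    # Strip the namespace part "{...}" from each tag before counting.
    return Counter(el.tag.split("}")[-1] for el in root.iter())

# Invented two-line draft edition.
DRAFT = """<div xmlns="http://www.tei-c.org/ns/1.0" type="edition" xml:lang="la">
  <ab>
    <lb n="1"/><expan><abbr>D</abbr><ex>is</ex></expan> <expan><abbr>M</abbr><ex>anibus</ex></expan>
    <lb n="2"/><supplied reason="lost">sacr</supplied>um
  </ab>
</div>"""

inventory = element_inventory(DRAFT)
print(sorted(inventory.items()))
```

Running the same function over ISic000118.xml gives a quick list of candidate elements to look up in the Guidelines.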