(I.Sicily + iAph + IRCyr + IRT)
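Every file retains the complete editorial judgement
across 6,285 distinct types
Every word is linked to its context and editor
1,042 multi-attested Greek surfaces
Proof that structured preservation can be aggregated at scale
aggregation-compatible
All FAIR principles satisfied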
The four numbers above are all directly computed from the mounted federation: /Users/chingyuanwu/Documents/epidoc/isicily/m0-demonstrator/data/json/. Every slide that follows refers back to real data, not hypothetical examples.
Epigraphy in 2017
"It is our aim to ensure that such publication is not just driven by considerations of economy or space, but is developed to meet the academic requirements... The history of epigraphy makes it quite clear that such transitions are natural to the discipline." — written in 2009, projecting where the field needs to be by 2017.
The Leiden conventions are typographic, not semantic
Leiden (1931) describes how an inscription should look when printed. Square brackets mean letters were lost from the stone. Subscript dots mean a letter is unclear. Underlining means a previous editor read text the current editor cannot see.
When you put a Leiden text into a database as a text string, the information collapses. Cayless gives the example of underlined text in a flat database field: the underline carries semantic information ("a previous editor read this") but cannot be stored in a plain-text column, so the convention has to be hacked — often with an underscore character, which then conflicts with any other use of underscores.
The XML is more verbose; that is the cost. The benefit is that every editorial decision is now machine-distinguishable: reason="lost" versus reason="abbreviation" versus reason="damage" are three separate categories that a flat [...] collapses into one.
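To make the contrast concrete, here is a minimal sketch of how a few Leiden sigla map onto named elements. It is a toy converter, not a real Leiden parser: the function name `leiden_to_epidoc` is hypothetical, the underscore-for-underline convention is the flat-database hack described above, and only four sigla are handled. The element names and attribute values follow common EpiDoc practice.

```python
import re

def leiden_to_epidoc(text: str) -> str:
    """Convert a small subset of Leiden sigla into EpiDoc-style elements."""
    # _abc_ (underscore hack for underlining): read only by a previous editor.
    text = re.sub(
        r"_([A-Za-z]+)_",
        r'<supplied reason="undefined" evidence="previouseditor">\1</supplied>',
        text,
    )
    # ab(cd): abbreviation "ab" expanded by the editor with "cd".
    text = re.sub(r"(\w+)\((\w+)\)", r"<expan><abbr>\1</abbr><ex>\2</ex></expan>", text)
    # [abc]: letters lost from the stone and restored by the editor.
    text = re.sub(r"\[(\w+)\]", r'<supplied reason="lost">\1</supplied>', text)
    # A letter followed by U+0323 (combining dot below): unclear letter.
    text = re.sub("(\\w)\u0323", r"<unclear>\1</unclear>", text)
    return text

print(leiden_to_epidoc("D(is) M(anibus) [sacrum]"))
```

After conversion, each siglum has its own element name, so a query can ask for restorations without also matching abbreviations or unclear letters.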
What the Leiden sigla mean, and how XML captures the meaning
Real data · what Sicily's two libraries actually look like
Use the mounted folder /inscription_databases/EDCS_ETL-master/ (the SDAM ETL of EDCS) and the mounted I.Sicily corpus. Sicily's two "libraries" are both there. Compare what each preserves about the same province:
| Records | EDCS category | Gloss |
|---|---|---|
| 1,417 | tituli sepulcrales | (funerary) |
| 1,327 | viri | (records mentioning men) |
| 730 | tituli fabricationis | (maker's marks) |
| 607 | sigilla impressa | (stamps) |
| 482 | tituli possessionis | (ownership) |
| 365 | mulieres | (records mentioning women) |
| 327 | tria nomina | (three-part Roman names) |
| 281 | tituli sacri | (sacred) |
/m0-demonstrator/data/xml/isicily/ISic000118.xml:
| Count | What the file preserves |
|---|---|
| 17,073 | characters of EpiDoc XML |
| 275 | lines of source |
| 4 | line-of-text <lb> elements |
| 6 | named editors in <respStmt> |
| 11 | change entries in <revisionDesc> |
| 1 | conglomerate-block material with EAGLE URI |
| 1 | Pleiades place reference (Thermae Himeraeae) |
| 1 | geo-coordinate (37.98365, 13.69555) |
| 5 | identifier crosswalks (TM, EDR, EDCS, URI, DOI) |
Both are useful. EDCS Sicilia is the only place to learn that funerary inscriptions are 23% of the province; ISic000118 is the only place to learn what one specific funerary inscription says, in what state, on what stone, edited by whom, on what dates. Neither replaces the other; both have to be preserved. EDCS preserves pattern at the cost of individuality. EpiDoc preserves individuality at the cost of pattern-readability. Cayless's argument is that the two registers are differently usable, and that the encoded register has been systematically undervalued because its preserved structure is less visible at first glance.
What XML enables that flat databases cannot
Cayless frames the discussion around John Unsworth's seven "scholarly primitives": discovery, annotation, comparing, referring, sampling, illustrating, representing. Database-driven sites do discovery well — full-text search — but the other six suffer when sigla and structure are flattened away.
| Scholarly primitive | What flat text supports | What XML adds |
|---|---|---|
| Discovery · search | full-text matching | type-aware queries: "find liberti only when abbreviated", "every text dated after 200 CE", "every restoration marked cert="low"" |
| Annotation | marginal notes lost on copy | standoff annotation persists with the data, citable to line and word |
| Comparing | side-by-side reading | computed diff: lemma-overlap, formula-overlap, prosopographical overlap, all across thousands of texts |
| Referring · citation | page number, corpus abbreviation | stable URI / DOI / Trismegistos number resolving to the specific edition |
| Sampling | flip through | filtered subsets: "all Christian-period funerary stones from Tripolitania", "all stelai whose dating chain mentions an emperor" |
| Illustrating | plates and figures | IIIF image regions linked to text segments, queryable |
| Representing | the edition as a printed page | the edition as a structured object — multiple parallel views (diplomatic, normalised, translated, lemmatised) from one source |
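The "type-aware queries" in the Discovery row can be sketched with Python's standard library. This is an illustration under stated assumptions: the sample fragment is invented (it is not taken from any corpus file), and the function names are hypothetical. It shows a search for an abbreviated form that ignores the same word written out in full, and a search for restorations marked cert="low".

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Invented fragment for illustration only.
SAMPLE = """<ab xmlns="http://www.tei-c.org/ns/1.0">
  <lb n="1"/><expan><abbr>lib</abbr><ex>ertus</ex></expan>
  <supplied reason="lost" cert="low">fecit</supplied>
  <lb n="2"/><supplied reason="lost">sibi</supplied> libertus
</ab>"""

def low_certainty_restorations(root):
    """Text of every <supplied> whose cert attribute is 'low'."""
    return [el.text for el in root.iter(f"{TEI}supplied") if el.get("cert") == "low"]

def abbreviated_forms(root, prefix):
    """Expanded forms of abbreviations whose written part starts with prefix."""
    hits = []
    for expan in root.iter(f"{TEI}expan"):
        abbr = expan.find(f"{TEI}abbr")
        ex = expan.find(f"{TEI}ex")
        written = abbr.text or "" if abbr is not None else ""
        added = ex.text or "" if ex is not None else ""
        if written.startswith(prefix):
            hits.append(written + added)
    return hits

root = ET.fromstring(SAMPLE)
print(low_certainty_restorations(root))
print(abbreviated_forms(root, "lib"))
```

A full-text search for "libertus" would return both lines of the sample; the structured query returns only the abbreviated occurrence.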
Inscriptions as complex digital packages, not spreadsheet rows
Cayless's closing argument: an inscription is "a text situated in a complex environment" — it has history, support, find-context, palaeography, language, scholarly genealogy. Treating it as a row in a spreadsheet is treating the rich object as the impoverished proxy.
The 2017 vision: an epigrapher will compose a local corpus by drawing from multiple online repositories, load it into research tools, analyse it, then publish the dataset alongside the article. "None of this will be possible unless information is published in such a way that it is not concealed behind an interface, but is in addition retrievable in bulk."
2025 reality: I.Sicily, IRT2021, IRCyr2020 ship XML on GitHub under CC-BY-4.0; Trismegistos provides cross-corpus identifiers; EpiDoc is the de facto standard; the federation lemma resource of the very demonstrator that hosts this slide deck aggregates 59,641 hand-attested lemmas across four corpora. The vision was largely achieved — slightly later than 2017.
Take-away for Week 12
The student writing inscription data in EpiDoc is not encoding for output's sake. They are preserving editorial judgement as machine-actionable data. Every <supplied reason="lost"> they write is one more record in a dataset that future students, future tools, and future LLMs can use to ask questions that have not yet been asked.
EpiDoc: Epigraphic Documents in XML for Publication and Interchange
The canonical introduction to EpiDoc from one of its architects. "EpiDoc specifies the use of XML, Extensible Markup Language, an industry standard maintained and documented by the World Wide Web Consortium for communication and storage of structured data."
EpiDoc = TEI subset for inscriptions, optimised for interchange
EpiDoc is not a new schema. It is a specialised set of guidelines for using the Text Encoding Initiative (TEI) XML — the same XML standard used in the literary, linguistic, and manuscript communities. Building on TEI means maximum compatibility with the wider digital-humanities ecosystem.
EAGLE-conformant EpiDoc: enough metadata to convert into EDH / EDR / EDB.
The architectural separation matters: the data (what the inscription is, what was lost, what was restored, who edited it, when, on what authority) is separate from the view (HTML for the web, PDF for print, audio for the visually impaired, a database row for aggregation). One XML source supports many views; flat-text publication couples the two and discards everything that wasn't needed for the printed page.
One inscription, three ways to read it
Bodard's worked example is the Aphrodisias text ALA 2 — a fragmentary statue base honouring Salonina (Julia Cornelia Salonina Augusta). The text below is the one published in Aphrodisias in Late Antiquity (Roueché 1989, ALA 2).
The XML is verbose; but every decision now has a name. <lb type="worddiv"/> says a word is broken across the line-end. <unclear> says this letter can be read but with some doubt. <supplied reason="lost"> says I, the editor, supplied this restoration because the surface was lost. <g type="scroll"/> says there is a non-textual glyph (a scroll mark) at this location on the stone. The flat [...] said only "there is something here you cannot see."
The same data, additional layers
Once a word is in the XML, additional information can be attached non-destructively. Two examples from Bodard's discussion:
Real editorial chain · ISic000118 over 9 years
Bodard argues that EpiDoc captures information no print edition could carry: the full editorial history of a record, with named contributors and date stamps. Here is the actual <revisionDesc> from ISic000118.xml in the mounted demonstrator — a single funerary epitaph whose digital edition has been touched 11 times by 6 named editors over 9 years:
Each who="#JCu" reference resolves to a <respStmt> earlier in the file. Some of those entries carry ORCID identifiers — Prag's http://orcid.org/0000-0003-3819-8537, Stoyanova's 0000-0003-3914-9569, Cummings's 0000-0002-6686-3728, Ahlholm's 0000-0001-8417-7089, Crellin's 0000-0002-0100-7437, the petrographer Coccato's 0000-0002-6641-2820 — making the editorial history not just citable but globally addressable to real persistent researcher identities.
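The resolution step described above (a who="#..." pointer resolved to a named person in a respStmt) can be sketched as follows. The header below is a hypothetical miniature, not the real ISic000118 header: the dates, resp texts, change descriptions and the "JP" identifier are placeholders, and only the "JCu" pointer and the two ORCID values come from the text above. The function name `editorial_chain` is likewise an assumption.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical miniature header; dates and descriptions are placeholders.
SAMPLE = """<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <respStmt>
    <name xml:id="JCu" ref="http://orcid.org/0000-0002-6686-3728">Cummings</name>
    <resp>placeholder role</resp>
  </respStmt>
  <respStmt>
    <name xml:id="JP" ref="http://orcid.org/0000-0003-3819-8537">Prag</name>
    <resp>placeholder role</resp>
  </respStmt>
  <revisionDesc>
    <change when="2000-01-01" who="#JCu">placeholder entry one</change>
    <change when="2000-01-02" who="#JP">placeholder entry two</change>
  </revisionDesc>
</teiHeader>"""

def editorial_chain(root):
    """Return (date, name, identifier) for each change, resolving who= pointers."""
    people = {}
    for name in root.findall(".//tei:respStmt/tei:name", NS):
        people[name.get(XML_ID)] = (name.text, name.get("ref"))
    chain = []
    for change in root.findall(".//tei:revisionDesc/tei:change", NS):
        key = (change.get("who") or "").lstrip("#")
        chain.append((change.get("when"), *people.get(key, (None, None))))
    return chain

for entry in editorial_chain(ET.fromstring(SAMPLE)):
    print(entry)
```

Run against a real file, the same function would list every dated contribution with the contributor's persistent identifier, provided the file follows this respStmt layout.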
One source, many outputs, via XSLT
The capability that makes structured preservation worth the cost: multiple deliverables from one source, generated automatically. From that one source, XSLT can generate:
- HTML for the web: Leiden-style transcription with interactive footnotes
- PDF for typeset print: print-on-demand volume from the same data
- Diplomatic version: editorial restorations stripped out to show what is actually on the stone
- Indexed onomasticon: automatically generated from <name> elements across the corpus
- Concordance: every attestation of every <w lemma="..."> grouped by lemma
- Audio: for visually impaired readers, with editorial conventions read aloud
- Database row: for aggregation into EDH, EDR, EDCS or Wikidata
- Translation alignment: line-by-line parallel text in any other language
The corollary: every transformation is repeatable and non-destructive. If a new question requires a different output, the XSLT is rewritten, the same XML is re-transformed, and the new output appears. The data has not been touched, and does not need to be, because it already preserves the structure the new question depends on.
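The one-source, many-views idea can be sketched in a few lines. This is Python standing in for XSLT, not the stylesheets any project actually uses, and the fragment and function names are invented for illustration. The same tree is rendered twice: once as a Leiden-style reading text, once as a simplified diplomatic view in which expansions are dropped and restored letters are shown as dots.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Invented fragment for illustration only.
SAMPLE = (
    '<ab xmlns="http://www.tei-c.org/ns/1.0">'
    '<expan><abbr>D</abbr><ex>is</ex></expan> '
    '<expan><abbr>M</abbr><ex>anibus</ex></expan> '
    '<supplied reason="lost">sacr</supplied>um'
    '</ab>'
)

def render(el, view):
    """Recursively render an element for the given view ('leiden' or 'diplomatic')."""
    tag = el.tag.replace(TEI, "")
    inner = (el.text or "") + "".join(render(c, view) + (c.tail or "") for c in el)
    if tag == "supplied":
        # Leiden prints the restoration; the diplomatic view shows only its extent.
        return f"[{inner}]" if view == "leiden" else "[" + "." * len(inner) + "]"
    if tag == "ex":
        # Expansions are editorial, so the diplomatic view omits them.
        return f"({inner})" if view == "leiden" else ""
    return inner

def leiden(root):
    return render(root, "leiden")

def diplomatic(root):
    return render(root, "diplomatic").upper()

root = ET.fromstring(SAMPLE)
print(leiden(root))
print(diplomatic(root))
```

Adding a third view means adding a third rendering rule; the source fragment itself is never edited.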
What a project needs · what a scholar needs to learn
Bodard is honest about the cost. EpiDoc requires more skill than typing Leiden into Word. But the skills are not technical novelties — they are editorial decisions made explicit.
- Scholar: learns to read EpiDoc XML well enough to review work; understands what each tag means as an editorial distinction; engages with the EpiDoc community via lists and fora.
- Programmer / RA: customises XSLT for the project; adapts CHET-C or similar Leiden→XML converter; ties the output into a web design.
- Web designer: wraps the XSLT-generated HTML into a coherent site.
- Not needed: a proprietary database system that locks the data away; reinventing tag conventions; software that depends on a single vendor.
Take-away for Week 12
Bodard's paper is the practical companion to Cayless's vision. Cayless asks why preserve; Bodard answers how. When the student in Week 12 writes their first <supplied reason="lost">, they are doing exactly what Bodard's paper describes — turning editorial judgement into machine-readable structure. The verbosity is the visible cost; the reusability is the invisible benefit, paid back many times over as the same data is rendered, indexed, aggregated, translated, and queried.
cert="low", or count abbreviations expanded by type, or list every editor who has touched a given file via <revisionDesc>: every such tool can do so without re-editing the inscriptions, because the data is there.
Integrating Palaeographic Research into the Digital Epigraphy of Multilingual Sicily
Beyond text: what happens when palaeography — the shape of the letters, the surface they were cut into, the choices the cutter made — also becomes structured, queryable, computable data.
An inscription has more than text
Sicily across 1,500 years carries Greek, Latin, Punic, Elymian, Sikel, and Oscan writing on stone, ceramic, brick, metal, and wax. The Crossreads project asks what happens when every dimension of these inscriptions — not just the text — is preserved as linked structured data.
<material> with EAGLE / Getty AAT URIs.
None of these questions is answerable from a flat-text corpus. They require palaeographic annotation, material identification, geographic data, dating, and language classification to be linked simultaneously. The EpiDoc structure is the substrate that makes the linkage possible.
Connecting letterforms to text, material, date, place
Stoyanova & Prag describe what they call the "main challenge" — designing an effective mechanism to integrate information from multiple datasets to enable rich and complex queries. The integration is achievable only because each layer was, individually, preserved as structured data with stable identifiers and linked-open-data anchors.
An example from a real I.Sicily file (ISic000118, a funerary epitaph from Thermae Himeraeae) shows what structured preservation of just one dimension — material — looks like:
This single element preserves: the material name in English (conglomerate); a taxonomy anchor (#material.inorganic.stone); a typological classification (type, subtype); the petrographic specialist who identified it (Coccato, with ORCID elsewhere in the file); and a URI to the EAGLE vocabulary entry that aligns this term with every other epigraphic project's controlled vocabulary for materials.
Crossreads extends this pattern to letterforms
Each letterform observation — alpha with broken bar, alpha with straight bar, sigma four-bar versus lunate — is recorded against the IIIF image region, attributed to the palaeographer who made the observation, anchored to the EpiDoc text-position via stand-off markup, queryable against the material, date, place, and language of the inscription. The palaeographer's judgement becomes data the way the editor's restoration became data in Bodard's example.
Every research dimension is a candidate for structured preservation
The Crossreads project demonstrates a pattern that runs through all six papers in this lecture: any aspect of an inscription that a scholar observes carefully enough to publish is also an aspect that, if preserved as structured data, supports computational analysis at scale.
| Observation | Print form | Structured form (XML/LOD) |
|---|---|---|
| The text | Leiden brackets | <supplied reason="lost">, <unclear>, <expan>... |
| The material | "marble" in the description | <material> with EAGLE URI + responsible specialist |
| The findspot | "Termini Imerese (anc. Thermae Himeraeae)" | <origPlace ref="pleiades:462513"> + geo-coordinate |
| The date | "Imperial" in a footnote | <origDate notBefore="0001" notAfter="0250" cert="low"> |
| The hand | "the cutter was inexpert" | <handNote> with measurements + IIIF region links to letter samples |
| The letterform | "alpha with broken bar" — in a plate caption | Archetype annotation, image-region-addressable, queryable by date and material |
| The editorial chain | bibl entries in tiny print | <respStmt> with ORCID + <revisionDesc> with timestamped changes |
Real evidence · different corpora preserve different things
The rubric scored at population scale (rubric_full.json, 10,249 source files across all four corpora) shows a striking pattern: encoding traditions specialise. Each corpus is strong on different rubric axes, exactly as Stoyanova & Prag predict for a multilingual, multi-period, multi-material region like Sicily — where no single encoding decision is universally optimal.
| Rubric axis (median per corpus) | I.Sicily (n=4,782) | IRCyr (n=2,360) | IRT (n=1,618) | iAph (n=1,489) |
|---|---|---|---|---|
| ① Semantic layers · number of distinct content layers | 10 🟢 | 9 | 9 | 7 |
| ② URI density · outbound LOD links | 12 🟢 | 5 | 6 | 0 ⚠ |
| ③ respStmt depth · depth of the editorial responsibility chain | 4 | 7 🟢 | 4 | 6 |
| ④ Citability granularity · citation granularity | 2 | 3 | 3 | 3 |
| ⑤ Question coverage · classes of answerable questions | 6 | 9 | 10 🟢 | 8 |
| ⑥ Provenance · traceability of provenance | 3 | 3 | 3 | 1 |
| ⑦ Bibl genealogy · depth of scholarly genealogy | 4 | 2 | 4 | 4 |
| Total median | 42 | 38 | 38 | 28 |
No single corpus dominates every axis. Stoyanova & Prag's argument for multi-modular preservation — text + linguistic annotation + petrography + IIIF + palaeography — extends this insight: different research dimensions require different preservation moves, and the encoding format has to support all of them simultaneously without forcing trade-offs.
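A reading note on the table, offered as a sketch under an assumption: if the "Total median" row is the median of per-file totals (the source does not say how it is computed), it need not equal the sum of the axis medians above it. The toy scores below are invented purely to show the arithmetic and are not drawn from rubric_full.json.

```python
from statistics import median

# Toy scores: one tuple per file, one number per rubric axis (invented values).
files = [(1, 1), (2, 9), (9, 2)]

def sum_of_axis_medians(rows):
    """Take the median of each axis separately, then add the medians."""
    return sum(median(axis) for axis in zip(*rows))

def median_of_totals(rows):
    """Add each file's axis scores first, then take the median of the totals."""
    return median(sum(row) for row in rows)

print(sum_of_axis_medians(files))
print(median_of_totals(files))
```

The two functions disagree on the toy data because a file that scores high on one axis can score low on another, which is the specialisation pattern the table describes.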
Take-away for Week 12
EpiDoc is not only for the inscription's text. It is for everything the scholar observes about the inscription — including aspects (letterforms, petrography, editorial hands) that traditional print could not even document well. The XML structure is what makes those observations linkable to each other, and therefore queryable, and therefore actually usable for the kinds of large-scale historical and linguistic questions the field wants to ask.
ISic000118 referenced on this slide ships in this very workshop bundle at /data/xml/isicily/ISic000118.xml. Open it: you will find the full material, palaeographic, geographic, and editorial-chain structure preserved exactly as Stoyanova & Prag's framework describes.
Describing Inscriptions of Ancient Italy: The ItAnt Project and Its Information Encoding Process
The detailed case-study of customising EpiDoc for fragmentary, non-canonical languages (Oscan, Faliscan, Venetic, Cisalpine Celtic) where the standard tagset, designed for Greek and Latin, requires deliberate extension — without breaking interoperability.
What if your inscription is in Faliscan, not Latin?
The ItAnt project (Italian PRIN, CNR-ILC Pisa + Ca' Foscari Venice + University of Florence) focuses on the languages of pre-Roman Italy attested only in epigraphic form — Restsprachen, "remnant languages." Their epigraphic record is fragmentary, often uncertain in reading and segmentation, and presents challenges the standard EpiDoc tagset (built first for Greek and Latin) does not directly handle.
| Language | Period | Script(s) | Particular challenges |
|---|---|---|---|
| Oscan | 5th c. BCE – 1st c. CE | Oscan, Latin, Greek | Same language in three different scripts → cannot collapse language and script into one attribute |
| Faliscan | 7th – 2nd c. BCE | Faliscan, Latin | Scriptio continua (no word division on stone) → word-boundary decisions are editorial |
| Venetic | 6th c. BCE – 1st c. BCE | Venetic (North Italic) | Syllabic punctuation marks syllables, not words → a notation system with no EpiDoc precedent |
| Cisalpine Celtic | 6th c. BCE – 1st c. CE | Lepontic, Latin | Linguistic identification of fragments is itself contested → uncertainty has to be encoded |
Murano et al.'s response is not to invent a new schema. It is to extend EpiDoc with carefully chosen additions that record exactly the distinctions these materials require — preserving them as structured data, while keeping interoperability with the international EpiDoc community.
The five ItAnt EpiDoc extensions
- <rs> inside <scriptNote> with @type values: scriptio continua, punctuation, blank spaces, mixed. Records the editor's segmentation decision as data.
- <rs> inside <support> for the object's shape and possible ancient reuse. Records material-cultural information lost in plain catalogues.
- @ident on <language>. ItAnt keeps the standard for compliance, but adds a separate <rs> in <scriptNote> recording the script independently. Oscan-in-Greek-script ≠ Oscan-in-Latin-script ≠ Oscan-in-Oscan-script.
- @type: praenomen, gentilicium, patronymic, etc. A @ref attribute ties together the components of a formula even across word-boundaries or shared components, with the full formula resolved in <listPerson> within the commentary.
- <w> gets a transparent @xml:id like Fal_6_l_1_w_2 = "second word of line 1 of the sixth Faliscan inscription in the collection." The ID is both machine-parseable and human-readable: anyone can navigate the corpus by name.
<rs> elements with typed @ana or @type attributes). The output XML is still EpiDoc-conformant. Standard EpiDoc tooling can read it. New ItAnt-aware tooling can extract the customised information.
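The "machine-parseable" half of the transparent-ID claim can be sketched as a small parser. The pattern below is inferred from the single example Fal_6_l_1_w_2 and its gloss above; the general shape of ItAnt identifiers is an assumption, and the names `WordId` and `parse_word_id` are hypothetical.

```python
import re
from typing import NamedTuple

class WordId(NamedTuple):
    corpus: str       # language or collection prefix, e.g. "Fal"
    inscription: int  # running number of the inscription in the collection
    line: int         # line number on the object
    word: int         # position of the word within the line

# Assumed shape: <prefix>_<inscription>_l_<line>_w_<word>
_PATTERN = re.compile(r"^([A-Za-z]+)_(\d+)_l_(\d+)_w_(\d+)$")

def parse_word_id(xml_id: str) -> WordId:
    """Split a transparent word identifier into its four components."""
    match = _PATTERN.match(xml_id)
    if match is None:
        raise ValueError(f"not a word identifier of the assumed shape: {xml_id!r}")
    corpus, inscription, line, word = match.groups()
    return WordId(corpus, int(inscription), int(line), int(word))

print(parse_word_id("Fal_6_l_1_w_2"))
```

Because the identifier carries its own coordinates, a tool can sort or group words by inscription and line without opening the file they came from.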
What flat-text would discard
Consider an Oscan inscription written in Greek script. In a flat-text edition you would see the Greek letters and a footnote saying "Oscan." In EpiDoc with ItAnt's customisations:
That XML says, machine-readably:
- The language is Oscan (ident="osc"): searchable, aggregatable, joinable to Wikidata and linguistic resources.
- The script is Greek (in its Oscan-adapted form): independently queryable. "Show me all Oscan inscriptions written in Greek script" is one query; "show me all Oscan inscriptions written in Oscan script" is another. Both are answerable.
- Words are separated by punctuation, not by spaces, so any text-processing tool knows how to tokenise this particular inscription correctly.
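The two queries quoted in the list can be sketched once language and script live in separate fields. Everything in this example is a toy: the three headers are generated in code, the document identifiers are invented, and the `<rs type="script">` layout is an assumption made for illustration rather than ItAnt's actual attribute vocabulary. Only ident="osc" and the placement of <rs> inside <scriptNote> come from the text above.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def header(lang: str, script: str) -> str:
    """Build a toy header; the <rs type="script"> layout is an assumption."""
    return (
        '<teiHeader xmlns="http://www.tei-c.org/ns/1.0">'
        f'<langUsage><language ident="{lang}"/></langUsage>'
        f'<scriptNote><rs type="script">{script}</rs></scriptNote>'
        '</teiHeader>'
    )

# Invented three-document corpus.
CORPUS = {
    "toy-1": header("osc", "Greek"),
    "toy-2": header("osc", "Oscan"),
    "toy-3": header("lat", "Latin"),
}

def select(corpus, lang, script):
    """Identifiers of documents matching both the language and the script."""
    hits = []
    for doc_id, xml in corpus.items():
        root = ET.fromstring(xml)
        language = root.find(".//tei:language", NS)
        rs = root.find(".//tei:scriptNote/tei:rs[@type='script']", NS)
        if language is None or rs is None:
            continue
        if language.get("ident") == lang and rs.text == script:
            hits.append(doc_id)
    return hits

print(select(CORPUS, "osc", "Greek"))
print(select(CORPUS, "osc", "Oscan"))
```

If language and script were collapsed into one attribute, the two calls could not be told apart without inventing a compound code for every combination.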
Real example · how fully-marked-up Greek looks at scale
The principle Murano et al. apply to fragmentary Italian languages applies equally to canonical Greek and Latin. Here is a fragment from A.30.xml (Anastasius I's imperial edict at Apollonia, c. 491–518 CE) in the mounted IRCyr corpus — every word marked with <w lemma="...">, every name marked with <persName> + <name> + nymRef, every abbreviation marked with <expan> + <abbr> + <ex>, every lost letter marked with <supplied reason="lost">:
Counting only Greek-script <w lemma=> elements across the four EpiDoc inscription corpora, the demonstrator's federation lemma resource federation_lemmas_full.json records 41,411 attestations. That is the linguistic substrate ItAnt-style customisations enable at federation scale. Each token-lemma pair preserves: surface form, normalised surface, lemma, normalised lemma, line number, file, corpus, language, provenance, licence. No print edition could carry this density.
A.30.xml referenced here lives at /Users/chingyuanwu/Documents/epidoc/isicily/m0-demonstrator/data/xml/ircyr/A.30.xml (mounted) and at /Users/chingyuanwu/Documents/epidoc/kcl_tei/cyrenaica/A.30.xml (also mounted). Try opening either — count the <w lemma=> elements. There are 177 of them in this one edict, every Greek word individually preserved with its dictionary form.
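The counting exercise suggested above can be done with a few lines of standard-library Python. The inline sample is invented so that the block runs on its own, and the function names are hypothetical; to try it on a real file, replace the sample with `ET.parse(path_to_file).getroot()`, assuming the file uses the TEI namespace.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_W = "{http://www.tei-c.org/ns/1.0}w"

# Invented sample: three lemmatised words and one without a lemma.
SAMPLE = (
    '<ab xmlns="http://www.tei-c.org/ns/1.0">'
    '<w lemma="καί">καὶ</w> <w lemma="θεός">θεοῦ</w> '
    '<w>ΑΒ</w> <w lemma="καί">καὶ</w>'
    '</ab>'
)

def count_lemmatised_words(root) -> int:
    """Number of <w> elements that carry a non-empty lemma attribute."""
    return sum(1 for w in root.iter(TEI_W) if w.get("lemma"))

def lemma_frequencies(root) -> Counter:
    """How often each lemma is attested in the document."""
    return Counter(w.get("lemma") for w in root.iter(TEI_W) if w.get("lemma"))

root = ET.fromstring(SAMPLE)
print(count_lemmatised_words(root))
print(lemma_frequencies(root).most_common(1))
```

The same two functions, looped over a directory of files, are the core of a concordance or a federation-wide lemma count.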
Beyond EpiDoc: connecting to the cultural-heritage semantic web
ItAnt does not stop at extending EpiDoc. Murano et al. also encode the inscriptions against the CIDOC CRM conceptual reference model (the ISO 21127 ontology used across European cultural-heritage institutions) and its extensions CRMtex (for ancient texts) and CRMinf (for scholarly arguments and inferences). And bibliographic metadata is encoded in FRBRoo / LRMoo.
The result is a record that is simultaneously: an EpiDoc inscription, a CIDOC-CRM-typed cultural object, a CRMtex-typed text-bearing artefact, a CRMinf-typed scholarly argument, and an LRMoo-typed bibliographic citation. Each lens supports different queries. The data exists once; the queries — and the interoperability with non-epigraphic cultural-heritage databases — multiply.
Take-away for Week 12
Murano et al.'s paper shows that EpiDoc is not a monolithic standard you accept or reject. It is a foundation that extends to your specific research needs. The lesson for the student is twofold: first, every customisation must be principled (use <rs> with controlled vocabularies rather than inventing new elements); second, every customisation should serve preservation, adding categorical distinctions the standard does not yet carry and never collapsing distinctions the standard already records.
Domain-Specific Languages for Epigraphy: The Case of ItAnt
The companion paper to Murano et al. Once you have committed to preserving all the data in EpiDoc XML, the question becomes: how do you make it humane to write? Answer: a Domain-Specific Language that compiles down to EpiDoc.
EpiDoc XML preserves everything, and is painful to write
The cost of preservation is verbosity. Each <supplied reason="lost"> is twenty-four characters of opening tag for one character of editorial decision. Each opening tag needs a matching closing tag. Each attribute name and value must be written in full. The ratio of information content to structural scaffolding is unbalanced, and human readability decreases rapidly as complexity increases.
In ItAnt, the data density is high: linguistic, philological, and prosopographical information overlaps. A lacuna spans the end of token 3 and the start of token 4. A named entity (praenomen partially conjectured, gentilicium, patronymic) spans tokens 4 through 6. These overlapping hierarchies are the well-known weakness of XML, addressed by stand-off markup or alternative representations — but both add complexity for the human encoder.
Boschetti et al. test this with a real philologist (linguistics graduate, epigraphic competence, basic DH skills only). She encodes five Faliscan inscriptions twice — once in raw EpiDoc, once in ItAntDSL. The results are qualitative but consistent: training time on the DSL is significantly shorter; encoding time is markedly lower; DSL documents are approximately three times more compact than the EpiDoc counterparts.
Concrete · how big is one EpiDoc file, really?
Look at the actual ISic000118.xml in the mounted demonstrator. This is one funerary inscription. Four lines of Latin on the stone. What is preserved in the XML, by the numbers:
| Measurement | Value | What it tells us |
|---|---|---|
| The inscription itself | 4 lines of Latin (~50 words) | What goes on the printed page |
| EpiDoc XML file size | 17,073 characters | ~400× the size of the bare text |
| Lines of XML source | 275 lines | ~70× the line count of the inscription |
| Named editors in respStmt | 6 | Provenance preserved per contribution |
| Change entries in revisionDesc | 11 | Editorial history preserved |
| External LOD URIs (Pleiades, EAGLE, EDR, TM, DOI, ORCID) | 12+ | Linked into the international graph |
How the DSL becomes EpiDoc XML
[Pipeline diagram: DSL source (human-written) → intermediate XML (proprietary schema) → XSLT expansion (merging with YAML) → EpiDoc XML (output)]
Key design choices that make this work:
- ANTLR for parsing. The DSL has a formal context-free grammar (available on GitHub at CoPhi/itantdsl). The parser produces an Abstract Syntax Tree from which an intermediate XML representation is generated.
- Six design dimensions optimised (Zenzaro et al. 2022): familiarity, transparency, completeness, compactness, consistency, actionability. The DSL wins on the first three; XML wins on actionability; both target completeness and consistency.
- Shared metadata in YAML. Information repeated across every inscription (language definitions, script definitions, vocabulary aliases) lives once in a YAML look-up file. The DSL references it by key. The XSLT expansion merges it in. Each inscription stays compact.
- EpiDoc is the canonical artefact. The DSL is the editorial surface. The XML is the data that ships. ItAnt does not invent an alternative standard — the international ecosystem still receives standard EpiDoc.
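The design choices above (parse a compact source, merge shared metadata by key, emit standard XML) can be sketched in miniature. This is a toy, not ItAntDSL: the "key: value" syntax, the `SHARED` dictionary standing in for the YAML look-up file, the language code chosen for it, and the function names are all assumptions made for illustration, and string substitution stands in for the ANTLR parser and the XSLT expansion.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Stand-in for the shared YAML look-up file (hypothetical keys and values).
SHARED = {"fal": {"ident": "xfa", "label": "Faliscan"}}

def parse(source: str) -> dict:
    """Parse 'key: value' lines into a dict (a stand-in for the syntax tree)."""
    doc = {}
    for line in source.strip().splitlines():
        key, _, value = line.partition(":")
        doc[key.strip()] = value.strip()
    return doc

def compile_to_xml(source: str) -> str:
    """Expand the compact source into an XML fragment, merging shared metadata."""
    doc = parse(source)
    meta = SHARED[doc["lang"]]  # merge step: look the short key up once
    body = escape(doc["text"])
    # [abc] in the compact source becomes an explicit restoration element.
    body = re.sub(r"\[([^\]]+)\]", r'<supplied reason="lost">\1</supplied>', body)
    return f'<ab xml:lang="{meta["ident"]}">{body}</ab>'

xml = compile_to_xml("lang: fal\ntext: [abc] def")
ET.fromstring(xml)  # raises if the output is not well-formed
print(xml)
```

The editor types two short lines; the output is ordinary XML that any downstream tool can parse, which is the division of labour the list describes.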
Ergonomics serves preservation; it does not bypass it
The temptation, when faced with EpiDoc's verbosity, is to say "I'll just use a flat text and skip the XML." That is the move Boschetti et al. explicitly reject. The DSL is faster and more compact and easier to learn, but it produces the same full EpiDoc XML the field expects. The structure is preserved end-to-end.
Connection to your Week 12 hands-on
When you first try to write inscription data in EpiDoc, you will feel the friction Boschetti et al. document. That is normal. The friction is a sign that you are being asked to articulate distinctions that flat-text writing simply did not require you to make. The friction is the cost of preservation.
For your first attempts, write the EpiDoc by hand. Feel where it pinches. Notice which decisions feel laborious (line breaks? abbreviations? names?). Then — and only then — would it be sensible to look for or build a DSL or template editor for your specific corpus. The DSL is a useful tool, but the EpiDoc literacy is the foundational skill, and the friction of writing it raw is part of how you build that literacy.
Papyrological Editor (papyri.info): the canonical community-curation workflow for EpiDoc, with versioning and peer review.
ItAntDSL: domain-specific editor for fragmentary Italian languages, deposited at hdl.handle.net/20.500.11752/ILC-1003.
Leiden ⇄ EpiDoc playground: shipped in this workshop bundle as a sibling file to this slide deck, a workshop-grade DSL surface for trying out the conversion live.
Open Scholarship: Epigraphic Corpora in the Digital Age
The synthesis paper. Three of the field's most senior figures, looking back across the corpora tradition from Boeckh and Mommsen to FAIR Epigraphy, argue that the largest obstacle facing digital epigraphy is no longer technical: it is cultural.
From CIG (1815) to FAIR Epigraphy (2022): why corpora exist
Boeckh's 1815 proposal to the Berlin Academy already had the core elements: comprehensive geographic coverage, autopsy-based editions, indices of names. What he could not have anticipated: that the 19th-century print model would, two centuries later, encounter limits that digital methods now have to address.
Bodel/Prag/Roueché's claim: digital methods, in particular EpiDoc TEI for encoding and Linked Open Data for connection, enable epigraphic corpora to capture far more than Leiden or its codex-bound print conventions could ever carry. The question is no longer whether digital is possible — it is whether the community is willing to adapt its practice.
Why "CIL I² 1221 = VI 9499 = ILS 7472 = CLE 959 = ILLRP 793" is wrong
One of the paper's sharpest critiques targets a practice almost every epigraphist has engaged in: citing an inscription by stringing together several corpus numbers separated by equals signs, as if they were identifiers for the same text rather than separate editions by separate scholars at separate times.
("When the editor is not the author of the transcription, has not seen the stone or the squeeze... one must indicate with care whose copy it is; this is essential for criticism.")
The citation chain conceals authorship. It treats edition as if it were identifier. And in the digital age this practice is, if anything, more dangerous. The EDCS entry might be a copy of an existing edition, or it might be modified — without a date, without a named editor, without a documented method. The PHI text might or might not reflect a particular published edition. The user has no way to know.
What structured preservation enables instead
| Resource | Citation level | What it tells you |
|---|---|---|
| CIL I² 1221 | print corpus, inscription no. 1221 in vol. I, 2nd edition (1918) | Mommsen autopsy, recorded "descripsi" |
| EDR167214 (Butini 2022) | specific digital edition, attributed, dated | Author, revision date, source-edition explicit; checked against image |
| TM574526 | abstract identifier, not an edition | Inscription-in-the-abstract pointer; resolves to a list of all known editions |
| EDCS-19200211 | opaque | Some text, some metadata, no attribution, no date, no method |
The contrast is stark. EDR's edition is structured, attributed, and citable to a particular scholar at a particular time, because the EpiDoc XML preserves those things. EDCS's entry — even though it exists in the same digital medium — is opaque because the underlying data structure did not preserve provenance. The opacity is not because the medium is digital; it is because the data was not structured to preserve authorship and method.
From IRT (1952) → IRT2009 → IRT2021: preservation enabled iteration
Joyce Reynolds and J.B. Ward-Perkins published The Inscriptions of Roman Tripolitania in 1952, post-war, with limited photography (38 plates for 1000 texts). Louis Robert promptly noted the limitations. The volume served the field for half a century.
In 2009 the British School at Rome, ISAW (NYU), and KCL collaborated to publish IRT2009 online — the 1952 text re-encoded in EpiDoc XML with images, translations, and structured metadata. Because the encoding preserved structure, when scholars wanted updates, those updates could be applied surgically. Joyce Reynolds provided English translations. Place identifications got Pleiades URIs. Editorial decisions stayed traceable.
In 2021, the same data was enriched: every Greek or Latin inscription from Tripolitania published since 1952 was added. The result, IRT2021, took less than 12 months to produce. Standard EpiDoc encoding meant the existing data did not have to be re-keyed. New inscriptions slotted in. Indices auto-regenerated. Wikidata links to LGPN (Lexicon of Greek Personal Names) and PIR (Prosopographia Imperii Romani) propagated.
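Why indices can regenerate automatically is easier to see in code. The sketch below is not the IRT2021 build pipeline; it only assumes that persons are tagged with TEI `persName` elements carrying a `key` attribute, as in the A.30 example. The document ids, the `tiny_edition` helper, and the keys other than `anastasius` are invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def build_person_index(documents):
    """Map each persName @key to the sorted ids of documents that mention it."""
    index = defaultdict(set)
    for doc_id, xml_text in documents.items():
        root = ET.fromstring(xml_text)
        for name in root.iter(TEI_NS + "persName"):
            key = name.get("key")
            if key:  # untagged names simply do not reach the index
                index[key].add(doc_id)
    return {key: sorted(ids) for key, ids in sorted(index.items())}


def tiny_edition(*keys):
    """Build a toy TEI document (hypothetical, far smaller than real EpiDoc)."""
    names = "".join(f'<persName key="{k}">[name]</persName>' for k in keys)
    return (
        '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
        f'<div type="edition"><ab>{names}</ab></div>'
        "</body></text></TEI>"
    )


corpus = {
    "text-001": tiny_edition("anastasius"),
    "text-002": tiny_edition("person-x"),
}
index_before = build_person_index(corpus)

# A newly published inscription is added; nothing already encoded is re-keyed.
corpus["text-003"] = tiny_edition("anastasius", "person-y")
index_after = build_person_index(corpus)

print(index_after["anastasius"])
```

The index is derived from the encoded files rather than maintained beside them, so adding a new edition and rerunning the function is the whole update.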
[Timeline graphic: IRT 1952 (1000 texts, 38 plates) → IRT2009 (+ images, + translations) → IRT2021 (+ Wikidata links, + LGPN / PIR) → aggregator pilot]
This is what Cayless's 2009 vision of "true datasets able to be queried, mined, and transformed" looks like in practice, fifteen years later. The corpus is no longer a static publication. It is a living, citable, versionable dataset, and the cost of updating it is bounded by the cost of writing new editions — not by the cost of resetting the entire print.
The future of the corpus: dynamic, federated, FAIR
Bodel/Prag/Roueché's closing proposal is an inversion of the existing corpus model: instead of a single monumental corpus (CIG, CIL, IG) covering everything, build directly on the proliferation of local and thematic corpora (IRT, IRCyr, I.Sicily, USEP, PETRAE, etc.) and connect them via standards.
The principles · FAIR data
- Findable: unique persistent identifier (DOI, TM, URI), rich metadata in a searchable resource
- Accessible: retrievable by its identifier, using open/free protocols, with metadata accessible even when the data is restricted
- Interoperable: shared formal vocabularies (EAGLE, FAIR Epigraphic Vocabularies, Getty AAT), language-aware, machine-actionable
- Reusable: clear licensing, accurate provenance, community standards, rich documentation
The pilot · inscriptiones.org
The FAIR Epigraphy project (AHRC + DFG, dir. Horster & Prag) has demonstrated the pattern. A software engineer with no epigraphic background, working a few hours, transformed XML files from Roman Inscriptions of Britain (RIB), Inscriptions of Greek Cyrenaica (IGCyr), and I.Sicily into RDF using a community-built ontology. The output is queryable at https://inscriptiones.org/.
What this looks like in the mounted demonstrator
The same demonstrator that hosts these slides ships its own Pelagios TTL serialisation at /data/json/federation_pelagios.ttl, exporting all 56 records as oa:Annotation objects linked to pleiades: places. This is the actual output format Bodel/Prag/Roueché describe: a real RDF file, openly downloadable, sitting at federation scale.
Each inscription becomes an annotation in the Pelagios graph, expressed as a small group of triples. Each place link makes the inscription visible to Peripleo and any other Pelagios consumer. The full federation TTL is 336 triples across 56 records, an average of six per record. It is a small file, but a working contribution to the international LOD graph that aggregators like inscriptiones.org and the Trismegistos cross-reference graph can ingest.
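To make the arithmetic concrete, here is a rough sketch of how one record might expand into six triples. The triple shape is an assumption chosen so that 56 records give 336 triples; it is not copied from federation_pelagios.ttl, and the `example.org` URIs, the `000000` place id, and the function names are placeholders. The predicates come from the W3C Web Annotation vocabulary and Dublin Core terms.

```python
OA = "http://www.w3.org/ns/oa#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCTERMS = "http://purl.org/dc/terms/"
LICENCE = "https://creativecommons.org/licenses/by/4.0/"


def iri(value):
    """Wrap an absolute IRI in angle brackets, as Turtle requires."""
    return f"<{value}>"


def literal(text):
    """Quote a plain string literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def annotation_triples(record_uri, place_uri, title):
    """Return six triples for one record (hypothetical shape, see lead-in)."""
    anno = record_uri + "#place-annotation"
    return [
        (iri(anno), iri(RDF_TYPE), iri(OA + "Annotation")),
        (iri(anno), iri(OA + "hasTarget"), iri(record_uri)),
        (iri(anno), iri(OA + "hasBody"), iri(place_uri)),
        (iri(anno), iri(OA + "motivatedBy"), iri(OA + "linking")),
        (iri(record_uri), iri(DCTERMS + "title"), literal(title)),
        (iri(record_uri), iri(DCTERMS + "license"), iri(LICENCE)),
    ]


def serialise(triples):
    """Write one 'subject predicate object .' statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# 56 placeholder records standing in for the federation.
records = [
    {
        "uri": f"https://example.org/inscription/{n:03d}",
        "place": "https://pleiades.stoa.org/places/000000",  # placeholder id
        "title": f"Placeholder inscription {n}",
    }
    for n in range(1, 57)
]
triples = [
    t for r in records
    for t in annotation_triples(r["uri"], r["place"], r["title"])
]
print(len(records), len(triples))
print(serialise(triples[:3]))
```

Full IRIs are written out instead of prefixed names so that each line stands alone; a real serialisation would normally declare prefixes once at the top of the file.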
Real proof · the international encoding tradition is internally consistent
One of Bodel/Prag/Roueché's underlying assumptions — that structured preservation across corpora is meaningful enough to aggregate — is testable from the same demonstrator data. The cross-encoder consistency analysis (cross_encoder_full.json) measures: across 41,411 hand-attested Greek lemma instances from four EpiDoc inscription corpora, how often do different editorial teams give the same Greek surface form the same lemma?
What this means: the field's editorial-practice cohesion is high enough that pooled benchmarks of Greek lemmatisation are defensible. Aggregation across corpora is methodologically possible. The preserved structured data is consistent enough to support cross-corpus computation — exactly the move Bodel/Prag/Roueché argue should be possible in principle.
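The consistency measure described above can be sketched in a few lines. This is a reading of the description, not the code that produced cross_encoder_full.json: it assumes that a "multi-attested" surface is one lemmatised in at least two corpora, that agreement means every corpus assigns the same single lemma, and that the interval is a normal-approximation 95% interval. The attestations are toy data.

```python
import math
from collections import defaultdict


def cross_corpus_agreement(attestations):
    """Return (agreement rate, 95% half-width, n) over shared surfaces.

    attestations: iterable of (corpus, surface, lemma) tuples.
    """
    lemmas = defaultdict(lambda: defaultdict(set))
    for corpus, surface, lemma in attestations:
        lemmas[surface][corpus].add(lemma)

    # Keep only surfaces that at least two editorial teams have lemmatised.
    shared = {s: by_corpus for s, by_corpus in lemmas.items() if len(by_corpus) >= 2}
    n = len(shared)
    if n == 0:
        raise ValueError("no surface is attested in two or more corpora")

    # A surface agrees when all corpora together use exactly one lemma.
    agreeing = sum(
        1 for by_corpus in shared.values()
        if len(set().union(*by_corpus.values())) == 1
    )
    rate = agreeing / n
    half_width = 1.96 * math.sqrt(rate * (1 - rate) / n)  # normal approximation
    return rate, half_width, n


toy = [
    ("isicily", "καί", "καί"), ("ircyr", "καί", "καί"),
    ("iaph", "θεοῦ", "θεός"), ("irt", "θεοῦ", "θεός"),
    ("isicily", "πόλεως", "πόλις"), ("iaph", "πόλεως", "πόλεως"),  # disagreement
    ("ircyr", "τοῦ", "ὁ"), ("irt", "τοῦ", "ὁ"),
    ("irt", "δῆμος", "δῆμος"),  # one corpus only, so not counted
]
rate, half_width, n = cross_corpus_agreement(toy)
print(f"agreement {rate:.2%} ± {half_width * 100:.2f} pp over n = {n}")
```

With only a handful of surfaces the interval is far too wide to mean anything. At n in the low thousands the same formula gives a half-width of a few percentage points.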
Reported interval: ±2.14 pp (95% CI), n = 1,042 multi-attested surfaces. Source: cross_encoder_full.json in the mounted demonstrator.
What the community has to do
- Cite digital editions properly: author, date, version, URL
- Treat each edition as a scholar's editorial work — not as an opaque database row
- Agree on community standards for categories (inscription type, material, support), while leaving each project free to choose how those categories are represented
- Document the construction of editions: who recorded the metadata, who took the photograph, who compiled the bibliography
- Use ORCIDs to track contributions in collaborative environments
- Adopt versioning (Zenodo deposits, GitHub history, semantic versioning of corpus editions)
- License data openly (CC-BY-4.0 is the field's emerging default)
- Stop citing inscriptions by equals-sign chains that obscure authorship
- Stop publishing data in HTML-only interfaces with no bulk export
Take-away for Week 12
When you write inscription data in EpiDoc for the first time this week, you are participating in the system Bodel/Prag/Roueché describe. Your editorial decisions become part of a federation that — if the community standards hold — can be cited, versioned, aggregated, queried, and re-used by anyone in the field, forever. The cost of preservation is real. The dividend is the world the three authors are arguing for: a research culture in which our data is as open, citable, and re-usable as our arguments have always claimed to be.
What the six papers say together
| Paper | Year | Core argument |
|---|---|---|
| Cayless | 2009 | Flat-text encoding of Leiden discards semantic information; XML preserves it as a parse tree that supports the scholarly primitives beyond simple search. |
| Bodard | 2010 | EpiDoc is the operational form of Cayless's argument: a TEI subset that preserves editorial decisions as structured data, supports multiple outputs from one source, and connects to the wider digital-humanities ecosystem. |
| Stoyanova & Prag | 2021 | The preservation principle extends beyond text: palaeographic observation, petrographic identification, IIIF image regions, and linguistic annotation all become first-class structured data in the Crossreads framework. |
| Murano et al. | 2023 | EpiDoc is not Greek-and-Latin-only; with principled customisation (typed <rs> elements, controlled vocabularies, transparent IDs), it serves the fragmentary languages of pre-Roman Italy without breaking interoperability. |
| Boschetti et al. | 2024 | The ergonomic cost of writing EpiDoc by hand is real; a Domain-Specific Language can serve as a friendlier editorial surface, but the canonical output remains EpiDoc XML — preservation is never bypassed, only made more humane. |
| Bodel/Prag/Roueché | 2024 | The technical infrastructure for federated, FAIR, open epigraphic corpora exists. What remains is cultural: rigorous citation, community-agreed standards, named authorship, versioning, open licensing. |
Now: your hands-on
Try writing inscription data in EpiDoc. Pick a short inscription from the mounted /Users/chingyuanwu/Documents/epidoc/isicily/m0-demonstrator/data/xml/ corpus — perhaps ISic000118 (Latin epitaph) or A.30 (Greek imperial edict from Cyrene) — open its existing EpiDoc XML, and read it line by line. Then try writing your own short EpiDoc record from scratch for a Greek or Latin text of your choosing. Notice what the standard tagset gives you. Notice what your specific corpus might require beyond the standard. Notice which editorial decisions you are now recording as data that you might previously have left in your head or in a footnote.
The data you write this week is preservation infrastructure. Future you, and future others, will be able to ask it questions you have not yet thought of.
Mounted folders for hands-on work: /Users/chingyuanwu/Documents/epidoc/isicily/ · /Users/chingyuanwu/Documents/epidoc/kcl_tei/ (Aphrodisias, Cyrenaica, Tripolitania) · /Users/chingyuanwu/Documents/epidoc/inscription_databases/. The companion Leiden ⇄ EpiDoc playground is shipped at /Users/chingyuanwu/Documents/epidoc/isicily/m0-demonstrator/leiden-playground.html.
Concrete artefacts to open today
Every claim in this deck was anchored in real files. The mounted /Users/chingyuanwu/Documents/epidoc/ folder hosts all of them. Here is the suggested walkthrough order:
| For paper | File | What to look at |
|---|---|---|
| §1 Cayless · §2 Bodard | isicily/m0-demonstrator/data/xml/isicily/ISic000118.xml | Open in any text editor. Find the <revisionDesc> at the top. Count the <change> elements. Find the <material> element. See how it carries an EAGLE URI. This is the level of preservation EpiDoc operationalises. |
| §2 Bodard · §4 Murano | isicily/m0-demonstrator/data/xml/ircyr/A.30.xml | Find the Anastasius edict text. Count <w lemma=> elements (177 of them). See how each Greek word is preserved with its dictionary form. See <persName type="emperor" key="anastasius"> linking to a person authority. |
| §2 Bodard · §6 Bodel/Prag/Roueché | isicily/m0-demonstrator/data/xml/iaph/iAph110305.xml | The Aphrodisian prize-list whose bibl chain reaches back to Sherard 1716 via Boeckh's CIG. See <bibl n="CIGII">Boeckh from Sherard, CIG 2758 A-G</bibl>. Three centuries of editorial scholarship preserved in machine-readable form. |
| §3 Stoyanova & Prag | isicily/m0-demonstrator/data/json/rubric_full.json | Population-scale rubric scoring of 10,249 source files. Look at per_corpus_aggregate → see the seven-axis profiles for each corpus. The "encoding traditions specialise" claim is right there in the numbers. |
| §3 Stoyanova & Prag · §5 Boschetti | isicily/m0-demonstrator/data/json/federation_lemmas_full.json | 12 MB · 59,641 attestations · 6,285 distinct lemmas across 4 corpora. CC-BY-4.0. This is the federation-scale linguistic substrate. Load it in Python; aggregate by corpus; aggregate by language; aggregate by lemma. Every entry preserves its file, line, surface, normalised surface, and provenance. |
| §4 Murano · §6 Bodel/Prag/Roueché | isicily/m0-demonstrator/data/json/federation_pelagios.ttl | 336 RDF triples: the Pelagios serialisation of the 56-record federation, every record linked to its Pleiades place URI. Open in a text editor, or load into an RDF tool. The actual LOD output format Bodel/Prag/Roueché advocate. |
| §5 Boschetti | isicily/m0-demonstrator/leiden-playground.html | The companion editor for trying Leiden+ syntax and seeing the EpiDoc output live. Open in a browser. Try typing the first line of any ISic file. See the XML appear on the right. |
| §6 Bodel/Prag/Roueché | isicily/m0-demonstrator/data/llm_ready/dataset_card.md | The HuggingFace-style dataset card for the federation lemma resource. FAIR principles operationalised: license: cc-by-4.0, citations, methods, openly redistributable. This is what a FAIR-compliant epigraphic data deposit looks like. |
| EDCS comparison (Cayless's "flat data") | inscription_databases/EDCS_ETL-master/data/2022_09_allProvinces/ | The 537,286-row SDAM ETL of the Epigraphic Database Clauss-Slaby (Latin Mediterranean). CC-BY-NC-SA-4.0. Look at EDCS_2022_dataset_metadata_SDAM.csv in the parent folder for what the flat-data side of the equation actually contains. |
| The KCL source corpora directly | kcl_tei/ (aphrodisias / cyrenaica / tripolitania) | The three KCL-edited corpora at their canonical locations. kcl_tei/cyrenaica/A.30.xml is the same Anastasius edict; kcl_tei/aphrodisias/iAph110305.xml is the same Sherard→Boeckh→Roueché chain. Open from either path; the EpiDoc XML is byte-identical. |
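The "load it in Python and aggregate" step for federation_lemmas_full.json might look like the sketch below. The real schema is not shown in these slides, so the field names `corpus`, `language`, `lemma`, and `surface` are assumptions, and the four entries are invented stand-ins for the 59,641 real ones.

```python
import json
from collections import Counter

# Invented sample in an assumed shape: a JSON list of attestation objects.
SAMPLE_JSON = """
[
  {"corpus": "ircyr",   "language": "grc", "surface": "θεοῦ",    "lemma": "θεός"},
  {"corpus": "ircyr",   "language": "grc", "surface": "καί",     "lemma": "καί"},
  {"corpus": "iaph",    "language": "grc", "surface": "θεῷ",     "lemma": "θεός"},
  {"corpus": "isicily", "language": "la",  "surface": "manibus", "lemma": "manes"}
]
"""


def aggregate(entries, field):
    """Count entries per value of one field; missing values go to 'unknown'."""
    return Counter(entry.get(field, "unknown") for entry in entries)


entries = json.loads(SAMPLE_JSON)

by_corpus = aggregate(entries, "corpus")
by_language = aggregate(entries, "language")
by_lemma = aggregate(entries, "lemma")

print(by_corpus.most_common())
print(by_language.most_common())
print(by_lemma.most_common(1))
```

For the real file, replace `json.loads(SAMPLE_JSON)` with `json.load` on an opened file handle, then inspect one entry first to see which keys it actually carries before aggregating.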
Open ISic000118.xml. Identify ten elements you have never seen before. For each, open the EpiDoc Guidelines at https://epidoc.stoa.org/gl/latest/ and look up what the element preserves. Then write a 10-line EpiDoc record for a Greek or Latin inscription of your choosing, using the same patterns. See what you can preserve at line, word, and character level. This is the moment the abstraction becomes a skill.
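An element inventory is a quick way to find the ten unfamiliar elements, and to check counts such as the number of `<change>` or `<w lemma=…>` elements. The sketch below uses only the Python standard library; the sample XML is an invented fragment and is not a valid EpiDoc file.

```python
import xml.etree.ElementTree as ET
from collections import Counter

SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc><sourceDesc><msDesc><physDesc>
      <material>marble</material>
    </physDesc></msDesc></sourceDesc></fileDesc>
    <revisionDesc>
      <change when="2020-01-01">placeholder change</change>
      <change when="2021-01-01">placeholder change</change>
    </revisionDesc>
  </teiHeader>
  <text><body><div type="edition"><ab>
    <w lemma="deus">dis</w> <w lemma="manes">manibus</w> <w>sacrum</w>
  </ab></div></body></text>
</TEI>"""


def local_name(element):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return element.tag.split("}")[-1]


def element_inventory(xml_source):
    """Count how often each element name occurs in the document."""
    root = ET.fromstring(xml_source)
    return Counter(local_name(el) for el in root.iter() if isinstance(el.tag, str))


def count_lemmatised_words(xml_source):
    """Count <w> elements that carry a non-empty lemma attribute."""
    root = ET.fromstring(xml_source)
    return sum(
        1 for el in root.iter()
        if isinstance(el.tag, str) and local_name(el) == "w" and el.get("lemma")
    )


inventory = element_inventory(SAMPLE)
for name, count in sorted(inventory.items()):
    print(f"{name:15} {count}")
print("lemmatised words:", count_lemmatised_words(SAMPLE))
```

To run it on a real file, pass the file's bytes (for example `pathlib.Path(path).read_bytes()`) instead of `SAMPLE`, then look up each unfamiliar name in the EpiDoc Guidelines.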