An atlas of inscription databases · 铭文数据库地图

Where the inscriptions actually live.

铭文真正居住在哪里。

A guided tour of the major projects that publish ancient-Mediterranean inscriptions on the web — the families they belong to, the encoding conventions they use, the identifiers that name them, and the seams where they fail to fit together.

古地中海铭文在网络上的主要发布项目导览，它们所属的"家族"、采用的编码惯例、用以命名记录的标识符系统，以及它们彼此对接处的缝隙。

Aggregators Regional EpiDoc Text-only Greek Italian / federation Papyri cousins ML / NLP datasets

25 slides · 6 families · ~30 databases · use ← → to navigate · click any underlined term

25 张幻灯 · 6 个家族 · 约 30 个数据库 · 使用 ← → 翻页 · 点击带下划线的术语

The orienting question · 起点

Behind every claim about "N inscriptions show…" stands a database, an encoding, a license, a scope

每一句"N 件铭文显示……"的背后，都站着一个数据库、一种编码、一份授权、一块范围

When you read a sentence like "of 12,000 surviving inscriptions from Roman North Africa, 38% mention an emperor", the number 12,000 is the residue of a chain of choices — which database the author queried, which fields they filtered, which encoding the editors used to mark "imperial reference," what license allowed bulk download, and which inscriptions fell outside the chosen scope.

当你读到一句话："罗马北非现存的一万两千件铭文中，有 38% 提及元首"，这个一万二的数字背后，是一条选择链：作者查的是哪个数据库？过滤了哪些字段？编辑用什么编码标记"元首指称"？什么授权允许了批量下载？哪些铭文落在了所选范围之外？

The thesis of this atlas: there is no single "the inscription database." There are families of projects with different goals (regional surveys, aggregators, text-search, ML training), different encodings (TEI EpiDoc, plain text, Dublin Core), different identifier schemes, and different licenses. Knowing the map helps you know what you can responsibly say.

本地图的论点：不存在"那个"铭文数据库。世界上有若干家族：它们目标不同（区域调查、聚合、文本检索、机器学习训练）、编码不同（TEI EpiDoc、纯文本、Dublin Core）、标识符系统不同、授权不同。看懂地图，才能知道哪些话可以负责任地说。

This atlas covers the major projects that the SDAM ETL pipelines (and most digital epigraphy work) draw on. It does not claim completeness — there are smaller and newer projects — but the families it identifies are stable.

本图覆盖 SDAM ETL 管道（以及大多数数字铭文学工作）所依赖的主要项目。它不声称完备，还有更小、更新的项目，但所标识的"家族"分类是稳定的。

Disciplinary lineage (skim before reading the families): the print authority chain from CIL (Mommsen 1853–) and IG (Boeckh 1828–) → PHI CD-ROMs (1985) → web databases (EDH, EDCS, 1995–) → TEI EpiDoc + IIIF (2005–) → open-data + reproducibility (FAIR, 2015–). → Full lineage walk-through in Paper § 1.5学科谱系（在读各家族前略览）：印本权威链 CIL（Mommsen 1853–）与 IG（Boeckh 1828–）→ PHI 光盘（1985）→ Web 数据库（EDH、EDCS，1995–）→ TEI EpiDoc + IIIF（2005–）→ 开放数据 + 可复现（FAIR，2015–）。→ 完整谱系见论文版 § 1.5

Foundations · 基础 · 1/6

What is XML?

什么是 XML？

XML (eXtensible Markup Language) is a way of writing structured data so that both humans and computers can read it. Every piece of content sits inside a labelled box — a tag — and tags can contain other tags, like Russian dolls. The labels carry meaning: a computer reading <date>1883</date> knows the number is a date, not just a string.

XML（可扩展标记语言）是一种把结构化数据写成人与计算机都能读懂的格式。每段内容都装进一个有"标签"的盒子里，盒子可以套盒子，像俄罗斯套娃。标签本身携带语义：计算机读到 <date>1883</date>，就知道里头的数字是"年份"，而不仅仅是一串字符。

The structure

<inscription>
  <title>Stonecutter's sign</title>
  <date>1st century CE</date>
  <text language="latin">
    Tituli heic ordinantur
  </text>
  <text language="greek">
    στῆλαι ἐνθάδε τυποῦνται
  </text>
</inscription>

The vocabulary

Element — a tag pair: <title>…</title>
元素：一对标签：<title>…</title>
Attribute — a name=value pair on a tag: language="latin"
属性：标签内的"键＝值"对：language="latin"
Self-closing — empty element: <lb n="2"/>
自闭合：空元素：<lb n="2"/>
Namespace — a prefix that disambiguates: tei:body
命名空间：用前缀消除歧义：tei:body

Why epigraphy uses XML: editorial markup needs more than plain text. You need to mark "this letter was supplied by the editor" or "the date is between 117 and 138 CE with low confidence." Plain text cannot do this; XML can.

铭文学为何选 XML：编辑标记需要的不只是纯文本。你需要标"这一字母为编辑所补"，或"年代为公元 117–138 年，低置信度"。纯文本做不到，XML 可以。

Note for low-coding-literacy readers: XML files are just text files. You can open them in any text editor (Notepad, TextEdit, VS Code). The angle brackets are not magic — they are just a convention so a computer can find where one piece of information ends and the next begins.

给非程序员的提示：XML 文件就是纯文本文件，任何文本编辑器（记事本、TextEdit、VS Code）都能打开。尖括号没有魔法，只是一种约定，让计算机知道一段信息从哪开始、到哪结束。

Foundations · 基础 · 2/6

What are JSON, CSV, and tabular data?

什么是 JSON、CSV 与表格型数据？

CSV

Comma-Separated Values. The simplest tabular format: one row per line, columns separated by commas. Excel and pandas read it natively.

逗号分隔值。最简单的表格格式：一行一条记录，列用逗号分隔。Excel 和 pandas 都能直接读。

id,province,date_from,date_to
EDCS-22000882,Sicilia,-100,100
EDCS-22000883,Africa,150,200

JSON

JavaScript Object Notation. Records as key-value dictionaries; nested allowed. Good for APIs and structured records that don't fit a flat table.

JavaScript 对象表示法。记录写成"键-值"字典，可嵌套。适合 API 与"摆不平为表格"的结构化记录。

{
  "id": "EDCS-22000882",
  "province": "Sicilia",
  "date": {"from": -100, "to": 100},
  "text": ["Tituli heic", "στῆλαι ἐνθάδε"]
}

Parquet

A binary tabular format optimised for analysis at scale. SDAM's distribution format. Loads ~10× faster than CSV; readable in pandas, Arrow, R.

为大规模分析优化的二进制表格格式。SDAM 的分发格式。比 CSV 加载快约 10 倍；pandas、Arrow、R 都能读。

import pandas as pd
df = pd.read_parquet(
   'LIST_v0-3.parquet'
)
df.shape
# (~600000, 50)

The trade-off vs XML: CSV/JSON/Parquet are flatter than XML. They lose nested editorial markup but gain massive speed and direct loadability into analysis tools. Most "data analysis on inscriptions" works on a CSV/Parquet derivative; the original TEI XML stays as ground truth.

与 XML 的取舍：CSV / JSON / Parquet 比 XML 扁平。它们丢失了嵌套的编辑标记，但获得了极快的加载速度，能直接进分析工具。绝大多数"对铭文做的数据分析"实际是在 CSV / Parquet 衍生版上跑的；TEI XML 原件留作"真值"。

Real-world example: the SDAM ETL pipeline takes EDCS's web records, parses them, runs regex cleaning, and outputs EDCS_merged_cleaned_attrs_2022-09-12.json (537,262 records, 50 fields). That JSON dump is what every downstream analysis loads. The TEI XML remains in I.Sicily / IRT / IRCyr GitHub repos for any reader who wants to verify the pipeline's choices.

真实例子：SDAM ETL 管道抓取 EDCS 网页记录、解析、跑正则清洗，输出 EDCS_merged_cleaned_attrs_2022-09-12.json（53.7 万条、50 字段）。下游所有分析都是基于这份 JSON。TEI XML 原件保留在 I.Sicily / IRT / IRCyr 的 GitHub 仓库中，供想核查管道决策的读者使用。

Foundations · 基础 · 3/6

What is a schema? — RNG, DTD, Schematron

什么是 schema？RNG、DTD、Schematron

A schema is a contract that says: "an XML document is valid if and only if it follows these rules." It defines which elements may appear, in what order, with which attributes, and with which content. Without a schema, two editors will quietly use different conventions and the data will not be queryable.

schema（模式）是一份契约："此 XML 文档若且仅若遵守这些规则才算有效"。它规定：哪些元素可以出现、以什么顺序、带哪些属性、内容是什么。没有 schema，两位编辑会暗中采用不同惯例，整套数据就无法查询。

Schema language模式语言	Used for用途	In epigraphy在铭文学中
DTD (Document Type Definition)	oldest; defines elements + attributes	Original TEI EpiDoc DTD (P4) — superseded but still seen in older files
RNG (RELAX NG)	more expressive; richer constraints	The current TEI EpiDoc schema. `tei-epidoc.rng`
Schematron	rule-based assertions ("if X, then Y must…")	Project-specific checking rules. e.g. `ircyr-checking.sch`
XSD (XML Schema)	W3C standard; strong typing	Used by some library systems, less common in epigraphy

What you actually see at the top of an EpiDoc file

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://www.stoa.org/epidoc/schema/dev/tei-epidoc.rng"
             schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="http://www.stoa.org/epidoc/schema/dev/ircyr-checking.sch"
             schematypens="http://purl.oclc.org/dsdl/schematron"?>

Reading this: the file declares two schemas. The first (RNG) says "this file follows the EpiDoc structure." The second (Schematron) says "this file additionally follows the IRCyr project's checking rules." A validator runs both — RNG catches structural errors; Schematron catches "you used a Cyrene placeRef but the geographic region must then be Cyrenaica."

读法：文件声明了两份 schema。第一份（RNG）说"本文件遵循 EpiDoc 结构"；第二份（Schematron）说"本文件额外遵循 IRCyr 项目的检查规则"。校验器会同时跑两者，RNG 抓结构错误；Schematron 抓"你用了 Cyrene 的 placeRef，那么地理区域必须是 Cyrenaica"这类业务规则。

Why this matters: EpiDoc XML across different projects (I.Sicily, IRT, IRCyr) is interoperable because they all share the same RNG schema. Project-specific Schematron rules add local discipline without breaking that interoperability.

这为何要紧：不同项目（I.Sicily、IRT、IRCyr）的 EpiDoc XML 之所以可互操作，是因为它们共用同一份 RNG schema。项目自有的 Schematron 规则在不破坏互操作性的前提下，叠加了"本地纪律"。

Foundations · 基础 · 4/6

URI, URL, handle, DOI, ARK — names that should not break

URI、URL、handle、DOI、ARK，不该断的名字

A digital scholarly resource is only as durable as the name you cite it by. Different naming systems trade off readability, persistence, and institutional commitment.

数字学术资源的可持续性，取决于你引用它时所用的名字。不同的命名系统在"可读性""持久性""机构承诺"之间各有取舍。

System系统	Example例子	Strength优点	Weakness弱点
URL (a web address)	`http://sicily.classics.ox.ac.uk/inscription/ISic000470`	readable, clickable	breaks if the server moves; link rot is the rule
URI (broader concept)	any URL or URN	universal address	does not by itself promise resolution
Handle	`hdl.handle.net/1811/99008`	institutional commitment to keep resolving	requires institution to maintain handle service
DOI	`10.1515/jdh-2021-1004`	publisher-backed, citable in scholarship	cost to register; mostly publishers, not databases
ARK	`ark:/12345/foo`	library-grade, no fees	less common in epigraphy than libraries
URN:CTS	`urn:cts:greekLit:tlg0057.tlg010`	canonical citation for Greek/Latin literature	specific to the canonical-greekLit / Scaife world
TM-ID	`TM 79368`	universal join key for ancient world	not a URL by itself; resolves at trismegistos.org
Pleiades URI	`pleiades.stoa.org/places/462492`	canonical place identifier	limited to Mediterranean places

For your own work: when citing an inscription in a paper, prefer the most stable identifier the database offers. For TEI EpiDoc projects, this is usually a project URI like http://sicily.classics.ox.ac.uk/inscription/ISic000470. For records that descend from print, also cite the print edition (CIL X 7296) — print citations remain stable when websites disappear.

对你自己的工作：论文中引用铭文时，优先选数据库提供的最稳定标识符。对 TEI EpiDoc 项目通常是 http://sicily.classics.ox.ac.uk/inscription/ISic000470 这样的项目 URI。对源自印本的记录，同时引印本（CIL X 7296）—— 网页消失时，印本引用仍稳。

The honest reality: a 2018 study found that ~10–15% of cited URLs in classical-studies articles are dead within 5 years. Schemes like DOI and handle exist to push that rate down — but they cost money and require institutional maintenance.

诚实的现实：2018 年一项研究发现，古典学论文中所引 URL 在 5 年内"失效"的比例约为 10–15%。DOI、handle 等方案就是为压低这个比例而存在，但需要持续的资金与机构维护。

Foundations · 基础 · 5/6

What is an API? — REST, OAI-PMH, SPARQL

什么是 API？REST、OAI-PMH、SPARQL

An API (Application Programming Interface) is a way for one program to ask another for data, structured so the answer comes back in a predictable format. Most epigraphy databases offer one or more APIs alongside their human-readable web pages.

API（应用程序接口）是一个程序向另一个程序请求数据的方式，请求与响应都按可预测的格式来。大多数铭文数据库都在面向人类的网页之外提供一个或多个 API。

REST + JSON

The most common API style. You send an HTTP GET to a URL; the server returns JSON. Used by I.Sicily, papyri.info, OSU Knowledge Bank.

最常见的 API 风格。你向某个 URL 发 HTTP GET，服务器返回 JSON。I.Sicily、papyri.info、OSU 知识库都用这种。

GET https://kb.osu.edu/server/api/core
    /items/88887477-ad90-4c87-b9dd-…

→ {"name": "CIL 10.7296/IG 14.297…",
   "metadata": {…},
   "_links": {…}}

OAI-PMH

An older, library-grade API for harvesting full record sets in batches. Used by some archives and DSpace repositories.

较早的、图书馆级 API，用于按批量收割整套记录。一些档案库与 DSpace 知识库使用。

GET …/oai-pmh?verb=ListRecords
&metadataPrefix=oai_dc

→ <ListRecords>
    <record>…</record>
    <record>…</record>
  </ListRecords>

SPARQL endpoint

A query language for linked data. You write a graph pattern and the server returns matching subjects. Pleiades and a few EAGLE-aware projects expose SPARQL.

关联数据的查询语言。你写一段图谱模式，服务器返回匹配项。Pleiades 与几个 EAGLE 感知项目提供 SPARQL endpoint。

SELECT ?place ?name WHERE {
  ?place rdfs:label ?name .
  ?place pleiades:type "settlement" .
  ?place geo:in_region pleiades:Sicily .
}

Direct download

Many projects skip APIs entirely and publish bulk dumps as zip files on GitHub or institutional repositories. Less elegant, but reliably forever-archived.

许多项目干脆跳过 API，把批量数据以 zip 形式发布到 GitHub 或机构知识库。不够优雅，但更可靠地"永久存档"。

git clone https://github.com/
       ISicily/ISicily

# → 3,500 TEI XML files,
#   one per inscription

For SDAM and similar projects: the ETL pipelines combine all of these. They scrape HTML where there's no API, hit REST endpoints where one exists, harvest OAI-PMH from DSpace, and pull GitHub bulk dumps where projects publish them. No single API handles the field; ETL stitches the methods.

对 SDAM 等项目而言：ETL 管道把以上方法结合使用。无 API 处抓取 HTML，有 REST 处调 REST，DSpace 处用 OAI-PMH 收割，凡 GitHub 公开批量数据处直接拉取。没有任何一种单一 API 能覆盖整个领域；ETL 才是把它们缝合起来的工艺。

Foundations · 基础 · 6/6

What is a controlled vocabulary?

什么是受控词表？

A controlled vocabulary is a fixed list of allowed values for a field. Instead of letting an editor type "limestone" or "Limestone" or "Lst." or "lapis", you say: "the allowed values for material are this list of 14 lithic types." Queries become reliable; spelling variants stop polluting counts.

受控词表是某字段允许取值的固定列表。与其让编辑随手写 "limestone"、"Limestone"、"Lst." 或 "lapis"，不如规定："字段 material 的允许取值是 下列 14 种岩性。"于是查询变得可靠，拼写变体不再污染计数。

EpiDoc taxonomies

<classDecl><taxonomy>
  <category xml:id="photograph">
    <catDesc>Digital or digitized
            photographs</catDesc>
  </category>
  <category xml:id="representation">
    <catDesc>Digitized other
            representations</catDesc>
  </category>
</taxonomy></classDecl>

The IRCyr project declares its own taxonomy of figure types inside the file's encodingDesc.

IRCyr 项目在文件 encodingDesc 内声明自有的"图像类型分类"。

EAGLE Vocabularies

A federation-wide effort. Five shared vocabularies (object types, materials, decorations, dating criteria, writing types), each with multilingual labels. Reused across EDR, EDB, EAGLE network records.

联盟级协作。五套共享词表（对象类型、材质、装饰、定年判据、书写类型），每套都有多语标签。被 EDR、EDB、EAGLE 网络记录共用。

e.g. http://www.eagle-network.eu/voc/objtyp/lod/93 = "altar"

如 http://www.eagle-network.eu/voc/objtyp/lod/93 = "altar"（祭坛）

Field字段	Common controlled vocabularies常见受控词表
material	EAGLE materials, IRCyr materials, project-local lists
object type	EAGLE object-types, AAT (Getty's Art & Architecture Thesaurus)
place	Pleiades (~40,000 ancient places), GeoNames (modern)
language	ISO 639 codes (`la`, `grc`, `xpu`)
script	ISO 15924 codes (`Latn`, `Grek`)
person	VIAF (Virtual International Authority File), Lexicon of Greek Personal Names (LGPN)
execution	Stoa execution-types vocabulary (`scalpro`, `tessellis`, `graffito`…)

For users of the data: when joining records across databases, the join is most reliable on controlled-vocabulary fields. EAGLE's vocabularies are why an EDR record can be lined up with an EDB record. Free-text fields ("commentary," "notes") cannot be joined that way.

对数据使用者：跨数据库联接时，受控词表字段最可靠。EAGLE 词表正是 EDR 记录能与 EDB 记录对齐的原因。自由文本字段（"commentary"、"notes"）则无法这样联接。

The taxonomy · 分类

Six families — read this, then the rest of the atlas tiles in

六大家族，读完这页，全图就立起来了

Click any tile to jump to that family's detailed slide.

点击任一卡片，跳到该家族的详细幻灯。

Family 1 → Aggregators聚合器

Pull records from many sources, present a unified search. Wide coverage, shallow encoding.

从多源汇集，提供统一检索。覆盖广，编码浅。

e.g. EDCS, EDH, PHI · slide #f1-aggregators如 EDCS、EDH、PHI · 幻灯 #f1-aggregators

Family 2 → Regional EpiDoc projects区域 EpiDoc 项目

A fixed corpus, deeply marked-up in TEI XML, often by a single editorial team.

固定语料库，以 TEI XML 深度编码，多由单一编辑团队负责。

I.Sicily · IRT · IAph · IRCyr · IGCyr · IOSPE · slide #f2-regionalI.Sicily · IRT · IAph · IRCyr · IGCyr · IOSPE · 幻灯 #f2-regional

Family 3 → Text-only Greek databases纯文本希腊数据库

Find-the-Greek-string services. No images, minimal metadata. Strong on recall, weak on context.

"找希腊文字串"型服务。无图、元数据极少。检索召回率高，语境贫瘠。

e.g. PHI, McCabe corpora · slide #f3-text-only如 PHI、McCabe 语料 · 幻灯 #f3-text-only

Family 4 → Italian-language federation意大利语联盟

EAGLE-network partners covering Italian regions. Italian metadata, photographs, partner-IDs.

EAGLE 联盟下的意大利语合作伙伴。意大利语元数据、照片、合作 ID。

EDR · EDB · EDV · EDH-Italia · slide #f4-italianEDR · EDB · EDV · EDH-Italia · 幻灯 #f4-italian

Family 5 → Papyrology cousins莎草纸学表亲

Same TEI EpiDoc family, different material. Sharp ID systems, especially TM and HGV.

同属 TEI EpiDoc 家族，但介质是莎草纸。ID 系统犀利，尤以 TM 与 HGV 为代表。

DDbDP · HGV · APIS · papyri.info · slide #f5-papyriDDbDP · HGV · APIS · papyri.info · 幻灯 #f5-papyri

Family 6 → ML / NLP datasets机器学习数据集

Re-encoded for model training. Tokenized text, no images, train/val/test splits.

为模型训练重新编码。已切分 token，无图像，分训练／验证／测试集。

iPHI / Ithaca · PHI-Latin · slide #f6-mliPHI / Ithaca · PHI-Latin · 幻灯 #f6-ml

How they relate: Family 2 produces the deepest records; Family 1 swallows them and re-publishes with shallow joins; Family 3 ingests text only; Family 4 mirrors Family 1 in Italian; Family 5 is a parallel TEI EpiDoc world for papyri; Family 6 distills text for ML training.

彼此关系：家族二产出最深的记录；家族一吞下后浅联接重发；家族三只取文本；家族四是家族一的意大利语镜像；家族五是莎草纸学的 TEI EpiDoc 平行宇宙；家族六把文本蒸馏为 ML 训练数据。

Family 1 · 家族一 — Aggregators · 聚合器

The wide-coverage databases

广覆盖数据库

EDCS

name

Epigraphik-Datenbank Clauss-Slaby

scope

any Latin inscription

records

~537,000

encoding

flat record; raw + conservative + interpretive cleaning fields

images

~26.3 % of records have a photo URL

licence

site terms; ETL bulk dumps via SDAM partners

link

edcs.hist.uzh.ch ↗

EDH

name

Epigraphic Database Heidelberg

scope

Roman provinces (excl. Italy)

records

~82,000

encoding

structured fields; clean / diplomatic / transcription

images

very high — photographs & squeezes

licence

CC BY-SA

link

edh.ub.uni-heidelberg.de ↗

UBI ERAT LUPA

name

Ubi Erat Lupa — image database for Roman stone monuments

scope

Roman stone monuments (mostly funerary, votive, sculptural)

records

~50,000+

encoding

image-first; structured metadata around the picture

images

core asset of the project

licence

site terms; image rights restricted

link

lupa.at ↗

What aggregators are good at: "find me all Latin inscriptions that mention X" — recall-first, breadth-first. What they are not: a faithful witness to any one stone. The deeper you go on a single record, the more you should be reading the original publication, not the aggregator's row.

聚合器擅长的："给我所有提到 X 的拉丁铭文"，召回率优先、广度优先。它们不擅长的：对任何单一石头的忠实见证。越深入单条记录，就越应该回到原始出版物，而不是聚合器的那一行。

See also 另见

DB Literacy · Image-question slide · 26.3% empirical EDCS photo coverageEDCS 实证 26.3% 照片覆盖率
DB Literacy · EDH absence · why a Sicilian stone falls outside EDH's scope为何一块西西里石头不在 EDH 范围内
Visual · Extract slides · how SDAM ETL pulls from EDH and EDCSSDAM ETL 如何从 EDH/EDCS 抓取
Reference · EDH_ETL + EDCS_ETL repos · the actual scrapers实际的抓取器

Family 2 · 家族二 — Regional EpiDoc · 区域 EpiDoc

The deep, narrow corpora

深而窄的语料库

Each project covers one region or one corpus, in TEI EpiDoc XML, with rich editorial metadata. Most are in the King's College London family (or its diaspora) and share a common structural vocabulary.

每个项目覆盖一个地区或一种语料，以 TEI EpiDoc XML 编码，附丰厚的编辑元数据。多数项目出自伦敦国王学院家族（及其海外分支），共享一套结构化词表。

Project项目	Coverage范围	~records条数	Languages语言	Licence
I.Sicily	Sicily	~3,500	Greek, Latin, Punic, Hebrew	CC BY 4.0
IRT	Roman Tripolitania	~1,618	Latin, Greek, Punic, Neo-Punic	CC BY 2.0 UK
IRCyr	Roman Cyrenaica (Libya)	~2,360	Greek, Latin	CC BY 2.0 UK
IGCyr / GVCyr	Greek Cyrenaica	~1,500	Greek	CC BY-NC 4.0
IAph 2007	Aphrodisias	~1,489	Greek, Latin	CC BY 2.5
IOSPE	Northern Black Sea Coast	varies (multi-volume)	Greek, Russian metadata	CC BY 4.0
InsAph / Vindolanda	Vindolanda tablets	~700	Latin (cursive)	CC BY-NC 4.0
AshLI	Ashmolean Latin inscriptions	~250	Latin, Greek	CC BY-SA
MAMA XI	Phrygia & Lykaonia	~400	Greek, Latin	CC BY-NC-SA
EpiDoc First1KGreek	various Greek-corpus projects	varies	Greek	CC BY 4.0

What they share: the TEI EpiDoc schema; structured fields for material, dimensions, find-spot, dating, transcription with editorial certainty markers; bibliography and apparatus criticus. Records are typically one XML file per inscription, hosted on GitHub, citable by stable URI.

它们共享的是：TEI EpiDoc 模式；为材质、尺寸、出土地、定年、附"编辑确定性"标记的转写而设的结构化字段；参考文献与异文校勘记。记录通常是"一条铭文 = 一份 XML"，托管在 GitHub，可通过稳定 URI 引用。

See also 另见

↓ msDesc deep-dive · physDesc · layoutDesc · history · certainty · entities · apparatus · revisionDesc
↓ Case study: IRCyr A.1 Noah caption · ↓ IGCyr 002200 archaic graffito · ↓ IRT0001 sample · ↓ IAph 010001 lemma tagging · ↓ IOSPE bilingual encoding
Case Study (full edition) · I.Sicily ISic000470 walked across 5 databasesI.Sicily ISic000470 在五库间对照
DB Literacy · I.Sicily vs EDCS panel · side-by-side, what each preserves并排：各自保留什么

EpiDoc deep-dive · 1/8

The `<msDesc>` block — how the stone is described

`<msDesc>` 区块，石头是如何被描述的

msDesc ("manuscript description," but used for inscribed objects too) is the spine of every EpiDoc record. It nests four blocks:

msDesc（"manuscript description"，写本描述，但也用于铭文对象）是每条 EpiDoc 记录的主干。它嵌套四个区块：

<msDesc>
  <msIdentifier>       ← where the object lives now (museum, collection, inv. number)
    <repository ref="institution.xml#db932">Apollonia Museum</repository>
  </msIdentifier>
  <msContents>          ← what kind of text is on it (a one-line summary)
    <summary><seg>Dedication.</seg></summary>
  </msContents>
  <physDesc>            ← physical: material, dimensions, layout, lettering
    <objectDesc> ... </objectDesc>
    <handDesc> ... </handDesc>
  </physDesc>
  <history>             ← origin (when/where made), provenance (when/where seen)
    <origin> ... </origin>
    <provenance type="found"> ... </provenance>
    <provenance type="observed"> ... </provenance>
  </history>
</msDesc>

What you can ask, given this structure: "show me all inscriptions on limestone, found at Cyrene, between 100–300 CE, currently held at the Apollonia Museum." Each clause maps to one nested element. Without msDesc, this query would be impossible without the editor doing manual filtering.

有了这套结构，你能问："列出所有材质为石灰岩、出土于昔兰尼、年代在 100–300 CE 之间、当前藏于 Apollonia 博物馆的铭文。"每一个子句都映射到一个嵌套元素。没有 msDesc，这种查询离不开编辑的人工筛选。

Notice the ref="institution.xml#db932": repository names are not free text — they refer to a project-internal authority file. This is how EpiDoc projects keep "Apollonia Museum" written exactly the same way across thousands of records. Other commonly externalized lists: execution.xml, findspot.xml, material-search.xml.

注意 ref="institution.xml#db932"：repository 名不是自由文本，而是引用项目内部的权威文件。这正是 EpiDoc 项目能在几千条记录中把"Apollonia Museum"写得完全一致的方法。其他常被外置的列表：execution.xml、findspot.xml、material-search.xml。

EpiDoc deep-dive · 2/8

`<physDesc>` — material, dimensions, lettering

`<physDesc>`，材质、尺寸、字形

A real physDesc from IRCyr's record C24400 (a fragmentary building inscription from the temple of Artemis at Cyrene):

来自 IRCyr 记录 C24400（昔兰尼阿尔忒弥斯神庙的一段残断建筑铭文）的真实 physDesc：

<physDesc>
  <objectDesc>
    <supportDesc>
      <support>
        <p><material>Limestone</material> block with moulding above,
           from a Doric <objectType>entablature</objectType>,
           broken in two and damaged at the lower right corner
           (<dimensions>
             <width>1.03</width>
             <height>0.59</height>
             <depth precision="low">0.44</depth>
           </dimensions>).</p>
      </support>
    </supportDesc>
    <layoutDesc>
      <layout><rs type="execution" key="scalpro">Inscribed</rs> on one face.</layout>
    </layoutDesc>
  </objectDesc>
  <handDesc>
    <handNote>Second century: <height>0.07</height>; ligature in line 2</handNote>
  </handDesc>
</physDesc>

What this records:

记录了什么：

material = Limestone (could be controlled vocab)
material = 石灰岩（可由受控词表）
objectType = entablature, with prose context "from a Doric…"
objectType = 檐部，附上下文"从一座多立克式…"
dimensions = 1.03m × 0.59m × 0.44m. The depth has precision="low" — the editor was uncertain.
dimensions = 1.03m × 0.59m × 0.44m。深度带 precision="low"，编辑者不太确定。
execution with key="scalpro" — a controlled-vocab term (sculpted/incised). External taxonomy.
execution 带 key="scalpro"，受控词表中的"刻凿"项。外置词表。
letter height = 0.07m, plus a free-text note about ligatures.
字母 height = 0.07m，加自由文本备注（连字）。

Compare to EDCS for the same kind of object: EDCS would record material as a single string like "lapis", dimensions are absent entirely, and lettering height is absent. The depth + breadth trade-off again — IRCyr knows much more per record but is much smaller.

与 EDCS 对照：EDCS 会把材质记成一串 "lapis"；尺寸字段完全没有；字母高度也无。深 vs 广的取舍再次显形，IRCyr 单条记录懂得多得多，但总条数小得多。

EpiDoc deep-dive · 3/8

`<layoutDesc>` — where the letters sit on the surface

`<layoutDesc>`，字母在表面的位置

"Layout" in epigraphy means: how is the inscribed text spatially arranged on the object? Is it in one block or several? Is there an iconographic field next to it? Are there frames, dividers, vacant spaces? IRCyr's record A00300 (a Christian mosaic caption for an image of Noah) shows what this looks like:

铭文学中的"layout"指：刻文如何在对象表面空间分布？是一段还是数段？旁边有无图像区？是否有边框、分隔、空白？IRCyr 记录 A00300（"诺亚释鸽"马赛克的希腊文 caption）展示了这是什么样：

<layoutDesc>
  <layout>
    <rs type="execution" key="tessellis">Mosaic</rs> lettering, in a panel
    representing <rs type="iconography">Noah</rs> as he releases the
    <rs type="iconography">dove</rs> from the <rs type="iconography">Ark</rs>.
  </layout>
</layoutDesc>

The layered information: execution="tessellis" = "made of mosaic tesserae" (controlled vocabulary). Three iconographic fields tagged: Noah, dove, Ark. Each iconography tag is a hook for cross-image searches: "show me all Christian mosaics in IRCyr that depict the dove."

分层信息：execution="tessellis" = "由马赛克小方砖拼成"（受控词表）。三个图像学标签被打上：诺亚、鸽子、方舟。每个 iconography 标签都是一条跨图像检索的"钩子"：可问"列出 IRCyr 中所有描绘鸽子的基督教马赛克"。

What no aggregator captures: the relation between text and image. PHI strips the iconographic context entirely. EDCS records mosaic inscriptions as plain text. Only EpiDoc projects keep the spatial-iconographic field. If you want to study how Christian inscriptions were embedded in their visual programmes, you must work in TEI EpiDoc.

聚合器从不捕获的：文字与图像的关系。PHI 把图像语境彻底剥离；EDCS 把马赛克铭文记成纯文本。只有 EpiDoc 项目保留了"空间-图像场"这一字段。若你想研究基督教铭文如何嵌入其视觉方案中，就必须在 TEI EpiDoc 上工作。

EpiDoc deep-dive · 4/8

`<history>` — when made, when found, where now

`<history>`，何时所作、何时被发现、现存何处

Inscriptions have biographies. They are made at one time and place; they are found later, sometimes much later; they are then moved (often more than once) and may be on display, in storage, or lost. EpiDoc records this with three sub-elements inside <history>:

铭文有自己的"传记"。它在某一时间地点被造出，后来被发现（往往晚得多），又被搬动（不止一次），如今可能在展厅、库房，或失踪。EpiDoc 在 <history> 中以三个子元素记录这一切：

<history>
  <origin>             ← when and where made
    <origPlace ref="...">Findspot.</origPlace>
    <origDate notBefore="0117" notAfter="0138"
              precision="low" evidence="reign">C.E. 117-138</origDate>
  </origin>
  <provenance type="found">     ← when and where excavated/observed
    <p><placeName type="ancientFindspot"
        ref="https://www.slsgazetteer.org/909">Cyrene</placeName>:
       <placeName type="monuList"
        ref="https://www.slsgazetteer.org/1051">Temple of Artemis</placeName>;
       recorded in <date type="found">1861</date>.</p>
  </provenance>
  <provenance type="observed">   ← currently visible at this location
    <p>Temple of Artemis: standing on the wall to the right of the entrance.</p>
  </provenance>
</history>

What this lets you ask: "show me all inscriptions found at Cyrene before 1900, currently lost." (You filter provenance/@type="found" with date < 1900, but find no provenance/@type="observed".) Or: "match origDate to provenance: how many years between an inscription's making and its rediscovery?" Possible because origin and provenance are encoded as separate, structured elements.

能问的问题："列出昔兰尼出土、年代在 1900 年以前、现已失踪的铭文。"（筛 provenance/@type="found" 且日期 < 1900 而无 provenance/@type="observed"。）或："把 origDate 对照 provenance：刻成与重见天日之间相隔多少年？"之所以能问，是因为 origin 与 provenance 各自是独立的结构化元素。

Aggregator equivalent: EDCS has a single place field that conflates origin and findspot. EDH separates them better. Only TEI EpiDoc keeps the full biography.

聚合器对应：EDCS 用一个 place 字段，把"刻作处"与"出土处"混在一起。EDH 分得稍好。只有 TEI EpiDoc 保留了完整的传记。

EpiDoc deep-dive · 5/8

Certainty markers — honest about what we don't know

置信度标记，对未知保持诚实

EpiDoc treats uncertainty as a first-class citizen. Almost every dating, attribution, and reading attribute can carry confidence markers, so a downstream analysis can either filter on them or propagate them.

EpiDoc 把"不确定性"作为一等公民对待。几乎每一个定年、归属、释读属性都可携带置信度标记，下游分析可据此过滤或传播。

Attribute属性	What it marks标的对象	Example示例
`cert="low"`	low confidence in this value对此值置信度低	`<persName cert="low">Apollinaris</persName>`
`precision="low"`	value is approximate, not exact值为近似而非精确	`<depth precision="low">0.44</depth>`
`evidence="…"`	on what basis was this attributed以什么依据归属	`evidence="lettering"` / `"reign"` / `"context"`
`resp="#jmr"`	who is responsible for this judgement此判断由谁负责	`<date resp="#jmr">…</date>` (J. M. Reynolds)
`notBefore` / `notAfter`	date range bounds年代区间端点	`notBefore="0117" notAfter="0138"`
`atLeast` / `atMost`	measurement bounds测量值区间	`<height atLeast="0.003" atMost="0.006">0.003-0.006</height>`
`match=".."`	scope of the certainty置信度的作用域	`<certainty cert="low" locus="value" match=".."/>`

Why this drives method: the JDH 2021 paper's tempun toolkit propagates date uncertainty by Monte-Carlo-sampling each inscription's date range many times — only possible because notBefore and notAfter are separately encoded. If you collapse the date to a single number, you lose the basis for honest uncertainty propagation.

这如何驱动方法：JDH 2021 中的 tempun 工具包通过对每条铭文的年代区间做蒙特卡洛抽样来传播不确定性，这件事之所以可能，是因为 notBefore 与 notAfter 各自被独立编码。若把日期塌成一个数，就失去了"诚实地传播不确定性"的基础。

EpiDoc deep-dive · 6/8

Named entities — people, places, gods

命名实体，人、地、神

EpiDoc tags every proper name in the transcription, classifying them and linking to authority files. This makes "all inscriptions mentioning emperor Hadrian" a structured query rather than a string-search.

EpiDoc 把转写中的每一个专名都打上标签、分类、并链到权威文件。这让"所有提到哈德良元首的铭文"成为一项结构化查询，而非字符串搜索。

Element元素	What it marks标的对象	Subtype examples子类型示例
`<persName>`	a person人名	`type="emperor"` / `"divine"` / `"attested"` / `"aphrodisian"`
`<placeName>`	a place地名	`type="ancientFindspot"` / `"modernFindspot"` / `"monuList"`
`<orgName>`	an organisation组织名	e.g. a guild, college, civic body
`<name>`	a single name (within a persName)一段单独的名字（嵌于 persName 内）	`type="surname"` / `type="forename"`
`<geogName>`	a geographical region地理区域	`type="ancientRegion" key="Cyrene"`
`<rs>`	"referencing string" — generic semantic tag"指涉串"，通用语义标签	`type="iconography"` / `"textType"` / `"execution"`

A real example from IAph 010001 (a Greek inscription mentioning the Roman dictator Caesar and the goddess Aphrodite):

真实示例（IAph 010001，一段希腊铭文，提到罗马独裁者凯撒与阿芙罗狄忒女神）：

<persName type="emperor">
  <name reg="Καῖσαρ"><supplied reason="lost" cert="low">Καῖσαρ</supplied></name>
</persName>
...
<persName type="aphrodisian">
  <name reg="Γάϊος"><supplied reason="lost">Γάϊος</supplied></name>
  <name reg="Ἰούλιος"><supplied reason="lost">Ἰούλιος</supplied></name>
  <name reg="Ζωΐλος"><supplied reason="lost">Ζωΐλος</supplied></name>
</persName>
...
<persName type="divine">
  <name reg="Ἀφροδίτη"><supplied reason="lost">Ἀφροδείτης</supplied></name>
</persName>

The reg attribute normalises the surface form to a "regularised" form for searching. So reg="Ἀφροδίτη" on a stone that actually has Ἀφροδείτης means "this is the goddess Aphrodite, normalised." Now you can ask "all inscriptions mentioning Aphrodite" and get hits even where the spelling varies.

reg 属性把表面形归一化为标准形以便检索。如石面写 Ἀφροδείτης，但 reg="Ἀφροδίτη" 表示"这是阿芙罗狄忒女神的归一形"。于是可问"所有提到阿芙罗狄忒的铭文"，即使拼法不同也能命中。

EpiDoc deep-dive · 7/8

The apparatus criticus — recording editorial disagreements

异文校勘记，记录编辑间的分歧

An ancient inscription is rarely transcribed identically by every editor who has read it. The apparatus criticus is the place where competing readings are recorded — Editor A reads x, Editor B reads y, the present editor follows A. EpiDoc represents this as a special <div type="apparatus"> block.

同一块铭文很少能让所有读过它的编辑给出完全一致的转写。"异文校勘记"就是用来记录互相竞争的释读，编辑 A 读 x，编辑 B 读 y，本版从 A。EpiDoc 用一个特殊的 <div type="apparatus"> 区块来呈现。

<div type="apparatus">
  <p>Line 2: <app>
    <lem resp="#jmr">Apolli<supplied reason="lost">naris</supplied></lem>
    <rdg resp="#jbwp">Apolli<supplied reason="lost">nius</supplied></rdg>
  </app>.</p>
  <p>For the supplements, compare the partner inscription
     <ref type="inscription" n="1581" href="010038">1.38</ref>.</p>
</div>

Reading this: in line 2, the present editor (Joyce M. Reynolds) reads "Apollinaris"; an alternative editor (J. B. Ward-Perkins) read "Apollinius." The <app> tag wraps the disputed passage; <lem> ("lemma") is the chosen reading; <rdg> ("reading") is each alternative. Each carries resp="…" linking to an editor's xml:id.

读法：第 2 行中，本版编辑（Joyce M. Reynolds）读为 "Apollinaris"；另一编辑（J. B. Ward-Perkins）读为 "Apollinius"。<app> 标签包住有争议段落；<lem>（lemma 当选释读）；<rdg>（reading 备选释读）。每一个都带 resp="…"，链到某编辑的 xml:id。

What this enables: "show me all inscriptions where two editors disagree on a personal name." That query is impossible against EDCS or PHI (which collapse the apparatus into the main text or omit it entirely). It is a single XPath against EpiDoc XML.

由此能做什么："列出所有'两位编辑对人名释读不一致'的铭文。"此查询在 EDCS 或 PHI 上无解（它们把校勘并入正文或干脆删除）。在 EpiDoc XML 上则只是一行 XPath。

EpiDoc deep-dive · 8/8

`<revisionDesc>` — the digital record's biography

`<revisionDesc>`，数字记录自身的传记

Just as the stone has a history, the digital record about it has one too. Every meaningful change to a TEI file is logged with date, editor, and a short note. This is editorial transparency at its most basic: any reader can see who decided what, when.

石头有历史，关于它的数字记录也有。TEI 文件每一次有意义的修改，都附日期、编辑者、简短说明。这是最基本的编辑透明：任何读者都能看清"谁、何时、做了什么决定"。

<revisionDesc>
  <change when="2021-02-20" who="CMR">adapted</change>
  <change when="2013-08-22" who="TW">tagged words, names and placenames</change>
  <change when="2012-11-05" who="GB">moved text-constituted-from to profileDesc/creation</change>
  <change when="2012-10-16" who="GB">broke provenance events into multiple provenance elements</change>
  <change when="2010-08-18" who="GB">Converted from TEI P4 (EpiDoc DTD v. 6) to P5 (EpiDoc RNG schema v. 8)</change>
  <change when="2009-08-24" who="RV">Added Figures</change>
  <change when="2008-09-09" who="ZA">converted using CHET-C</change>
</revisionDesc>

What you can read out of this: the IRT0001 record was originally created in TEI P4 (older), converted to P5 in 2010, had figures added in 2009, broke its provenance into separate elements in 2012, gained word/name/place tagging in 2013, and was last adapted in 2021. The digital record's history is itself a research object — encoding fashions changed; this file evolved with them.

能读出的信息：IRT0001 记录最初以 TEI P4（旧版）创建，2010 年转为 P5；2009 年增图；2012 年把 provenance 拆为独立元素；2013 年加上词/人名/地名标注；2021 年最后一次调整。数字记录自身的历史本身就是一项研究对象，编码风潮变迁，本文件随之演进。

For citation: when citing a TEI EpiDoc record, modern practice is to include the last revision date alongside the URI. The data may have changed since you first read it; the revision-history makes that explicit.

引用时：引用一份 TEI EpiDoc 记录，现代规范是把 最近修订日期 与 URI 一并列出。读到之后数据可能已变；revision-history 把这件事写明。

Family 2 · sample · 实例

A regional EpiDoc record up close — IRT0001

看一份区域 EpiDoc 记录，IRT0001

The Inscriptions of Roman Tripolitania (IRT) project: 1,618 records from Roman-period Libya. Each is a TEI XML file with this structure:

罗马时期 Tripolitania 铭文项目（IRT）：1,618 条来自利比亚罗马时期的铭文。每条是一份 TEI XML，结构如下：

<TEI xml:id="IRT0001" xml:lang="en">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Reference to Apollo (?)</title>
        <editor xml:id="jmr">J. M. Reynolds</editor>
      </titleStmt>
      <publicationStmt>
        <publisher>Society for Libyan Studies</publisher>
        <idno type="filename">IRT0001</idno>
        <availability><p>CC BY 2.0 UK</p></availability>
      </publicationStmt>
      <sourceDesc><msDesc>
        <msIdentifier><repository>Findspot</repository></msIdentifier>
        ...physical description, layout, dimensions, dating...
      </msDesc></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text><body>
    <div type="edition" xml:lang="la">...transcription with line numbers, brackets, certainty markers...</div>
    <div type="translation">...</div>
    <div type="commentary">...</div>
    <div type="bibliography">...</div>
  </body></text>
</TEI>

Notice the four <div type=…> blocks: edition (transcription), translation, commentary, bibliography. This is the EpiDoc convention. Every regional project follows it. An aggregator like EDCS, by contrast, flattens these into separate columns, often losing the editorial-certainty markup.

注意四个 <div type=…> 区块：edition（转写）、translation（译文）、commentary（评注）、bibliography（参考文献）。这是 EpiDoc 的惯例。每个区域项目都遵循它。聚合器（如 EDCS）则把这些字段铺平为独立列，常常在过程中丢失"编辑确定性"标记。

Family 2 · sample · 实例 — IAph 2007

The deep encoding extreme — word-by-word lemmatization in IAph

深度编码的极致，IAph 中逐词的 lemma 标注

The Aphrodisias Inscriptions project (IAph 2007 — Joyce M. Reynolds, KCL) goes one step further than most regional projects: every word in every Greek transcription carries a <w lemma="…"> wrapper that identifies its dictionary form. This makes morphological queries possible at scale — without re-parsing the text.

阿芙罗狄西亚铭文项目（IAph 2007，KCL，Joyce M. Reynolds 主编）比大多数区域项目更进一步：每个希腊文转写中的每个词都包在 <w lemma="…"> 元素里，标注其辞典形（lemma）。这使得"按词形查询"可以批量进行，不需要重新分词。

<ab><lb n="1"/>
<w lemma="οὗτος"><supplied reason="lost">οὗτος</supplied></w>
<w lemma="ὁ"><supplied reason="lost">ὁ</supplied></w>
<w lemma="τόπος"><supplied reason="lost">τόπο</supplied><unclear reason="damage">ς</unclear></w>
<w lemma="ἱερός">ἱερὸς</w>
...

Reading this: "οὗτος" is reconstructed (lost on the stone, supplied by the editor); "τόπος" is partly lost and partly damaged; "ἱερός" is intact. The lemma values let you ask "how often does ἱερός occur in IAph?" without re-tokenizing.

读法："οὗτος" 是补足（石上已失，由编辑补；"τόπος" 部分缺失、部分残损；"ἱερός" 完整保留。lemma 标注让你能够直接问："ἱερός 在 IAph 中出现多少次？"，而不需要重新切词。

Trade-off: This depth costs editorial labour. IAph 2007 has 1,489 records — a large team's lifetime work. EDCS has 537,000 records, but no lemma tagging at all. Depth and breadth are inversely correlated.

取舍：这种深度需要编辑投入大量人工。IAph 2007 收 1,489 条，一支团队几代人的工作。EDCS 收 53.7 万条，但完全没有 lemma 标注。深度与广度成反比。

Case study · 案例 — IRCyr A.1

A Christian mosaic caption — "Noah and the Dove" at Apollonia

基督教马赛克 caption，阿波罗尼亚的"诺亚与鸽"

From the East Church at Apollonia (Cyrenaica), 6th century CE: a mosaic floor depicting Noah releasing the dove from the Ark, with a Greek caption. Read this record as one example of how EpiDoc captures genre + visual programme + Christian context.

来自昔兰尼加阿波罗尼亚的东教堂，公元 6 世纪：一幅马赛克地板，描绘诺亚从方舟中释放鸽子，旁附希腊文 caption。把这条记录当作一个范例：EpiDoc 如何同时记录"体裁 + 视觉方案 + 基督教语境"。

Identifiers

<idno type="filename">A.1</idno>
<idno type="ircyr2012">A00300</idno>

Two parallel IDs: project filename + project's own 2012 numbering. No TM-ID — the inscription is too local for that registry.

两个并行 ID：项目文件名 + 项目 2012 年的编号。无 TM-ID，该铭文对那套登记来说过于在地。

Genre + content

<title>
  <rs type="textType">Caption</rs>
  for image of Noah
</title>

textType="Caption" distinguishes this from a dedication, an epitaph, an acclamation, etc. Joinable with all other "captions" in the corpus.

textType="Caption" 把它与"题献""墓志""欢呼语"等区别开。可与语料中其他所有"caption"做联接。

Layout — text and image as a single iconographic field

<layout>
  <rs type="execution" key="tessellis">Mosaic</rs> lettering, in a panel
  representing <rs type="iconography">Noah</rs> as he releases the
  <rs type="iconography">dove</rs> from the <rs type="iconography">Ark</rs>.
</layout>

Dating — sixth century, by mosaic style

<origDate notBefore="0501" notAfter="0600"
          precision="low" evidence="mosaic">Sixth century CE</origDate>

Notice the evidence="mosaic" — this is the chain of reasoning made queryable. You could ask "all mosaic-dated Christian captions in IRCyr."

注意 evidence="mosaic"：把"推理链"做成可查询。可以问"IRCyr 中所有以马赛克风格定年的基督教 caption"。

Keywords — formal classification

<keywords scheme="IRCyr">
  <term><geogName type="ancientRegion" key="Cyrene">Cyrenaica</geogName></term>
  <term><geogName type="modernCountry" key="LY">Libya</geogName></term>
  <term><placeName type="modernFindspot"
                ref="http://sws.geonames.org/81584">Marsa Suza</placeName></term>
  <term>Christian</term>
</keywords>

The full record exists in your local ~/Documents/epidoc/cyrenaica/A.1.xml. Across all of IRCyr's 2,360 records, this same structural vocabulary applies — which makes corpus-level questions ("how many Christian inscriptions in 6th-century Cyrenaica are mosaic captions?") into single XPath queries.

完整记录就放在你本地的 ~/Documents/epidoc/cyrenaica/A.1.xml。IRCyr 全部 2,360 条记录都遵循同一套结构化词汇，因此"6 世纪昔兰尼加有多少基督教马赛克 caption"这种语料级问题，化作一行 XPath 查询。

Case study · 案例 — IGCyr 002200

An archaic graffito — a kothon scratched ca. 500 BCE

一段古风涂刻，约公元前 500 年的 kothon 刻文

A different genre: a Greek dedication scratched onto a Corinthian-ware kothon (a small ritual jug), found at Taucheira (Cyrenaica) in 1963–65. The IGCyr record (002200) is shorter than the Roman entablature in the previous case study — but the same structural skeleton is there.

另一个体裁：一段希腊文奉献铭文，刻于一只科林斯式 kothon（小祭祀壶）的器壁上，1963–65 年间在塔乌克拉（昔兰尼加）出土。IGCyr 记录（002200）比上一案例的罗马檐部短得多，但同一套结构骨架仍然在场。

physDesc — fragmentary support

<support><p>Two adjacent fragments of a
  <objectType>kothon</objectType>,
  <material>Corinthian ware</material>
  (<dimensions>
    <width atLeast="0.074">0.074-</width>
    <height/><depth/>
  </dimensions>).</p></support>

Width recorded as atLeast="0.074" — a one-sided bound, since the object is broken. Height and depth are empty elements: explicit "we don't know."

宽度记为 atLeast="0.074"，单边下界，因物已残。高、深为空元素：显式表达"未知"。

handDesc — letter forms as evidence

<handNote>
  <height atLeast="0.003" atMost="0.006">
    0.003-0.006</height>;
  epsilon with three equal bars,
  straight iota,
  rho without tail.
</handNote>

Letter shapes — three-bar epsilon, straight iota, tail-less rho — are the basis of the date estimate. Encoded as prose, but cite-able.

字母形态（三条平行 epsilon、笔直 iota、无尾 rho）是定年依据。以散文记录，但可被引用。

execution = "graffito" — a controlled-vocabulary tag

<layout>
  <rs type="execution" key="graffito">Scratched</rs> on wall, below frieze.
</layout>

The execution field distinguishes graffiti from formal inscriptions, mosaics, paint, etc. How does scratched-on-pottery dedication culture differ from formally-cut stone dedication? — the kind of question this controlled value lets you frame.

execution 字段把涂刻区分于正式刻铭、马赛克、绘文等。"陶器上刮刻的奉献文化"与"石上正式刻凿的奉献"如何不同？"，受控值让这类问题得以提出。

provenance — three states

<provenance type="found" notBefore="1963" notAfter="1965">
  Found between 1963 and 1965 at Taucheira: archaic votive deposit.
</provenance>
<provenance type="observed" when="1965" resp="Boardman">
  Seen in 1965 by J. Boardman.
</provenance>
<provenance type="not-observed">
  Not seen by IGCyr team.
</provenance>

Three sequential events: found, observed by Boardman, not observed by the present editors. The negative event is itself a record — modern editorial honesty about not having had autopsy.

三个先后事件：出土、被 Boardman 观察、本版编辑未观察。"未观察"也是一项记录，当代编辑对"未亲自察看"的诚实。

The whole IGCyr corpus is licensed CC BY-NC 4.0; bulk download via Bologna's repository at doi.org/10.6092/UNIBO/IGCYRGVCYR.

整个 IGCyr 语料库采用 CC BY-NC 4.0；批量下载经博洛尼亚机构知识库 doi.org/10.6092/UNIBO/IGCYRGVCYR。

Family 2 · sample · 实例 — IOSPE

A bilingual EpiDoc — IOSPE publishes everything in Russian + English

一个双语 EpiDoc，IOSPE 把一切都同时用俄语与英语发表

The Inscriptions of the Northern Black Sea Coast (IOSPE) project is a methodologically distinctive case: every editorial field in the TEI is repeated in both Russian and English (with xml:lang="ru" and xml:lang="en"). The project's intended user-base spans two scholarly communities with little overlap; the database serves both natively.

"北黑海沿岸铭文"项目（IOSPE）是方法论上的独特案例：TEI 中的每一个编辑字段都用俄语与英语并列两版（分别带 xml:lang="ru" 与 xml:lang="en"）。项目预设的使用者跨越两个几乎不重叠的学术圈，数据库为二者同时原生服务。

<origDate notBefore="1451" notAfter="1452"
            evidence="эпиграфический_контекст" cert="low">
  <seg xml:lang="ru">1451–1452 гг.</seg>
  <seg xml:lang="en">1451–1452 C.E.</seg>
</origDate>
<provenance type="observed" subtype="autopsy">
  <seg xml:lang="ru">Non vidi.</seg>
  <seg xml:lang="en">Non vidi.</seg>
</provenance>

What this lets you do: read the same record in either language, do bilingual NLP work, search Russian terminology against English equivalents. What it costs: editorial labour roughly doubled. Who benefits: Black-Sea archaeology, where Russian-language scholarship is the largest body.

由此能做什么：同一记录可任选语言阅读；可做双语 NLP；可把俄语术语与其英语对应版互查。代价：编辑工作量约翻倍。受益者：黑海考古学界，俄语文献是该领域最大的存量。

Family 3 · 家族三 — Text-only · 纯文本

PHI and the find-the-string family

PHI 与"找字串"家族

PHI Greek Inscriptions

name

Packard Humanities Institute · Greek Inscriptions

scope

any Greek epigraphic text

records

~250,000

encoding

plain text + minimal HTML headers

images

— none —

strength

Universal text search across the Greek corpus

link

epigraphy.packhum.org ↗

Searchable Greek Inscriptions (Cornell-McCabe)

name

McCabe corpora — published in PHI infrastructure

scope

site-specific Greek corpora (Aphrodisias, Didyma, Ephesus, etc.)

records

tens of thousands across many corpora

encoding

plain text with header metadata

images

— none —

strength

Single-site, all-text searchability before TEI projects existed

link

McCabe corpora list ↗

Why these matter despite their simplicity: for the Greek epigraphic world, PHI has been the de facto search engine for 30 years. Many printed editions cite the PHI ID rather than (or alongside) the IG / SEG number. PHI's data ingestion is also not deduplicated — the same stone may appear under multiple PHI numbers because it was published twice in different series.

为何"简朴"仍然要紧：就希腊铭文世界而言，PHI 是事实上的搜索引擎，已三十年。许多印本会用 PHI 编号引用，甚至代替（或与）IG / SEG 编号。PHI 的数据收录 不去重，同一块石头可能因被两套出版物各自著录，在 PHI 中出现多个编号。

See also 另见

DB Literacy · PHI 175744 vs PHI 140601 · the same stone, twice — pluralization in action同一块石头，被收两次，复制现象
↓ canonical-greekLit case study · how Greek literature TEI differs from PHI's flat text希腊文学 TEI 与 PHI 纯文本的差异
↓ Family 6 — iPHI / Ithaca · PHI as ML training corpusPHI 作为 ML 训练语料

Family 4 · 家族四 — EAGLE / 意大利语联盟

The EAGLE federation — Italian-language partners

EAGLE 联盟，意大利语合作伙伴

EAGLE (Europeana network of Ancient Greek and Latin Epigraphy) was an EU-funded effort to harmonise Italian regional databases under a shared identifier scheme. The data infrastructure outlived the project; the federation today is a loose alliance.

EAGLE（"欧洲古希腊语与拉丁语铭文网"）是欧盟资助、试图把意大利各区域数据库纳入一套共享标识符方案下的努力。项目结束后，数据基础设施仍在运转；今天它是一个松散联盟。

Database数据库	Coverage覆盖	Records条数	Strong on强项	Link
EDR	Italy + Italian islands (Latin)	~80,000	regional Italian metadata, photographs	edr-edr.it ↗
EDB	Christian inscriptions of Rome	~40,000	Christian inscriptions, catacombs	edb.uniba.it ↗
EDV	Vienna's epigraphic database	varies	Roman provinces of Pannonia + Noricum	edv ↗
EAGLE Mediawiki	federation portal — vocabularies, ID resolver	n/a	cross-database ID resolver	eagle-network.eu ↗
Ubi Erat Lupa	Roman stone monuments (image-first)	~50,000+	photographs, especially funerary stelai	lupa.at ↗

The EAGLE promise: a single resolver that, given any partner's ID, would return the records held in all the others. The reality: partial. Some pairs of partners cross-reference reliably, others don't — so any "EAGLE-aware" query has to be coded defensively.

EAGLE 的承诺：建一个解析器，给定任一合作方的 ID，即可返回其他成员所持有的对应记录。现实：部分实现。某些合作方间交叉引用稳定，另一些则不稳定，因此任何"EAGLE 感知"的查询都必须做防御性编码。

See also 另见

DB Literacy · EDR vs OSU panel · EDR 140617 record close-upEDR 140617 记录细看
Case Study · Issue 1: 11 IDs · EAGLE partner_link as join attemptEAGLE partner_link 作为联接尝试
↑ Foundations · Controlled vocabulary · EAGLE Vocabularies in detailEAGLE 词表细节

Family 5 · 家族五 — Papyri · 莎草纸

A parallel TEI EpiDoc world — papyrology

TEI EpiDoc 的平行宇宙，莎草纸学

Papyrology and epigraphy are technically distinct (papyrus vs stone) but share the same TEI EpiDoc tooling, the same Leiden conventions, and a sharper identifier ecosystem. Their backbone is papyri.info, a single portal that aggregates DDbDP texts + HGV metadata + APIS images.

莎草纸学与铭文学技术上有别（介质是莎草纸而非石头），但共用 TEI EpiDoc 工具链、共用 Leiden 惯例，且 ID 生态更清晰。它们的中枢是 papyri.info，单一门户，聚合 DDbDP 文本、HGV 元数据、APIS 图像。

Database数据库	What it provides提供什么	IDID	Records条数
DDbDP	text editions of documentary papyri (TEI)	`cde.85.247`	~60,000
HGV	papyrological metadata (date, place, text-type)	`HGV 79368`	~70,000
APIS	papyrus images (American consortium)	institutional	~60,000
DCLP	literary papyri (Heidelberg + Duke)	TEI	~10,000+
Trismegistos	universal join key — TM-ID across all of the above	`TM 79368`	~800,000+
papyri.info	single-portal aggregation of all of the above	n/a — federated	resolves all

The lesson for inscriptions: papyrology has solved the federation problem the inscription world is still working on. A papyrus has one TM-ID; that one number resolves to text (DDbDP), metadata (HGV), and images (APIS), each maintained by a different institution. Inscription databases — even within EAGLE — don't have an equivalent universal resolver yet.

对铭文界的启示：莎草纸学已解决了铭文界仍在挣扎的"联邦化"问题。一份莎草纸只有一个 TM-ID；这一个号码解析为文本（DDbDP）、元数据（HGV）、图像（APIS），各由不同机构维护。铭文数据库，即使在 EAGLE 内部，至今没有等同的"通用解析器"。

See also 另见

↓ DDbDP record close-up · cde.85.247
↓ Papyrus body line-by-line case study
↓ papyri.info architecture diagram
↓ Identifier systems table — TM-ID role

Family 5 · sample · 实例 — DDbDP

A papyrus record — cde.85.247 with HGV + TM

一份莎草纸记录，cde.85.247 附 HGV + TM

<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="en">
  <teiHeader><fileDesc>
    <titleStmt><title>cde.85.247</title></titleStmt>
    <publicationStmt>
      <authority>Duke Collaboratory for Classics Computing (DC3)</authority>
      <idno type="filename">cde.85.247</idno>
      <idno type="ddb-hybrid">cde;85;247</idno>
      <idno type="HGV">140713</idno>
      <idno type="TM">140713</idno>
      <availability>CC BY 3.0</availability>
    </publicationStmt>
  </fileDesc></teiHeader>
  <text><body>...transcription...</body></text>
</TEI>

Notice the four identifiers in one block: filename, ddb-hybrid, HGV, and TM. Each is the join key into a different system. Most papyri carry all four. Most stone inscriptions, by contrast, carry only one (the project-local one).

注意一段里四个标识符：filename、ddb-hybrid、HGV、TM。每一个都是通往另一个系统的"联接键"。绝大多数莎草纸都同时带全部四个。相比之下，绝大多数石质铭文只带一个（项目自定的那个）。

Practical implication: if you build an analysis pipeline that joins inscription text to geography, dating, or images, you will spend most of your engineering time figuring out which records correspond. With papyri, this is a JOIN ON tm. With inscriptions, it is custom code per database.

实际意味：若你写一个把"铭文文本 → 地理 → 定年 → 图像"联起来的分析管道，绝大多数工程时间会耗在搞清楚"哪条记录对应哪条"。莎草纸只需 JOIN ON tm。铭文则要为每个数据库写一段自定义代码。

Family 6 · 家族六 — ML datasets · 机器学习数据集

iPHI / Ithaca — when an epigraphic database becomes a training corpus

iPHI / Ithaca，当铭文数据库变成模型训练语料

DeepMind's Ithaca (Assael et al., Nature 2022) is a transformer-based model that proposes restorations, geographical attributions, and dates for damaged Greek inscriptions. Its training data — iPHI — is PHI's text dump, deduplicated and re-formatted into train/val/test splits.

DeepMind 的 Ithaca（Assael 等，Nature 2022）是一个基于 transformer 的模型，可为损坏的希腊文铭文给出修复、地理归属、定年的建议。它的训练数据，iPHI：是 PHI 文本导出版，经去重并重新切分为训练 / 验证 / 测试集。

What gets kept

Greek text characters, normalised
归一化后的希腊文字符
A label for region (one of 84)
区域标签（84 选 1）
A label for date (a 50-year window)
年代标签（50 年窗口）
Span markers for damaged characters
损坏字符的位置标记

What gets dropped

Editorial certainty markup
编辑确定性标记
Material, dimensions, layout
材质、尺寸、版面
Bibliography, apparatus
参考文献、异文校勘
All non-Greek metadata
所有非希腊文元数据

The lesson: ML datasets are distilled versions of the original corpus. Ithaca's restorations are extraordinary, but the model never sees the stone, the squeeze, the photo, or the editorial commentary. When the model proposes a date, it is generalising over the patterns in the labels — including the biases in those labels.

启示：ML 数据集是原语料库的 蒸馏版。Ithaca 的修复能力非凡，但它从未看到石头、压模、照片或编辑评注。当模型给出年代时，它是在标签的模式上泛化，也包括那些标签中的偏差。

Other ML datasets in the family: PHI-Latin (Latin-side mirror, in development), Latin BERT (trained on classical Latin corpora — not just inscriptions), and various single-task fine-tunings released as Hugging Face checkpoints.

家族内其他 ML 数据集：PHI-Latin（拉丁面镜像，开发中）、Latin BERT（在古典拉丁语料上训练，不限铭文）、以及若干以 Hugging Face checkpoint 形式发布的单任务微调模型。

See also 另见

↓ Tools landscape — Ithaca + Latin BERT cards
↓ SDAM ETL chain · how raw records become a parquet for ML
Paper · § 7 the methodological commitments · honest-uncertainty stance applies to ML too"诚实不确定性"立场亦适用于 ML

Adjacent · 邻居 — canonical-greekLit

A literary text in TEI — Galen on the Natural Faculties

TEI 中的文学作品，盖伦《论自然机能》

If you work with inscriptions and you also use literary texts, you'll meet TEI in this other shape. The canonical-greekLit repository (Open Greek and Latin / Perseus) holds works of ancient Greek authors, each one a TEI XML file inside a folder structure that is itself a citation system.

若你做铭文研究、又用古典文学文本，会遇到 TEI 的另一形态。canonical-greekLit 仓库（Open Greek and Latin / Perseus 项目）收录古希腊作家作品，每一篇都是一份 TEI XML 文件，所在文件夹结构本身就是一套引用系统。

data/
  tlg0057/                                     ← Galen (TLG author #57)
    __cts__.xml                                ← textgroup metadata
    tlg010/                                    ← work #10 (On the Natural Faculties)
      __cts__.xml                              ← work-level metadata
      tlg0057.tlg010.perseus-grc2.xml          ← Greek text edition
      tlg0057.tlg010.perseus-eng2.xml          ← English translation

The CTS URN — a citation primitive

urn:cts:greekLit:tlg0057.tlg010.perseus-grc2:1.5.3
              ↑          ↑     ↑          ↑     ↑
              namespace  author work  edition   passage

Reading: "in the canonical Greek literature collection, Galen's work tlg010, the Perseus 2 Greek edition, book 1, chapter 5, section 3." This URN can be plugged into Scaife Viewer to render that exact passage.

读法："在 canonical-greekLit 集合中，盖伦作品 tlg010，Perseus 第 2 希腊文版，第 1 卷第 5 章第 3 节。"此 URN 可输入 Scaife Viewer，直接呈现该段。

How this differs from inscription EpiDoc: for literature, the structural skeleton is book / chapter / section, not edition / translation / commentary / apparatus. The same TEI namespace; different conventions about which sub-elements appear and how they nest. EpiDoc projects use <div type="edition">; literary projects use <div type="textpart" subtype="book" n="1">. Both are valid TEI.

与铭文 EpiDoc 的差别：文学作品的结构骨架是 卷／章／节，而非 edition／translation／commentary／apparatus。同一个 TEI 命名空间；不同惯例下子元素出场顺序与嵌套方式不同。EpiDoc 项目用 <div type="edition">；文学项目用 <div type="textpart" subtype="book" n="1">。两者都是有效 TEI。

Why this matters for inscription work: inscriptions sometimes quote, paraphrase, or refer to literary texts. With both sides in TEI + URNs, you can join "Pindar Ol. 13.5" mentioned in IRCyr to the actual passage in Pindar's TEI file. Without that infrastructure, you'd be doing it by hand.

这对铭文研究为何要紧：铭文有时引用、化用、提及文学文本。两端都在 TEI + URN 体系下，你可以把 IRCyr 中提到的 "Pindar Ol. 13.5" 联接到 Pindar TEI 文件中的实际段落。没有这套基础设施，就只能逐一人工处理。

Case study · 案例 — One record's journey through SDAM

From CIL X 7296 to a parquet row — the seven-step journey

从 CIL X 7296 到一行 parquet，七步旅程

A concrete walk-through of how one inscription becomes a row in the SDAM analysis-ready dataset, naming the database family at each step:

具体走一遍：一块铭文如何变成 SDAM 分析就绪数据集中的一行，每一步命名所属"家族"：

Step步	What happens发生什么	Format格式	Family家族
1	Editor (Mommsen, 1883) transcribes the stone in autopsy编辑（Mommsen，1883）实地察看后转写	printed CIL X volume印本 CIL X 卷	print authority印本权威
2	EDCS team re-keys CIL into a relational recordEDCS 团队把 CIL 重新键入关系型记录	HTML web record	F1 aggregator
3	SDAM ETL pipeline (`EDCS_ETL`) scrapes the EDCS web recordSDAM ETL（`EDCS_ETL`）抓取 EDCS 网页记录	raw HTML → JSON	F1 + ETL
4	Pipeline runs regex cleaning to produce `conservative` + `interpretive`管道跑正则清洗，产出 `conservative` 与 `interpretive`	cleaned JSON fields	F1 + ETL
5	Geographic info joined to Pleiades URI for the place将地理信息联接到 Pleiades 的地名 URI	JSON-LD enriched	linked-data
6	Date range fed into `tempun` Monte Carlo sampling把年代区间送入 `tempun` 做蒙特卡洛抽样	numpy arrays per record	SDAM-internal
7	All ~537,000 records concatenated into a parquet file~537,000 条全数并入一份 parquet	`LIST_v0-3.parquet`	F1 derivative

What survives the chain: the cleaned text, the place coordinates, a date probability distribution, the EDCS-ID, partner_link to PHI/EDR/I.Sicily/OSU. What does not survive: the editor's confidence on individual letters; the apparatus criticus; the photograph URL (often dropped); the structural distinction between Latin and Greek halves of bilingual stones.

这条链上幸存的：清洗文本、地理坐标、年代概率分布、EDCS-ID、partner_link 指向 PHI/EDR/I.Sicily/OSU。不幸存的：编辑者对单个字母的置信度；异文校勘；照片 URL（常被丢）；双语铭文中拉丁面与希腊面的结构区分。

For full mechanics, see Reference Edition § 4.5 (the actual code) or the Visual edition (the same chain visually).

完整机制参见参考版 § 4.5（实际代码）或视觉版（同一条链的视觉版本）。

Encoding · 编码 · 1/3

The shared spine — TEI EpiDoc

共同的主干，TEI EpiDoc

TEI (Text Encoding Initiative) is the XML standard for textual scholarship in the humanities. EpiDoc is its epigraphy customisation — a smaller, sharper vocabulary for writing inscriptions that preserves editorial judgement as structured tags. Most regional EpiDoc projects, papyrology databases, and even some aggregator backends speak this language.

TEI（文本编码倡议）是人文学科文本研究的 XML 标准。EpiDoc 是其铭文学版本，一套更小、更锐利的词表，用以书写铭文，并把编辑判断保留为结构化标签。绝大多数区域 EpiDoc 项目、莎草纸数据库、甚至部分聚合器后台都说这门语言。

EpiDoc tagEpiDoc 标签	What it marks标的对象	Leiden equivalentLeiden 等价物
`<supplied reason="lost">`	letters supplied by the editor (lost on stone)编辑补足的字母（石上已失）	`[ ]`
`<supplied reason="omitted">`	letters not on stone but inferred石上无、但推得	`< >`
`<unclear>`	letters partially visible仅部分可见	`ạ` (dotted)
`<gap reason="lost">`	unrecoverable lost section不可恢复的缺失段	`[- - -]`
`<expan><abbr>`	abbreviation + expansion缩略 + 展开	`D(is) M(anibus)`
`<lb n="…"/>`	line break with line number换行（带行号）	`/ 1`
`<w lemma="…">`	word with dictionary form词＋辞典形（lemma）	(no equivalent)
`<persName>` / `<placeName>`	named entity命名实体	(no equivalent)
`<date notBefore="…" notAfter="…">`	structured date range结构化年代区间	(prose only)

Why this matters: EpiDoc adds three things that Leiden conventions alone cannot — queryable certainty (you can filter for "high-confidence readings only"), queryable lemmas (you can search by lexical form, not by surface string), and queryable named entities (you can ask "how often does Diocletian appear in IRT?").

为何要紧：EpiDoc 在 Leiden 惯例之外多了三样东西，可查询的确定性（可只筛"高置信度的释读"）、可查询的词形（按辞典形而非表面拼写搜索）、以及 可查询的命名实体（可问"Diocletian 在 IRT 中出现多少次？"）。

Encoding · 编码 · 2/3

The older convention — Leiden brackets

更早的惯例，Leiden 括号

Before computers, epigraphers signalled editorial judgement with a system of brackets and dots, agreed at the 1931 Leiden conference. Almost every printed transcription you read still uses them. PHI inherits them as plain-text glyphs; EpiDoc translates them into XML tags.

计算机出现前，铭文学者用 1931 年 Leiden 会议确立的一套括号与符号系统传达编辑判断。几乎你读到的每一份印本转写都还在用它。PHI 把它们当作纯文本字符继承下来；EpiDoc 则把它们翻译成 XML 标签。

The most common signs

[abc]	letters lost, restored by editor已失，由编辑补
[- - -]	unrecoverable gap (3 letters)不可恢复缺口（3 字母）
a̲b̲c̲ or ạḅc	letters partly visible仅部分可见
(abc)	expansion of an abbreviation缩略展开
⟨abc⟩	letters omitted on stone, supplied石上漏刻，补入
{abc}	letters wrongly inscribed, deleted误刻，编辑去除
⟦abc⟧	erased on stone石上被擦除
«abc»	re-inscribed over an erasure擦除处重新刻
vac.	vacant space (uncut)空格（未刻）

Mapping to EpiDoc

[abc]      → <supplied reason="lost">abc</supplied>
⟨abc⟩      → <supplied reason="omitted">abc</supplied>
{abc}      → <surplus>abc</surplus>
⟦abc⟧      → <del rend="erasure">abc</del>
ạḅc        → <unclear>abc</unclear>
[- - -]    → <gap reason="lost" extent="3"
                 unit="character"/>
(abc)      → <expan><abbr/><ex>abc</ex></expan>
vac.       → <space unit="character"
                  extent="…"/>

A printed CIL volume page will read like a thicket of brackets. EpiDoc XML reads like a thicket of tags. The two convey the same editorial information; the second is queryable, the first is not.

一页印本 CIL 是一片"括号丛林"。一份 EpiDoc XML 是一片"标签丛林"。二者传达同样的编辑信息；后者可查询，前者不可。

Encoding · 编码 · 3/3

The flatter alternatives — Dublin Core, plain CSV, JSON-LD

更扁平的替代，Dublin Core、CSV、JSON-LD

Dublin Core

A 15-element metadata standard used by libraries and museums (DSpace, ePrints). What OSU's Knowledge Bank uses for its inscription photographs. Names objects with title, creator, date, subject, language, identifier, etc. — but doesn't model an inscription's editorial structure.

图书馆与博物馆（DSpace、ePrints）通用的 15 字段元数据标准。OSU 知识库为其铭文照片所用。给对象配 title、creator、date、subject、language、identifier 等，但不建模铭文的编辑结构。

Flat CSV / JSON

EDCS's bulk export format. One row per inscription, fixed columns. Easy to load with pandas; easy to chart; loses everything that doesn't fit a column. The SDAM ETL pipeline lives here.

EDCS 批量导出的格式。一行一条铭文，列固定。pandas 易读、易制图；放不进列里的东西都会丢失。SDAM ETL 管道就生活在这一层。

JSON-LD / Linked Data

An aspirational layer where each field carries a URI naming what kind of thing it is. Pleiades is JSON-LD. EAGLE's vocabularies aspire to be. This is the world LOD-upgrade-prompts gesture toward.

一种"愿景"层：每个字段都带 URI 标明"它是什么类型"。Pleiades 用 JSON-LD。EAGLE 的词表"志在"如此。LOD 升级提案所指向的世界。

The trade-off: richer encoding (TEI EpiDoc → Linked Data) lets you do more sophisticated queries but requires more editorial labour and a larger learning curve. Flatter encoding (CSV → Dublin Core) is easier to load into pandas / a database / a model but loses fidelity to the editor's judgement. Most projects pick one position on this spectrum and stay there; ETL pipelines like SDAM's translate from one to another, accepting the loss.

取舍：越富的编码（TEI EpiDoc → Linked Data）越能做复杂查询，但要求更多编辑投入与更陡的学习曲线。越扁平的编码（CSV → Dublin Core）越容易加载进 pandas / 数据库 / 模型，但更难忠于编辑判断。大多数项目会在这条光谱上选一个位置守住；像 SDAM 这样的 ETL 管道则在两端之间转换，并承认转换中的损失。

Linked data · 关联数据

RDF, JSON-LD, SPARQL — turning fields into a graph

RDF、JSON-LD、SPARQL，把字段变成一张图谱

Linked data treats every fact as a triple: subject — predicate — object. Instead of "the inscription EDCS-22000882 was found in Palermo," you write three things:

关联数据把每一条事实视为一个三元组：主语 — 谓语 — 宾语。"铭文 EDCS-22000882 出土于巴勒莫"被写成三件事：

<edcs:22000882> <dcterms:isFoundAt> <pleiades:462492> .   ← findspot triple
<pleiades:462492> <rdfs:label> "Panhormus" .                ← human-readable label
<pleiades:462492> <geo:lat> "38.111" .                       ← geo coordinate

Each subject and predicate is itself a URI. Anyone can add a triple about edcs:22000882 from a different system, and a query engine can join across all of them.

每个主语与谓语本身都是一个 URI。任何人可在另一系统中为 edcs:22000882 添加一条三元组；查询引擎可跨系统联接。

RDF / JSON-LD

The serialisation. RDF can be written as Turtle, RDF/XML, N-triples, or JSON-LD. JSON-LD is most common today because it looks like ordinary JSON but resolves predicates to URIs.

序列化形式。RDF 可写成 Turtle、RDF/XML、N-triples 或 JSON-LD。JSON-LD 当今最常见，看着像普通 JSON，但谓语会解析为 URI。

{"@context": {"foaf": "http://…"},
 "@id": "edcs:22000882",
 "foaf:depicts": {"@id":"pleiades:462492"}
}

SPARQL

The query language. You write a graph pattern with variables; the server matches it.

查询语言。你写带变量的图谱模式，服务器返回匹配。

SELECT ?inscription ?place WHERE {
  ?inscription foaf:depicts ?place .
  ?place geo:in_region pleiades:Sicilia .
  ?inscription rdf:type epidoc:Bilingual .
}

Where this lives in epigraphy: Pleiades exposes a SPARQL endpoint and JSON-LD serialisation. EAGLE's vocabularies are linked-data resources. Some EpiDoc projects (I.Sicily, IRT) expose JSON-LD per record. Where it does not yet live: EDCS, EDH, PHI — these speak HTML/JSON, not the graph. The "ETL into linked data" project is real but ongoing.

它在铭文学中已落地之处：Pleiades 提供 SPARQL endpoint 与 JSON-LD 序列化；EAGLE 词表本身就是关联数据资源；部分 EpiDoc 项目（I.Sicily、IRT）按条记录提供 JSON-LD。尚未落地之处：EDCS、EDH、PHI，它们说 HTML/JSON，不说图谱。"ETL 到关联数据"的工程真实存在，但仍在推进中。

Image protocol · 图像协议 — IIIF deep dive

IIIF in detail — two APIs, four institutions, one viewer

IIIF 细解，两套 API、四家机构、一个查看器

IIIF (International Image Interoperability Framework) is the standard most well-funded epigraphy projects use to publish images. It defines two APIs:

IIIF（国际图像互操作框架）是经费充足的铭文项目发布图像所用的标准。它定义两套 API：

IIIF Image API

A URL pattern that returns a region of an image at any zoom and rotation. So …/full/600,/0/default.jpg asks for "the full image, scaled to 600px wide, no rotation, default quality."

一种 URL 模式：返回图像在任意缩放与旋转下的某个区域。如 …/full/600,/0/default.jpg 表示"完整图像、缩放至 600 像素宽、不旋转、默认品质"。

.../iiif/2/ISic000470/
   full/full/0/default.jpg
.../iiif/2/ISic000470/
   2000,1500,400,400/full/0/default.jpg

IIIF Presentation API

A JSON-LD "manifest" that describes a multi-page object: title, sequence of canvases, each canvas pointing to one or more Image-API images, optionally with annotations.

一份 JSON-LD "manifest"，描述多页对象：标题、画布序列；每个画布指向一或多张 Image-API 图像，可附标注。

{
  "@context": "http://iiif.io/...",
  "type": "Manifest",
  "items": [{ "type": "Canvas", ... }]
}

Why this matters for inscription study: a IIIF manifest can pull a photograph from KCL, a squeeze scan from Heidelberg, and a 3D-derived rendering from a third institution — into a single Mirador or Universal Viewer reading. You write the manifest once; the user sees a unified object. The bibliographic problem of "where is this stone visible?" becomes the technical problem of "stitching manifests."

这对铭文研究为何要紧：一份 IIIF manifest 可同时拉取 KCL 的照片、海德堡的压模扫描、第三方机构的 3D 渲染，在 Mirador 或 Universal Viewer 中合为单一阅读体验。manifest 写一次，用户看到的是统一对象。"这块石头在哪里可见"的文献学问题，由此化为"manifest 缝合"的技术问题。

Who currently uses it: I.Sicily (full IIIF), AshLI, partial IRT, Vindolanda Tablets Online (multispectral). Who does not (yet): EDCS, EDH, PHI — these still serve plain JPGs at fixed URLs.

目前使用 IIIF 的：I.Sicily（完全 IIIF）、AshLI、部分 IRT、Vindolanda 多光谱版。尚未使用的：EDCS、EDH、PHI，仍以固定 URL 提供普通 JPG。

Citation · 引用

How to cite a digital inscription record — a working template

如何引用一份数字铭文记录，可用模板

In a paper, you cite an inscription by stacking three layers: the print authority, the digital project, and the access metadata. Here is a working template plus a real example:

论文中引用铭文，叠加三层：印本权威、数字项目、访问元数据。下面是模板与真实示例：

Template

[Print citation: corpus, volume, number]
  = [Digital project name] [project ID]
    ([URL], accessed [date], rev. [last revision date]).

Real example — the bilingual stonecutter sign

CIL X 7296 = IG XIV 297
  = I.Sicily ISic000470
    (http://sicily.classics.ox.ac.uk/inscription/ISic000470,
     accessed 2025-04-15, rev. 2024-06-12)
  = EDCS-22000882
    (https://edcs.hist.uzh.ch/?edcs-id=22000882, accessed 2025-04-15)
  = EDR140617
    (http://www.edr-edr.it/edr_programmi/res_complex_comune.php?do=book&id_nr=edr140617).

Why all three layers: the print citation is permanent (the book exists in libraries); the digital project ID lets a reader find your exact record; the access date + revision date pin which state of the data you saw. If the digital record changes, your citation still resolves to "the version I saw."

为何要三层：印本引用是永久的（图书馆里有书）；数字项目 ID 让读者能找到你看的那条记录；访问日期 + 最近修订日期固定了你所看到的数据状态。即使日后数字记录有变，你的引用仍能解析到"我当时看到的那一版"。

For computational analyses: additionally cite the dataset version. The SDAM convention is to cite the parquet release (e.g., LIST_v0-3.parquet) so any reader can re-derive your numbers from the same starting state.

计算性分析另外要引用 数据集版本。SDAM 的惯例是引用 parquet 发行版（如 LIST_v0-3.parquet），让读者能从同一起点重现你的数字。

Identifier systems · 标识符系统

The names by which a stone is known to a database

数据库给一块石头起的名字

Each project assigns its own ID. Some try to be a universal join key (TM, Pleiades for places); others are project-local (ISic, IRT, edcs-id). To find one stone in two databases you usually need a hand-curated cross-reference.

每个项目都自定义 ID。有些试图扮演通用联接键（TM；Pleiades 之于地名）；另一些是项目内部用（ISic、IRT、edcs-id）。要在两个数据库中找到同一块石头，通常需要人工维护的交叉引用。

ID systemID 系统	Run by维护方	Format格式	Purpose用途
TM	Trismegistos (Leuven)	`TM 79368`	universal join across papyri + ostraca + inscriptions
Pleiades	ISAW	`462492`	universal place identifier (Mediterranean places)
EDCS-ID	Heidelberg	`EDCS-22000882`	EDCS internal
EDH-HD	Heidelberg	`HD046800`	EDH internal
PHI numeric	Packard Humanities	`175744`	PHI internal — opaque numeric
ISic000470	Oxford / I.Sicily	seven-digit padded	I.Sicily filename + URI
IRT0001 / IRCyr A.1	KCL	project prefix + number	regional EpiDoc filenames
HGV	Heidelberg papyrology	`HGV 79368`	papyrological metadata
DDB-hybrid	Duke	`cde;85;247`	papyrological text editions
handle / DOI	various	`1811/99008`	persistent identifier for institutional repositories (OSU KB, Bologna, etc.)

The aspiration: TM-IDs as the universal join. The reality: TM coverage is extensive for papyri (~800k) but uneven for stone inscriptions; many inscriptions don't have a TM-ID at all. Pleiades is more reliable for places. The EAGLE ID resolver exists but is partial.

愿景：TM-ID 作为通用联接键。现实：TM 在莎草纸上覆盖广（约 80 万），在石质铭文上覆盖参差；很多铭文根本没有 TM-ID。Pleiades 在地名上更可靠。EAGLE ID 解析器存在，但不完整。

Images · 图像

Where a database carries pictures — and where it doesn't

哪些库带图，哪些不带

Image strategy图像策略	Used by由谁采用	Strengths优点	Limitations局限
IIIF deep-zoom	I.Sicily, IRT (partial), AshLI, IIIF-aware projects	deep zoom + cross-repository composition + stable URI	requires IIIF server; rare for older projects
JPG URL field	EDCS, EDH, EDR, Ubi Erat Lupa	simple, easy to fetch in bulk	no zoom, often low resolution, link rot
Squeeze archive scans	EDH, some museum collections	captures depth + contour	monochrome, often only one face
Photograph repository (no transcription)	OSU KB / CEPS, Berlin DAI photo archive	highest-resolution single asset	not linked to a transcription database
3D model	EM Athens collection (partial), some research projects	captures 3D form, supports surface analysis	tiny coverage; large file sizes
None	PHI, McCabe corpora, most ML datasets	n/a — pure text	no visual control over editor's reading

What this means in practice: if your research question depends on visual evidence (lettering style, layout, paint traces), you need to identify which databases hold the image for any given inscription — and accept that for many inscriptions, no public image exists at all. Image presence is itself a piece of metadata most aggregator dumps do not preserve.

实际意味：若你的研究问题依赖视觉证据（字体、版面、留色），你需要先确认哪些数据库拥有特定铭文的图像，并接受一个事实：很多铭文根本没有公开图像。"是否有图"本身就是一种元数据，而绝大多数聚合器导出版并不保留它。

Licensing · 授权

What you may legally do with the data — a brief survey

合法可做的事，一份简表

Database数据库	Licence授权	Bulk download批量下载	Commercial reuse商业再利用
I.Sicily	CC BY 4.0	GitHub repo	yes (with attribution)
IRT, IRCyr	CC BY 2.0 UK	GitHub / KCL	yes (with attribution)
IGCyr / GVCyr	CC BY-NC 4.0	Bologna repo	no (NC clause)
IAph 2007	CC BY 2.5	insaph.kcl.ac.uk	yes (with attribution)
IOSPE	CC BY 4.0	KCL	yes (with attribution)
EDH	CC BY-SA	SDAM ETL public dump	yes (with share-alike)
EDCS	site terms	SDAM ETL via partners; not officially open	grey area — check terms
PHI	site terms	not formally open	grey area
EDR / EDB	site terms / EAGLE terms	via EAGLE federation	grey area
DDbDP / HGV / APIS	CC BY 3.0	papyri.info, GitHub	yes (with attribution)
iPHI / Ithaca	Apache 2.0 (code) + CC BY (data)	GitHub + HF dataset	yes
OSU KB images	institutional terms	per-image	often no — request permission

Practical advice: for any reproducible publication, prefer projects with explicit CC licenses. The SDAM ETL pipelines navigate the grey areas by maintaining their own re-distributable derivatives, hosted on partners (sciencedata.dk) and citable independently of the original databases.

实操建议：就可复现出版而言，优先选用授权明确的 CC 项目。SDAM ETL 管道之所以能在"灰色地带"里通行，是因为它维护自有的可再分发衍生版，托管在合作平台（sciencedata.dk），并独立于原数据库进行引用。

The print ancestry · 印本的祖先

The 19th-century print authorities every database descends from

每个数据库都源自的 19 世纪印本权威

Almost every digital inscription database — even the newest ML dataset — ultimately descends from 19th-century printed editions. Knowing the print line helps you read the digital fields back to the editor who first transcribed the stone.

几乎每一个数字铭文数据库，包括最新的 ML 数据集，都最终源自 19 世纪的印本。了解印本谱系，能让你把数字字段读回到当年第一位转写这块石头的编辑。

1828–

CIG · Corpus Inscriptionum Graecarum (Boeckh, Berlin) — the Greek pole.

1853–

CIL · Corpus Inscriptionum Latinarum (Mommsen, Berlin) — the Latin pole. Still the master reference for Latin epigraphy. See an actual page below.

1873–

IG · Inscriptiones Graecae — the systematic Greek successor to CIG, organised by region.

1888

L'Année Épigraphique (AE) — annual round-up of newly published Latin inscriptions. Still appears yearly.

1892

ILS · Dessau, Inscriptiones Latinae Selectae — selected Latin inscriptions with topical organisation.

1923

SEG · Supplementum Epigraphicum Graecum — annual round-up of newly published Greek inscriptions.

1888–

BE · Bulletin Épigraphique — Robert's annual review/critique in Revue des Études Grecques.

1901–

TAM · Tituli Asiae Minoris — Asia Minor systematic publication.

1928–

MAMA · Monumenta Asiae Minoris Antiqua — Phrygia and inland Asia Minor.

1881–

CIS · Corpus Inscriptionum Semiticarum — Semitic-language inscriptions (Phoenician, Aramaic, etc.).

Relevant when reading any database: the publication or idno fields you find on a digital record will almost always include a CIL / IG / AE / SEG number. That is the digital project saying: "for the authoritative print transcription, see here." Any ambiguity in the digital text usually traces back to a disagreement among these printed authorities.

读任何数据库时都要记住：数字记录的 publication 或 idno 字段几乎总会包含一个 CIL / IG / AE / SEG 号码。这是数字项目在说："关于这块石头的权威印本转写，见此。"数字文本中的任何模糊往往可追溯到这些印本权威之间的分歧。

An actual page from each pole两极各一页

CIL X 7296 — Mommsen 1883 — CIL X 7296 (Mommsen, 1883). The Latin pole.CIL X 7296（Mommsen，1883）。拉丁极。

IG XIV 297 — Kaibel 1890 — IG XIV 297 (Kaibel, 1890). The Greek pole.IG XIV 297（Kaibel，1890）。希腊极。

Both are the same physical stone — CIL X 7296 = IG XIV 297. Same object, two corpora, two editorial styles, ~7 years apart in publication. → Full retention/loss analysis in the Case Study

两者是同一块石头，CIL X 7296 = IG XIV 297。同一对象，两套集成，两种编辑风格，相隔约 7 年出版。→ 完整保留／丢失分析在案例版

Case study · 案例 — DDbDP cde.85.247

A papyrus body close-read — a tax receipt from 144 CE

一份莎草纸本文细读，公元 144 年的税收回执

A real DDbDP record (cde.85.247 / TM 140713) — a tax receipt from the Fayum, dated by the regnal year of Antoninus Pius. Its body shows a different style of EpiDoc: dense Leiden mark-up, expansion tags, certainty markers, all packed into eight lines.

真实 DDbDP 记录（cde.85.247 / TM 140713）—— 来自法雍的一份税收回执，以安东尼努斯・庇乌斯在位年份定年。其本文展示了 EpiDoc 的另一风格：密集的 Leiden 标记、缩略展开、置信度标记，挤在八行之内。

<ab>
<lb n="1"/>ἔτους ἐβδ<supplied reason="lost">ό</supplied>μου
       <unclear>Ἀ</unclear><supplied reason="lost">ν</supplied>
       <unclear>τω</unclear>νείνου

<lb n="2"/><choice><reg>Καίσαρος</reg><orig>Καίσαξος</orig></choice>
       τοῦ κυρίου Μεχεὶρ <num value="23"><hi rend="supraline">κγ</hi></num>

<lb n="3"/><expan>διέγρ<ex>αψε</ex></expan> Ἀρείῳ
       <expan>ἐγλ<unclear>ή</unclear>μπ<ex>τορι</ex></expan>
       <choice><reg><expan>χειρωναξίο<ex>υ</ex></expan></reg>
                <orig><expan>χειροναξίο<ex>υ</ex></expan></orig></choice>
</ab>

What's happening line-by-line:

逐行解读：

<lb n="1"/> — line break, this is line 1.
<lb n="1"/>，换行，本行为第 1 行。
<supplied reason="lost">ό</supplied> — the editor inserts the omicron because the papyrus is broken there.
<supplied reason="lost">ό</supplied>，编辑补 ό，因为这里莎草纸已残。
<unclear>Ἀ</unclear> — the alpha is on the papyrus but only partially legible.
<unclear>Ἀ</unclear>，α 在纸上但仅部分可读。
<choice><reg>Καίσαρος</reg><orig>Καίσαξος</orig></choice> — the scribe wrote "Καίσαξος" (a slip), the editor regularises to "Καίσαρος."
<choice><reg>Καίσαρος</reg><orig>Καίσαξος</orig></choice>，抄手误写为 "Καίσαξος"，编辑规范为 "Καίσαρος"。
<num value="23"><hi rend="supraline">κγ</hi></num> — Greek numeral κγ (= 23) with an overline (the diagnostic mark for "this is a number, not a word").
<num value="23"><hi rend="supraline">κγ</hi></num>，希腊数字 κγ（= 23），上有横线（"这是数字而非单词"的标记）。
<expan>διέγρ<ex>αψε</ex></expan> — abbreviated as "διέγρ" on the papyrus; editor expands to "διέγραψε" (he/she paid).
<expan>διέγρ<ex>αψε</ex></expan>，纸上缩写为 "διέγρ"；编辑展开为 "διέγραψε"（"已支付"）。

Three layered information types in eight lines: the surface text (what the scribe wrote), the editor's reading (what they think the scribe meant), and the editorial certainty per character (lost / unclear / present). Translate this to flat text and you collapse all three. This is exactly what an aggregator like EDCS does to its records.

八行内三层信息：表面字（抄手所写）、编辑的释读（认为抄手"应该"写的）、按字符的编辑置信度（缺/不清/在）。把它转换成纯文本就把三层都塌成一层。这正是 EDCS 等聚合器对其记录所做的。

Genres · 体裁

Five inscription genres — how databases tag them

五种铭文体裁，数据库如何标注

Different inscription genres demand different fields. EpiDoc encodes "what kind of text" via <rs type="textType"> in the title and <keywords> in the textClass. Here are the five most common in Mediterranean epigraphy:

不同体裁的铭文需要不同字段。EpiDoc 通过标题中的 <rs type="textType"> 与 textClass 中的 <keywords> 来编码"是什么类型的文本"。地中海铭文学中最常见的五类：

Genre体裁	Hallmark特征	Example database tags数据库标签示例	Local sample本地样本
Building inscription	commemorates construction or restoration纪念建造或修复	`textType="building"`	IRCyr C24400 (entablature)
Dedication	offering to a god, honour, or person奉献给神、荣誉或人	`textType="dedication"` + `persName type="divine"`	IGCyr 002200 (kothon graffito)
Epitaph	funerary; names the deceased墓志；命名逝者	`textType="epitaph"` · EDH "tit. sep."	EDCS has thousandsEDCS 中数以千计
Caption	labels an image (mosaic, fresco)为图像作题（马赛克、壁画）	`textType="caption"` + iconography tags	IRCyr A.1 (Noah caption)
Honorific	honours a public figure or benefactor为公共人物或恩主立铭	`textType="honorific"` + persName "honoree"	IAph 010001 (J. Caesar reference)
Workshop sign	advertises craft services广告匠铺业务	`textType="workshop"` (rare)	I.Sicily ISic000470 (CIL X 7296)
Curse tablet	defixio; private ritual咒符；私人仪式	`textType="curse"` · TAM has many	— not in local sample —— 本地样本中无 —
Document	contract, receipt (mostly papyri)合同、回执（多为莎草纸）	DDbDP "document"	cde.85.247 (tax receipt)

What this changes about the database choice: if you ask a question by genre, choose a database that tags genre. EDH has a strong genre vocabulary; EDCS has weaker but still has inscr_type; PHI has none — you'd have to infer genre from the text itself. Genre questions are also where TEI EpiDoc projects' textType + iconography combination really shines.

这如何影响"该用哪个库"：若问题以体裁为单位，应选体裁标签强的库。EDH 体裁词表较强；EDCS 较弱但仍有 inscr_type；PHI 无，只能从文本推断。体裁类问题也最能展现 TEI EpiDoc 项目 textType + iconography 标签组合的价值。

Bilingualism · 双语

Bilingual inscriptions — how databases handle two languages on one stone

双语铭文，数据库如何处理一石两语

CIL X 7296 / IG XIV 297 — bilingual stonecutter's plaque — The case stone: CIL X 7296 / IG XIV 297. Greek column on left, Latin column on right, separated by a deeply incised vertical groove. *One physical object; databases will model it four different ways.*本案例石头：CIL X 7296 / IG XIV 297。左为希腊栏，右为拉丁栏，中间深凹槽。*同一实物；数据库会以四种不同方式建模。*

Bilingual inscriptions are common around the Mediterranean: Greek + Latin, Punic + Latin, Hebrew + Greek. EpiDoc handles them with parallel <div type="edition" xml:lang="…"> blocks, one per language. Aggregators handle them less gracefully.

地中海地区双语铭文很常见：希腊文＋拉丁文、布匿文＋拉丁文、希伯来文＋希腊文。EpiDoc 用并列的 <div type="edition" xml:lang="…"> 区块（每种语言一份）来处理。聚合器处理得不那么优雅。

Database数据库	Bilingual handling双语处理	Loss丢失
I.Sicily	Two `<div>` blocks, one per language; full editorial markup on each两个 `<div>` 块，每语种一个；各自完整编辑标记	none
EDCS	Both halves concatenated into one `inscription` field; the `//` separator marks the boundary两半并入同一 `inscription` 字段；以 `//` 分隔	structural separation lost
EDH	Single transcription field; primary language flagged in `language`单一转写字段；以 `language` 标主语种	secondary language often demoted
PHI	Records the Greek face only; the Latin half goes to a separate database (or no record)只录希腊面；拉丁面去别处（或无记录）	Latin half lost
EDR	Both halves; Italian-language commentary explains the bilingualism两半皆录；意大利语 commento 解释双语现象	structural distinction in prose only
OSU KB	Photograph; no transcription; bibliographic relation field cites both names照片；无转写；bibliographic relation 字段同时引两个名	no text at all

The case study you've already met (CIL X 7296 / IG XIV 297): a bilingual stonecutter's sign. I.Sicily holds it as one record with two <div> blocks; PHI splits it into two records (PHI 175744 + PHI 140601, neither linking to the other); EDCS holds both halves as one row with a // separator. The same physical stone is data-modelled four different ways.

你已熟悉的案例（CIL X 7296 / IG XIV 297）：一块双语石匠铺招牌。I.Sicily 用一条记录、两个 <div> 块容纳；PHI 拆成两条记录（PHI 175744 + PHI 140601，互不链接）；EDCS 把两半放在同一行，用 // 分隔。同一块石头，被四种不同方式建模。

Practical implication: if you query "how many bilingual inscriptions in EDCS?" you have to write a regex that detects the // separator. If you query "how many bilingual in PHI?" you can't — PHI's records are language-specific, so the two halves of one stone count as two separate records.

实际意味：若问"EDCS 中有多少双语铭文"，得写正则识别 // 分隔符。若问"PHI 中有多少双语"，则无法，PHI 记录按语种切分，同一块石头的两半被计为两条独立记录。

Bonus · 旁征

The wider TEI EpiDoc world — things you'll meet next door

更广的 TEI EpiDoc 世界，隔壁就会遇到的项目

If you work with inscriptions, you will sooner or later run into these adjacent projects. They share the same XML schema and bracket conventions; the data flows between them with surprisingly little glue.

做铭文研究迟早会撞上这些邻居项目。它们共享同一套 XML 模式与括号惯例；数据在它们之间几乎不需要"胶水"就能流动。

Greek lit corporacanonical-greekLit + First1KGreekcanonical-greekLit + First1KGreek

~1,000 first-millennium Greek authors in TEI XML, hosted on GitHub. Backbone of Scaife Viewer + Perseus.

约 1,000 位公元前一千年希腊作家的 TEI XML 全集，托管 GitHub。Scaife Viewer + Perseus 的主干。

Latin lit corporacanonical-latinLit + Lace2canonical-latinLit + Lace2

Latin literary corpora in TEI XML. Source of training data for Latin BERT and similar models.

TEI XML 中的拉丁文文学语料。Latin BERT 等模型的训练源。

Christian LatinCSEL-devCSEL-dev

Corpus Scriptorum Ecclesiasticorum Latinorum — patristic Latin in TEI.

CSEL — 教父拉丁语料的 TEI 版。

Reading interfaceScaife ViewerScaife Viewer

Web reader for canonical-greekLit / latinLit. Cite by URN; select a passage; share. The expected reading interface for these corpora.

canonical-greekLit / latinLit 的网页阅读器。URN 引用、段落选择、分享。这些语料的标准阅读界面。

Coin metadatanomisma.orgnomisma.org

Linked-data project for ancient coinage. Adjacent vocabulary for many inscription joins (mints, denominations).

古代货币的关联数据项目。在铸厂、币值等问题上与铭文项目共享词表。

GeographyPleiades + ToposTextPleiades + ToposText

Pleiades = canonical place identifiers for the Mediterranean. ToposText = ancient texts mapped to those places. Most regional EpiDoc projects link out to Pleiades for findspot.

Pleiades 是地中海地名的标准 ID 集；ToposText 把古代文本映射到这些地名。绝大多数区域 EpiDoc 项目把"出土地"链到 Pleiades。

Architecture · 架构 — papyri.info

Inside papyri.info — how a federated portal actually works

papyri.info 内部，一个联邦门户的实际工作机制

A look at how the papyri.info portal stitches together DDbDP texts, HGV metadata, and APIS images. The same architectural pattern is what inscription federation projects (EAGLE) aspire to.

看 papyri.info 如何把 DDbDP 文本、HGV 元数据、APIS 图像缝合在一起。这正是铭文联邦项目（EAGLE）所向往的架构。

          USER                        PAPYRI.INFO BACKEND
          ──────                       ──────────────────
                                       ┌────────────────────┐
   ┌────────────┐    HTTP GET          │    Solr index      │
   │  Browser   │ ───────────────────▶ │  (full-text + IDs) │
   └────────────┘                      └─────────┬──────────┘
                                                 │
                                       resolves TM-ID against:
                                                 │
                          ┌──────────────────────┼─────────────────────┐
                          ▼                      ▼                     ▼
                 ┌──────────────┐       ┌──────────────┐       ┌──────────────┐
                 │  DDbDP repo  │       │   HGV repo   │       │   APIS repo  │
                 │  TEI XML     │       │  TEI XML     │       │  IIIF + DC   │
                 │  (text)      │       │  (metadata)  │       │  (images)    │
                 └──────────────┘       └──────────────┘       └──────────────┘
                  Github / Duke         Heidelberg            many universities

The trick: the user search and the data store are decoupled. Solr indexes everything (text + metadata + image presence) and resolves results by TM-ID. Whichever repository physically holds a given record, the portal links you to it. Editing an XML file in the GitHub repository feeds back into the portal within minutes.

关键点：用户检索与数据存储是解耦的。Solr 索引一切（文本＋元数据＋图像存在性），按 TM-ID 解析结果。无论某条记录实际由哪家机构保存，门户都会把你链向它。在 GitHub 仓库编辑一份 XML 文件，几分钟后改动即在门户上可见。

What this required: (1) every record must have a TM-ID; (2) all three repositories must use the same identifier scheme; (3) the index must be rebuilt when source files change; (4) one institution must own the portal's uptime. Inscriptions don't have all four yet. EAGLE has aspired to (1) and (2); (3) and (4) are still in negotiation.

这要求：(1) 每条记录必须有 TM-ID；(2) 三个仓库必须共用同一标识符方案；(3) 源文件变动时索引必须重建；(4) 必须有一家机构对门户在线率负责。铭文界尚未同时具备这四点。EAGLE 已争取 (1)(2)；(3)(4) 还在谈。

Versions · 版本

Encoding versions move — P4, P5, EpiDoc 8, EpiDoc 9

编码版本会变，P4、P5、EpiDoc 8、EpiDoc 9

A real concern when working across projects: not all TEI is the same TEI, and not all EpiDoc is the same EpiDoc. The schema has gone through several major revisions. A query that works on one project's records may fail on another.

跨项目工作时的真实问题：不是所有 TEI 都是同一种 TEI，也不是所有 EpiDoc 都是同一种 EpiDoc。schema 已历经数次大修。在某项目记录上能用的查询，到另一项目可能失败。

Version版本	Year年份	What changed变化	Local sample本地样本
TEI P4	2002	DTD-based; original EpiDoc DTD基于 DTD；最初的 EpiDoc DTD	`tei-epidoc.dtd` in your folder
TEI P5	2007	RNG; major restructuring; namespaces cleanRNG；大重构；命名空间清晰	most current files
EpiDoc Guidelines 8	2010	migration to TEI P5; many epigraphic projects converted around this time迁移至 TEI P5；许多铭文项目在此时前后转换	IRT0001 revisionDesc records "2010-08-18 Converted from TEI P4 to P5"
EpiDoc Guidelines 9	2018	refinements; explicit handling of multilingualism, certainty精细化；多语种、置信度的显式处理	IGCyr files cite v8; IRT now cites v9.3
EpiDoc Guidelines 10	~2024	in development; better integration with linked-data vocabularies开发中；与关联数据词表更好整合	not yet seen in your sample

Practical impact: when you write a query that depends on a specific tag (e.g. <origDate notBefore="…">), it works on EpiDoc 8/9 but might fail on TEI P4 records (which use a different attribute). The honest move is to declare which schema version your query targets, and version-check before running.

实际影响：若查询依赖某具体标签（如 <origDate notBefore="…">），在 EpiDoc 8/9 上可用，但可能在 TEI P4 记录上失败（属性不同）。诚实做法是显式声明查询所针对的 schema 版本，并在运行前做版本检查。

Tools · 工具

The toolkit that makes it readable — EFES, Mirador, ITHACA, Latin BERT

让它们可读的工具集，EFES、Mirador、ITHACA、Latin BERT

A TEI XML file is hard to read directly. The ecosystem of tools that turn EpiDoc XML into HTML, into a reading interface, into a IIIF viewer, into a search index, into model training data — is what makes the data usable.

直接读 TEI XML 并不容易。把 EpiDoc XML 变成 HTML、阅读界面、IIIF 查看器、搜索索引、模型训练数据，这一整套工具生态，才让数据真正可用。

Reader EFESEFES

EpiDoc Front-End Services. A static-site generator that turns a folder of EpiDoc XML into a browsable website. Used by I.Sicily, IRT, IRCyr.

EpiDoc 前端服务。把一文件夹的 EpiDoc XML 编译为可浏览的静态网站。I.Sicily、IRT、IRCyr 都在用。

github.com/EpiDoc/EFES ↗

Image viewer MiradorMirador

A IIIF-aware image viewer. Drop in a IIIF manifest URL; you get deep-zoom, multi-window comparison, and annotations.

IIIF 感知的图像查看器。投入一个 IIIF manifest URL，你即得深缩放、多窗对比、批注功能。

projectmirador.org ↗

Image viewer Universal ViewerUniversal Viewer

Alternative IIIF viewer. Lighter; embedded in many digital library catalogues.

另一款 IIIF 查看器。更轻量；嵌入许多数字图书馆目录。

universalviewer.io ↗

Reading interface Scaife ViewerScaife Viewer

Reads canonical-greekLit + canonical-latinLit corpora. Use URN to cite a passage; share clean URLs.

读 canonical-greekLit 与 canonical-latinLit 语料。以 URN 引段；可分享干净 URL。

scaife-viewer.org ↗

ML model IthacaIthaca

DeepMind's Greek-inscription transformer. Proposes restorations, geographic attribution, dates. Web demo at ithaca.deepmind.com.

DeepMind 的希腊铭文 transformer。可给出修复、地理归属、定年建议。网页演示在 ithaca.deepmind.com。

ithaca.deepmind.com ↗

ML model Latin BERTLatin BERT

A BERT-style language model trained on classical Latin. Useful for similarity search and named-entity recognition on inscriptions and literature.

在古典拉丁语料上训练的 BERT 风格语言模型。适于在铭文与文献上做相似度检索与命名实体识别。

github.com/dbamman/latin-bert ↗

Mapping Pelagios + RecogitoPelagios + Recogito

Annotation platform for ancient places. Lets you tag texts with Pleiades place URIs and produce LD-ready annotations.

古代地名的标注平台。可用 Pleiades 地名 URI 给文本打标签，产生关联数据-就绪的标注。

recogito.pelagios.org ↗

ETL SDAM ETL pipelinesSDAM ETL 管道

37 Aarhus repositories that download, parse, clean, and merge inscription databases into analysable parquet. Python + R + Jupyter.

奥胡斯大学 37 个仓库，下载、解析、清洗、合并铭文数据库，输出可分析的 parquet。Python + R + Jupyter。

github.com/sdam-au ↗

Choosing · 选哪个

Which database for which question — a small decision tree

什么问题用哪个库，一棵小决策树

Your question你要问的	Start here从这里开始	Why为什么
"How often does X appear in Latin inscriptions?""X 在拉丁铭文中出现多频繁？"	EDCS	Largest Latin coverage; conservative + interpretive cleaning; partner_link bridges拉丁覆盖最广；含保守 + 解释清洗；有 partner_link
"Distribution of inscription types across Roman provinces""罗马行省内的铭文类型分布"	EDH	Province-aware metadata; Heidelberg keywords vocabulary行省感知的元数据；Heidelberg 关键词体系
"Greek string search across all known Greek inscriptions""在所有已知希腊铭文上做希腊字串搜索"	PHI	Largest Greek text coverage; opaque numeric IDs; no images希腊文本覆盖最大；ID 不透明；无图
"Material + dimensions + lemma analysis on a regional corpus""区域语料的材质 + 尺寸 + lemma 分析"	regional EpiDoc	TEI EpiDoc deep encoding; structured fields, citable per inscriptionTEI EpiDoc 深度编码；字段结构化、按条目可引
"Comparison with Italian regional inscriptions""与意大利区域铭文做对比"	EDR	Italian-language metadata; partner of EAGLE federation意大利语元数据；EAGLE 联盟成员
"Train a model on Greek epigraphic text""用希腊铭文文本训练模型"	iPHI / Ithaca	Pre-tokenized, deduplicated, train/val/test splits已切分 token、已去重，含训练 / 验证 / 测试集
"Cross-reference an inscription with documentary papyri""把铭文与莎草纸文献交叉引用"	papyri.info + Trismegistos	TM-ID is the join key; APIS images on the same recordTM-ID 即联接键；APIS 图像同条记录可见
"Squeeze archives + photographs for a specific stone""某块石头的压模档案 + 照片"	EDH (squeezes) + UbiErat Lupa (photos) + OSU KB	Squeeze archives are EDH's specialty; lupa.at is image-first; OSU CEPS for selected stones压模档案是 EDH 的专项；lupa.at 以图为本；OSU CEPS 选录若干石

Honest practice: for non-trivial questions, you will end up using two or three databases together. Plan the join before you plan the analysis.

诚实的做法：遇到非平凡的问题，你最终会同时用两到三个数据库。先把联接规划好，再去做分析。

The atlas · 总图

~30 databases — at a glance

约 30 个数据库，一图概览

Database数据库	Family家族	Encoding编码	~Records条数	Image图像	Licence授权
EDCS	1 Aggregator	flat record	~537,000	26.3 %	site
EDH	1 Aggregator	structured	~82,000	photo + squeeze	CC BY-SA
Ubi Erat Lupa	1 Aggregator (image)	image-first	~50,000+	core asset	site
I.Sicily	2 Regional EpiDoc	TEI EpiDoc	~3,500	IIIF	CC BY 4.0
IRT	2 Regional EpiDoc	TEI EpiDoc	~1,618	JPG	CC BY 2.0
IRCyr	2 Regional EpiDoc	TEI EpiDoc	~2,360	JPG	CC BY 2.0
IGCyr / GVCyr	2 Regional EpiDoc	TEI EpiDoc	~1,500	selective	CC BY-NC 4.0
IAph 2007	2 Regional EpiDoc	TEI EpiDoc + lemma	~1,489	JPG	CC BY 2.5
IOSPE	2 Regional EpiDoc	TEI EpiDoc bilingual	varies	selective	CC BY 4.0
InsAph / Vindolanda	2 Regional EpiDoc	TEI EpiDoc	~700	multispectral	CC BY-NC 4.0
AshLI	2 Regional EpiDoc	TEI EpiDoc	~250	IIIF	CC BY-SA
MAMA XI	2 Regional EpiDoc	TEI EpiDoc	~400	selective	CC BY-NC-SA
PHI Greek	3 Text-only	plain text	~250,000	none	site
McCabe corpora	3 Text-only	plain text	tens of K	none	site
EDR	4 EAGLE	structured / IT	~80,000	JPG	EAGLE
EDB	4 EAGLE	structured	~40,000	JPG	EAGLE
EDV	4 EAGLE	structured	varies	JPG	EAGLE
DDbDP	5 Papyri	TEI EpiDoc	~60,000	linked to APIS	CC BY 3.0
HGV	5 Papyri	TEI EpiDoc (metadata)	~70,000	— (metadata only)	CC BY 3.0
APIS	5 Papyri	institutional	~60,000	core asset	varies
DCLP	5 Papyri	TEI EpiDoc	~10,000+	linked	CC BY 3.0
Trismegistos	5 Papyri (resolver)	relational + RDF	~800,000+	— (joins only)	site / CC
papyri.info	5 Papyri (federated)	TEI + JSON-LD	federated	via APIS	CC BY 3.0
iPHI / Ithaca	6 ML	tokenized text + labels	~178,000 (post-dedup)	none	Apache 2.0 / CC BY
OSU KB / CEPS	institutional photo	Dublin Core	limited	single JPG / item	institutional
Pleiades	place gazetteer	JSON-LD	~40,000 places	— (places)	CC BY 3.0
EAGLE network	federation portal	resolver + vocabularies	n/a	n/a	varies

Counts are approximate (most rows update over time). The "Family" column lets you read this table as a folded version of the previous slides.

条数为近似值（多数随时间更新）。"家族"一列让你能把此表读作前几张幻灯的折叠版。

Recap · 收束

Three things to keep when you close this atlas

合上这本图册时请带走的三件事

Inscription databases are families, not a monolith. The six families have different goals, different encodings, different identifier systems, and different licences.
铭文数据库是若干家族，不是一块铁板。六大家族目标、编码、标识符、授权各异。
The TEI EpiDoc spine connects regional projects, papyri databases, and Greek/Latin literary corpora. If you learn to read it, you can read across the field.
TEI EpiDoc 是连接区域项目、莎草纸库、希腊／拉丁文学语料的共同主干。学会读它，就能横读整个学科。
For non-trivial questions, plan the join before you plan the analysis. Which databases? Which join key? Where will the join fail, and how will you know?
遇到非平凡问题，先规划联接，再规划分析。用哪些库？以哪个 ID 联？联在哪里会断？怎么知道断了？

Where next: the Database Literacy tutorial walks one stone across six landings; the Case Study shows seven concrete data-disparity issues; the Visual edition introduces the SDAM ETL pipelines that transform aggregator data into analysable parquet; the Reference covers all 37 SDAM repositories.

下一站：数据库识读把一块石头沿六个落点走一遍；案例展示七个具体的数据差异问题；视觉版介绍 SDAM ETL 管道；参考版覆盖 SDAM 全部 37 个仓库。

Sources · 资料来源

Project URLs & key sources

项目网址与重要来源

Aggregators & federation

Regional EpiDoc

Text-only

Italian / EAGLE

Papyri

ML / NLP

Encoding & tooling

编码与工具

EpiDoc Guidelines ↗ · TEI XML for inscriptions
Leiden conventions ↗ · the printed bracket system
Pleiades ↗ · place identifiers
Scaife Viewer ↗ · canonical Greek + Latin reading interface
github.com/sdam-au ↗ · the SDAM ETL pipelines (37 repos)

Format observations on this atlas were drawn directly from sample TEI XML files in the local epidoc folder (Aphrodisias, IRT, IRCyr, IGCyr, IOSPE, DDbDP) and from the SDAM ETL bulk dumps for EDH and EDCS.

本图中的格式观察直接取自本地 epidoc 文件夹中的 TEI XML 样本（Aphrodisias、IRT、IRCyr、IGCyr、IOSPE、DDbDP），以及 SDAM ETL 中 EDH 与 EDCS 的批量导出。

Where the inscriptions actually live.

铭文 真正 居住在哪里。

Behind every claim about "N inscriptions show…" stands a database, an encoding, a license, a scope

每一句"N 件铭文显示……"的背后，都站着一个数据库、一种编码、一份授权、一块范围

What is XML?

什么是 XML？

The structure

The vocabulary

What are JSON, CSV, and tabular data?

什么是 JSON、CSV 与表格型数据？

CSV

JSON

Parquet

What is a schema? — RNG, DTD, Schematron

什么是 schema？RNG、DTD、Schematron

What you actually see at the top of an EpiDoc file

URI, URL, handle, DOI, ARK — names that should not break

URI、URL、handle、DOI、ARK，不该断的名字

What is an API? — REST, OAI-PMH, SPARQL

什么是 API？REST、OAI-PMH、SPARQL

REST + JSON

OAI-PMH

SPARQL endpoint

Direct download

What is a controlled vocabulary?

什么是 受控词表？

EpiDoc taxonomies

EAGLE Vocabularies

Six families — read this, then the rest of the atlas tiles in

六大家族，读完这页，全图就立起来了

The wide-coverage databases

广覆盖数据库

EDCS

EDH

UBI ERAT LUPA

The deep, narrow corpora

深而窄的语料库

The <msDesc> block — how the stone is described

<msDesc> 区块，石头是如何被描述的

<physDesc> — material, dimensions, lettering

<physDesc>，材质、尺寸、字形

<layoutDesc> — where the letters sit on the surface

<layoutDesc>，字母在表面的位置

<history> — when made, when found, where now

<history>，何时所作、何时被发现、现存何处

Certainty markers — honest about what we don't know

置信度标记，对未知保持诚实

Named entities — people, places, gods

命名实体，人、地、神

The apparatus criticus — recording editorial disagreements

异文校勘记，记录编辑间的分歧

<revisionDesc> — the digital record's biography

<revisionDesc>，数字记录自身的传记

A regional EpiDoc record up close — IRT0001

看一份区域 EpiDoc 记录，IRT0001

The deep encoding extreme — word-by-word lemmatization in IAph

深度编码的极致，IAph 中逐词的 lemma 标注

A Christian mosaic caption — "Noah and the Dove" at Apollonia

基督教马赛克 caption，阿波罗尼亚的"诺亚与鸽"

Identifiers

Genre + content

Layout — text and image as a single iconographic field

Dating — sixth century, by mosaic style

Keywords — formal classification

An archaic graffito — a kothon scratched ca. 500 BCE

一段古风涂刻，约公元前 500 年的 kothon 刻文

physDesc — fragmentary support

handDesc — letter forms as evidence

execution = "graffito" — a controlled-vocabulary tag

provenance — three states

A bilingual EpiDoc — IOSPE publishes everything in Russian + English

一个双语 EpiDoc，IOSPE 把一切都同时用俄语与英语发表

PHI and the find-the-string family

PHI 与"找字串"家族

PHI Greek Inscriptions

Searchable Greek Inscriptions (Cornell-McCabe)

The EAGLE federation — Italian-language partners

EAGLE 联盟，意大利语合作伙伴

A parallel TEI EpiDoc world — papyrology

TEI EpiDoc 的平行宇宙，莎草纸学

铭文真正居住在哪里。

什么是受控词表？

The `<msDesc>` block — how the stone is described

`<msDesc>` 区块，石头是如何被描述的

`<physDesc>` — material, dimensions, lettering

`<physDesc>`，材质、尺寸、字形

`<layoutDesc>` — where the letters sit on the surface

`<layoutDesc>`，字母在表面的位置

`<history>` — when made, when found, where now

`<history>`，何时所作、何时被发现、现存何处

`<revisionDesc>` — the digital record's biography

`<revisionDesc>`，数字记录自身的传记