A guided tour of the major projects that publish ancient-Mediterranean inscriptions on the web — the families they belong to, the encoding conventions they use, the identifiers that name them, and the seams where they fail to fit together.
古地中海铭文在网络上的主要发布项目导览,它们所属的"家族"、采用的编码惯例、用以命名记录的标识符系统,以及它们彼此对接处的缝隙。
25 slides · 6 families · ~30 databases · use ← → to navigate · click any underlined term
25 张幻灯 · 6 个家族 · 约 30 个数据库 · 使用 ← → 翻页 · 点击带下划线的术语
When you read a sentence like "of 12,000 surviving inscriptions from Roman North Africa, 38% mention an emperor", the number 12,000 is the residue of a chain of choices — which database the author queried, which fields they filtered, which encoding the editors used to mark "imperial reference," what license allowed bulk download, and which inscriptions fell outside the chosen scope.
当你读到一句话:"罗马北非现存的一万两千件铭文中,有 38% 提及元首",这个一万二的数字背后,是一条选择链:作者查的是哪个数据库?过滤了哪些字段?编辑用什么编码标记"元首指称"?什么授权允许了批量下载?哪些铭文落在了所选范围之外?
The thesis of this atlas: there is no single "the inscription database." There are families of projects with different goals (regional surveys, aggregators, text-search, ML training), different encodings (TEI EpiDoc, plain text, Dublin Core), different identifier schemes, and different licenses. Knowing the map helps you know what you can responsibly say.
本地图的论点:不存在"那个"铭文数据库。世界上有若干 家族:它们目标不同(区域调查、聚合、文本检索、机器学习训练)、编码不同(TEI EpiDoc、纯文本、Dublin Core)、标识符系统不同、授权不同。看懂地图,才能知道哪些话可以负责任地说。
This atlas covers the major projects that the SDAM ETL pipelines (and most digital epigraphy work) draw on. It does not claim completeness — there are smaller and newer projects — but the families it identifies are stable.
本图覆盖 SDAM ETL 管道(以及大多数数字铭文学工作)所依赖的主要项目。它不声称完备,还有更小、更新的项目,但所标识的"家族"分类是稳定的。
Disciplinary lineage (skim before reading the families): the print authority chain from CIL (Mommsen 1853–) and IG (Boeckh 1828–) → PHI CD-ROMs (1985) → web databases (EDH, EDCS, 1995–) → TEI EpiDoc + IIIF (2005–) → open-data + reproducibility (FAIR, 2015–). → Full lineage walk-through in Paper § 1.5学科谱系(在读各家族前略览):印本权威链 CIL(Mommsen 1853–)与 IG(Boeckh 1828–)→ PHI 光盘(1985)→ Web 数据库(EDH、EDCS,1995–)→ TEI EpiDoc + IIIF(2005–)→ 开放数据 + 可复现(FAIR,2015–)。→ 完整谱系见论文版 § 1.5
XML (eXtensible Markup Language) is a way of writing structured data so that both humans and computers can read it. Every piece of content sits inside a labelled box — a tag — and tags can contain other tags, like Russian dolls. The labels carry meaning: a computer reading <date>1883</date> knows the number is a date, not just a string.
XML(可扩展标记语言)是一种把结构化数据写成人与计算机都能读懂的格式。每段内容都装进一个有"标签"的盒子里,盒子可以套盒子,像俄罗斯套娃。标签本身携带语义:计算机读到 <date>1883</date>,就知道里头的数字是"年份",而不仅仅是一串字符。
<inscription>
<title>Stonecutter's sign</title>
<date>1st century CE</date>
<text language="latin">
Tituli heic ordinantur
</text>
<text language="greek">
στῆλαι ἐνθάδε τυποῦνται
</text>
</inscription>
<title>…</title><title>…</title>language="latin"language="latin"<lb n="2"/><lb n="2"/>tei:bodytei:bodyWhy epigraphy uses XML: editorial markup needs more than plain text. You need to mark "this letter was supplied by the editor" or "the date is between 117 and 138 CE with low confidence." Plain text cannot do this; XML can.
铭文学为何选 XML:编辑标记需要的不只是纯文本。你需要标"这一字母为编辑所补",或"年代为公元 117–138 年,低置信度"。纯文本做不到,XML 可以。
Note for low-coding-literacy readers: XML files are just text files. You can open them in any text editor (Notepad, TextEdit, VS Code). The angle brackets are not magic — they are just a convention so a computer can find where one piece of information ends and the next begins.
给非程序员的提示:XML 文件就是纯文本文件,任何文本编辑器(记事本、TextEdit、VS Code)都能打开。尖括号没有魔法,只是一种约定,让计算机知道一段信息从哪开始、到哪结束。
Comma-Separated Values. The simplest tabular format: one row per line, columns separated by commas. Excel and pandas read it natively.
逗号分隔值。最简单的表格格式:一行一条记录,列用逗号分隔。Excel 和 pandas 都能直接读。
id,province,date_from,date_to EDCS-22000882,Sicilia,-100,100 EDCS-22000883,Africa,150,200
JavaScript Object Notation. Records as key-value dictionaries; nested allowed. Good for APIs and structured records that don't fit a flat table.
JavaScript 对象表示法。记录写成"键-值"字典,可嵌套。适合 API 与"摆不平为表格"的结构化记录。
{
"id": "EDCS-22000882",
"province": "Sicilia",
"date": {"from": -100, "to": 100},
"text": ["Tituli heic", "στῆλαι ἐνθάδε"]
}
A binary tabular format optimised for analysis at scale. SDAM's distribution format. Loads ~10× faster than CSV; readable in pandas, Arrow, R.
为大规模分析优化的二进制表格格式。SDAM 的分发格式。比 CSV 加载快约 10 倍;pandas、Arrow、R 都能读。
import pandas as pd df = pd.read_parquet( 'LIST_v0-3.parquet' ) df.shape # (~600000, 50)
The trade-off vs XML: CSV/JSON/Parquet are flatter than XML. They lose nested editorial markup but gain massive speed and direct loadability into analysis tools. Most "data analysis on inscriptions" works on a CSV/Parquet derivative; the original TEI XML stays as ground truth.
与 XML 的取舍:CSV / JSON / Parquet 比 XML 扁平。它们丢失了嵌套的编辑标记,但获得了极快的加载速度,能直接进分析工具。绝大多数"对铭文做的数据分析"实际是在 CSV / Parquet 衍生版上跑的;TEI XML 原件留作"真值"。
Real-world example: the SDAM ETL pipeline takes EDCS's web records, parses them, runs regex cleaning, and outputs EDCS_merged_cleaned_attrs_2022-09-12.json (537,262 records, 50 fields). That JSON dump is what every downstream analysis loads. The TEI XML remains in I.Sicily / IRT / IRCyr GitHub repos for any reader who wants to verify the pipeline's choices.
真实例子:SDAM ETL 管道抓取 EDCS 网页记录、解析、跑正则清洗,输出 EDCS_merged_cleaned_attrs_2022-09-12.json(53.7 万条、50 字段)。下游所有分析都是基于这份 JSON。TEI XML 原件保留在 I.Sicily / IRT / IRCyr 的 GitHub 仓库中,供想核查管道决策的读者使用。
A schema is a contract that says: "an XML document is valid if and only if it follows these rules." It defines which elements may appear, in what order, with which attributes, and with which content. Without a schema, two editors will quietly use different conventions and the data will not be queryable.
schema(模式)是一份契约:"此 XML 文档若且仅若遵守这些规则才算有效"。它规定:哪些元素可以出现、以什么顺序、带哪些属性、内容是什么。没有 schema,两位编辑会暗中采用不同惯例,整套数据就无法查询。
| Schema language模式语言 | Used for用途 | In epigraphy在铭文学中 |
|---|---|---|
| DTD (Document Type Definition) | oldest; defines elements + attributes | Original TEI EpiDoc DTD (P4) — superseded but still seen in older files |
| RNG (RELAX NG) | more expressive; richer constraints | The current TEI EpiDoc schema. tei-epidoc.rng |
| Schematron | rule-based assertions ("if X, then Y must…") | Project-specific checking rules. e.g. ircyr-checking.sch |
| XSD (XML Schema) | W3C standard; strong typing | Used by some library systems, less common in epigraphy |
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://www.stoa.org/epidoc/schema/dev/tei-epidoc.rng"
schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="http://www.stoa.org/epidoc/schema/dev/ircyr-checking.sch"
schematypens="http://purl.oclc.org/dsdl/schematron"?>
Reading this: the file declares two schemas. The first (RNG) says "this file follows the EpiDoc structure." The second (Schematron) says "this file additionally follows the IRCyr project's checking rules." A validator runs both — RNG catches structural errors; Schematron catches "you used a Cyrene placeRef but the geographic region must then be Cyrenaica."
读法:文件声明了两份 schema。第一份(RNG)说"本文件遵循 EpiDoc 结构";第二份(Schematron)说"本文件额外遵循 IRCyr 项目的检查规则"。校验器会同时跑两者,RNG 抓结构错误;Schematron 抓"你用了 Cyrene 的 placeRef,那么地理区域必须是 Cyrenaica"这类业务规则。
Why this matters: EpiDoc XML across different projects (I.Sicily, IRT, IRCyr) is interoperable because they all share the same RNG schema. Project-specific Schematron rules add local discipline without breaking that interoperability.
这为何要紧:不同项目(I.Sicily、IRT、IRCyr)的 EpiDoc XML 之所以可互操作,是因为它们共用同一份 RNG schema。项目自有的 Schematron 规则在不破坏互操作性的前提下,叠加了"本地纪律"。
A digital scholarly resource is only as durable as the name you cite it by. Different naming systems trade off readability, persistence, and institutional commitment.
数字学术资源的可持续性,取决于你引用它时所用的名字。不同的命名系统在"可读性""持久性""机构承诺"之间各有取舍。
| System系统 | Example例子 | Strength优点 | Weakness弱点 |
|---|---|---|---|
| URL (a web address) | http://sicily.classics.ox.ac.uk/inscription/ISic000470 | readable, clickable | breaks if the server moves; link rot is the rule |
| URI (broader concept) | any URL or URN | universal address | does not by itself promise resolution |
| Handle | hdl.handle.net/1811/99008 | institutional commitment to keep resolving | requires institution to maintain handle service |
| DOI | 10.1515/jdh-2021-1004 | publisher-backed, citable in scholarship | cost to register; mostly publishers, not databases |
| ARK | ark:/12345/foo | library-grade, no fees | less common in epigraphy than libraries |
| URN:CTS | urn:cts:greekLit:tlg0057.tlg010 | canonical citation for Greek/Latin literature | specific to the canonical-greekLit / Scaife world |
| TM-ID | TM 79368 | universal join key for ancient world | not a URL by itself; resolves at trismegistos.org |
| Pleiades URI | pleiades.stoa.org/places/462492 | canonical place identifier | limited to Mediterranean places |
For your own work: when citing an inscription in a paper, prefer the most stable identifier the database offers. For TEI EpiDoc projects, this is usually a project URI like http://sicily.classics.ox.ac.uk/inscription/ISic000470. For records that descend from print, also cite the print edition (CIL X 7296) — print citations remain stable when websites disappear.
对你自己的工作:论文中引用铭文时,优先选数据库提供的最稳定标识符。对 TEI EpiDoc 项目通常是 http://sicily.classics.ox.ac.uk/inscription/ISic000470 这样的项目 URI。对源自印本的记录,同时引印本(CIL X 7296)—— 网页消失时,印本引用仍稳。
The honest reality: a 2018 study found that ~10–15% of cited URLs in classical-studies articles are dead within 5 years. Schemes like DOI and handle exist to push that rate down — but they cost money and require institutional maintenance.
诚实的现实:2018 年一项研究发现,古典学论文中所引 URL 在 5 年内"失效"的比例约为 10–15%。DOI、handle 等方案就是为压低这个比例而存在,但需要持续的资金与机构维护。
An API (Application Programming Interface) is a way for one program to ask another for data, structured so the answer comes back in a predictable format. Most epigraphy databases offer one or more APIs alongside their human-readable web pages.
API(应用程序接口)是一个程序向另一个程序请求数据的方式,请求与响应都按可预测的格式来。大多数铭文数据库都在面向人类的网页之外提供一个或多个 API。
The most common API style. You send an HTTP GET to a URL; the server returns JSON. Used by I.Sicily, papyri.info, OSU Knowledge Bank.
最常见的 API 风格。你向某个 URL 发 HTTP GET,服务器返回 JSON。I.Sicily、papyri.info、OSU 知识库都用这种。
GET https://kb.osu.edu/server/api/core
/items/88887477-ad90-4c87-b9dd-…
→ {"name": "CIL 10.7296/IG 14.297…",
"metadata": {…},
"_links": {…}}
An older, library-grade API for harvesting full record sets in batches. Used by some archives and DSpace repositories.
较早的、图书馆级 API,用于按批量收割整套记录。一些档案库与 DSpace 知识库使用。
GET …/oai-pmh?verb=ListRecords
&metadataPrefix=oai_dc
→ <ListRecords>
<record>…</record>
<record>…</record>
</ListRecords>
A query language for linked data. You write a graph pattern and the server returns matching subjects. Pleiades and a few EAGLE-aware projects expose SPARQL.
关联数据的查询语言。你写一段图谱模式,服务器返回匹配项。Pleiades 与几个 EAGLE 感知项目提供 SPARQL endpoint。
SELECT ?place ?name WHERE {
?place rdfs:label ?name .
?place pleiades:type "settlement" .
?place geo:in_region pleiades:Sicily .
}
Many projects skip APIs entirely and publish bulk dumps as zip files on GitHub or institutional repositories. Less elegant, but reliably forever-archived.
许多项目干脆跳过 API,把批量数据以 zip 形式发布到 GitHub 或机构知识库。不够优雅,但更可靠地"永久存档"。
git clone https://github.com/
ISicily/ISicily
# → 3,500 TEI XML files,
# one per inscription
For SDAM and similar projects: the ETL pipelines combine all of these. They scrape HTML where there's no API, hit REST endpoints where one exists, harvest OAI-PMH from DSpace, and pull GitHub bulk dumps where projects publish them. No single API handles the field; ETL stitches the methods.
对 SDAM 等项目而言:ETL 管道把以上方法结合使用。无 API 处抓取 HTML,有 REST 处调 REST,DSpace 处用 OAI-PMH 收割,凡 GitHub 公开批量数据处直接拉取。没有任何一种单一 API 能覆盖整个领域;ETL 才是把它们缝合起来的工艺。
A controlled vocabulary is a fixed list of allowed values for a field. Instead of letting an editor type "limestone" or "Limestone" or "Lst." or "lapis", you say: "the allowed values for material are this list of 14 lithic types." Queries become reliable; spelling variants stop polluting counts.
受控词表是某字段允许取值的固定列表。与其让编辑随手写 "limestone"、"Limestone"、"Lst." 或 "lapis",不如规定:"字段 material 的允许取值是 下列 14 种岩性。"于是查询变得可靠,拼写变体不再污染计数。
<classDecl><taxonomy>
<category xml:id="photograph">
<catDesc>Digital or digitized
photographs</catDesc>
</category>
<category xml:id="representation">
<catDesc>Digitized other
representations</catDesc>
</category>
</taxonomy></classDecl>
The IRCyr project declares its own taxonomy of figure types inside the file's encodingDesc.
IRCyr 项目在文件 encodingDesc 内声明自有的"图像类型分类"。
A federation-wide effort. Five shared vocabularies (object types, materials, decorations, dating criteria, writing types), each with multilingual labels. Reused across EDR, EDB, EAGLE network records.
联盟级协作。五套共享词表(对象类型、材质、装饰、定年判据、书写类型),每套都有多语标签。被 EDR、EDB、EAGLE 网络记录共用。
e.g. http://www.eagle-network.eu/voc/objtyp/lod/93 = "altar"
如 http://www.eagle-network.eu/voc/objtyp/lod/93 = "altar"(祭坛)
| Field字段 | Common controlled vocabularies常见受控词表 |
|---|---|
| material | EAGLE materials, IRCyr materials, project-local lists |
| object type | EAGLE object-types, AAT (Getty's Art & Architecture Thesaurus) |
| place | Pleiades (~40,000 ancient places), GeoNames (modern) |
| language | ISO 639 codes (la, grc, xpu) |
| script | ISO 15924 codes (Latn, Grek) |
| person | VIAF (Virtual International Authority File), Lexicon of Greek Personal Names (LGPN) |
| execution | Stoa execution-types vocabulary (scalpro, tessellis, graffito…) |
For users of the data: when joining records across databases, the join is most reliable on controlled-vocabulary fields. EAGLE's vocabularies are why an EDR record can be lined up with an EDB record. Free-text fields ("commentary," "notes") cannot be joined that way.
对数据使用者:跨数据库联接时,受控词表字段最可靠。EAGLE 词表正是 EDR 记录能与 EDB 记录对齐的原因。自由文本字段("commentary"、"notes")则无法这样联接。
Click any tile to jump to that family's detailed slide.
点击任一卡片,跳到该家族的详细幻灯。
How they relate: Family 2 produces the deepest records; Family 1 swallows them and re-publishes with shallow joins; Family 3 ingests text only; Family 4 mirrors Family 1 in Italian; Family 5 is a parallel TEI EpiDoc world for papyri; Family 6 distills text for ML training.
彼此关系:家族二产出最深的记录;家族一吞下后浅联接重发;家族三只取文本;家族四是家族一的意大利语镜像;家族五是莎草纸学的 TEI EpiDoc 平行宇宙;家族六把文本蒸馏为 ML 训练数据。
What aggregators are good at: "find me all Latin inscriptions that mention X" — recall-first, breadth-first. What they are not: a faithful witness to any one stone. The deeper you go on a single record, the more you should be reading the original publication, not the aggregator's row.
聚合器擅长的:"给我所有提到 X 的拉丁铭文",召回率优先、广度优先。它们不擅长的:对任何单一石头的忠实见证。越深入单条记录,就越应该回到原始出版物,而不是聚合器的那一行。
Each project covers one region or one corpus, in TEI EpiDoc XML, with rich editorial metadata. Most are in the King's College London family (or its diaspora) and share a common structural vocabulary.
每个项目覆盖一个地区或一种语料,以 TEI EpiDoc XML 编码,附丰厚的编辑元数据。多数项目出自伦敦国王学院家族(及其海外分支),共享一套结构化词表。
| Project项目 | Coverage范围 | ~records条数 | Languages语言 | Licence |
|---|---|---|---|---|
| I.Sicily | Sicily | ~3,500 | Greek, Latin, Punic, Hebrew | CC BY 4.0 |
| IRT | Roman Tripolitania | ~1,618 | Latin, Greek, Punic, Neo-Punic | CC BY 2.0 UK |
| IRCyr | Roman Cyrenaica (Libya) | ~2,360 | Greek, Latin | CC BY 2.0 UK |
| IGCyr / GVCyr | Greek Cyrenaica | ~1,500 | Greek | CC BY-NC 4.0 |
| IAph 2007 | Aphrodisias | ~1,489 | Greek, Latin | CC BY 2.5 |
| IOSPE | Northern Black Sea Coast | varies (multi-volume) | Greek, Russian metadata | CC BY 4.0 |
| InsAph / Vindolanda | Vindolanda tablets | ~700 | Latin (cursive) | CC BY-NC 4.0 |
| AshLI | Ashmolean Latin inscriptions | ~250 | Latin, Greek | CC BY-SA |
| MAMA XI | Phrygia & Lykaonia | ~400 | Greek, Latin | CC BY-NC-SA |
| EpiDoc First1KGreek | various Greek-corpus projects | varies | Greek | CC BY 4.0 |
What they share: the TEI EpiDoc schema; structured fields for material, dimensions, find-spot, dating, transcription with editorial certainty markers; bibliography and apparatus criticus. Records are typically one XML file per inscription, hosted on GitHub, citable by stable URI.
它们共享的是:TEI EpiDoc 模式;为材质、尺寸、出土地、定年、附"编辑确定性"标记的转写而设的结构化字段;参考文献与异文校勘记。记录通常是"一条铭文 = 一份 XML",托管在 GitHub,可通过稳定 URI 引用。
msDesc deep-dive · physDesc · layoutDesc · history · certainty · entities · apparatus · revisionDesc<msDesc> block — how the stone is described<msDesc> 区块,石头是如何被描述的msDesc ("manuscript description," but used for inscribed objects too) is the spine of every EpiDoc record. It nests four blocks:
msDesc("manuscript description",写本描述,但也用于铭文对象)是每条 EpiDoc 记录的主干。它嵌套四个区块:
<msDesc> <msIdentifier> ← where the object lives now (museum, collection, inv. number) <repository ref="institution.xml#db932">Apollonia Museum</repository> </msIdentifier> <msContents> ← what kind of text is on it (a one-line summary) <summary><seg>Dedication.</seg></summary> </msContents> <physDesc> ← physical: material, dimensions, layout, lettering <objectDesc> ... </objectDesc> <handDesc> ... </handDesc> </physDesc> <history> ← origin (when/where made), provenance (when/where seen) <origin> ... </origin> <provenance type="found"> ... </provenance> <provenance type="observed"> ... </provenance> </history> </msDesc>
What you can ask, given this structure: "show me all inscriptions on limestone, found at Cyrene, between 100–300 CE, currently held at the Apollonia Museum." Each clause maps to one nested element. Without msDesc, this query would be impossible without the editor doing manual filtering.
有了这套结构,你能问:"列出所有材质为石灰岩、出土于昔兰尼、年代在 100–300 CE 之间、当前藏于 Apollonia 博物馆的铭文。"每一个子句都映射到一个嵌套元素。没有 msDesc,这种查询离不开编辑的人工筛选。
Notice the ref="institution.xml#db932": repository names are not free text — they refer to a project-internal authority file. This is how EpiDoc projects keep "Apollonia Museum" written exactly the same way across thousands of records. Other commonly externalized lists: execution.xml, findspot.xml, material-search.xml.
注意 ref="institution.xml#db932":repository 名不是自由文本,而是引用项目内部的权威文件。这正是 EpiDoc 项目能在几千条记录中把"Apollonia Museum"写得完全一致的方法。其他常被外置的列表:execution.xml、findspot.xml、material-search.xml。
<physDesc> — material, dimensions, lettering<physDesc>,材质、尺寸、字形A real physDesc from IRCyr's record C24400 (a fragmentary building inscription from the temple of Artemis at Cyrene):
来自 IRCyr 记录 C24400(昔兰尼阿尔忒弥斯神庙的一段残断建筑铭文)的真实 physDesc:
<physDesc>
<objectDesc>
<supportDesc>
<support>
<p><material>Limestone</material> block with moulding above,
from a Doric <objectType>entablature</objectType>,
broken in two and damaged at the lower right corner
(<dimensions>
<width>1.03</width>
<height>0.59</height>
<depth precision="low">0.44</depth>
</dimensions>).</p>
</support>
</supportDesc>
<layoutDesc>
<layout><rs type="execution" key="scalpro">Inscribed</rs> on one face.</layout>
</layoutDesc>
</objectDesc>
<handDesc>
<handNote>Second century: <height>0.07</height>; ligature in line 2</handNote>
</handDesc>
</physDesc>
What this records:
记录了什么:
material = Limestone (could be controlled vocab)material = 石灰岩(可由受控词表)objectType = entablature, with prose context "from a Doric…"objectType = 檐部,附上下文"从一座多立克式…"dimensions = 1.03m × 0.59m × 0.44m. The depth has precision="low" — the editor was uncertain.dimensions = 1.03m × 0.59m × 0.44m。深度带 precision="low",编辑者不太确定。execution with key="scalpro" — a controlled-vocab term (sculpted/incised). External taxonomy.execution 带 key="scalpro",受控词表中的"刻凿"项。外置词表。height = 0.07m, plus a free-text note about ligatures.height = 0.07m,加自由文本备注(连字)。Compare to EDCS for the same kind of object: EDCS would record material as a single string like "lapis", dimensions are absent entirely, and lettering height is absent. The depth + breadth trade-off again — IRCyr knows much more per record but is much smaller.
与 EDCS 对照:EDCS 会把材质记成一串 "lapis";尺寸字段完全没有;字母高度也无。深 vs 广的取舍再次显形,IRCyr 单条记录懂得多得多,但总条数小得多。
<layoutDesc> — where the letters sit on the surface<layoutDesc>,字母在表面的位置"Layout" in epigraphy means: how is the inscribed text spatially arranged on the object? Is it in one block or several? Is there an iconographic field next to it? Are there frames, dividers, vacant spaces? IRCyr's record A00300 (a Christian mosaic caption for an image of Noah) shows what this looks like:
铭文学中的"layout"指:刻文如何在对象表面空间分布?是一段还是数段?旁边有无图像区?是否有边框、分隔、空白?IRCyr 记录 A00300("诺亚释鸽"马赛克的希腊文 caption)展示了这是什么样:
<layoutDesc>
<layout>
<rs type="execution" key="tessellis">Mosaic</rs> lettering, in a panel
representing <rs type="iconography">Noah</rs> as he releases the
<rs type="iconography">dove</rs> from the <rs type="iconography">Ark</rs>.
</layout>
</layoutDesc>
The layered information: execution="tessellis" = "made of mosaic tesserae" (controlled vocabulary). Three iconographic fields tagged: Noah, dove, Ark. Each iconography tag is a hook for cross-image searches: "show me all Christian mosaics in IRCyr that depict the dove."
分层信息:execution="tessellis" = "由马赛克小方砖拼成"(受控词表)。三个图像学标签被打上:诺亚、鸽子、方舟。每个 iconography 标签都是一条跨图像检索的"钩子":可问"列出 IRCyr 中所有描绘鸽子的基督教马赛克"。
What no aggregator captures: the relation between text and image. PHI strips the iconographic context entirely. EDCS records mosaic inscriptions as plain text. Only EpiDoc projects keep the spatial-iconographic field. If you want to study how Christian inscriptions were embedded in their visual programmes, you must work in TEI EpiDoc.
聚合器从不捕获的:文字与图像的关系。PHI 把图像语境彻底剥离;EDCS 把马赛克铭文记成纯文本。只有 EpiDoc 项目保留了"空间-图像场"这一字段。若你想研究基督教铭文如何嵌入其视觉方案中,就必须在 TEI EpiDoc 上工作。
<history> — when made, when found, where now<history>,何时所作、何时被发现、现存何处Inscriptions have biographies. They are made at one time and place; they are found later, sometimes much later; they are then moved (often more than once) and may be on display, in storage, or lost. EpiDoc records this with three sub-elements inside <history>:
铭文有自己的"传记"。它在某一时间地点被造出,后来被发现(往往晚得多),又被搬动(不止一次),如今可能在展厅、库房,或失踪。EpiDoc 在 <history> 中以三个子元素记录这一切:
<history> <origin> ← when and where made <origPlace ref="...">Findspot.</origPlace> <origDate notBefore="0117" notAfter="0138" precision="low" evidence="reign">C.E. 117-138</origDate> </origin> <provenance type="found"> ← when and where excavated/observed <p><placeName type="ancientFindspot" ref="https://www.slsgazetteer.org/909">Cyrene</placeName>: <placeName type="monuList" ref="https://www.slsgazetteer.org/1051">Temple of Artemis</placeName>; recorded in <date type="found">1861</date>.</p> </provenance> <provenance type="observed"> ← currently visible at this location <p>Temple of Artemis: standing on the wall to the right of the entrance.</p> </provenance> </history>
What this lets you ask: "show me all inscriptions found at Cyrene before 1900, currently lost." (You filter provenance/@type="found" with date < 1900, but find no provenance/@type="observed".) Or: "match origDate to provenance: how many years between an inscription's making and its rediscovery?" Possible because origin and provenance are encoded as separate, structured elements.
能问的问题:"列出昔兰尼出土、年代在 1900 年以前、现已失踪的铭文。"(筛 provenance/@type="found" 且日期 < 1900 而无 provenance/@type="observed"。)或:"把 origDate 对照 provenance:刻成与重见天日之间相隔多少年?"之所以能问,是因为 origin 与 provenance 各自是独立的结构化元素。
Aggregator equivalent: EDCS has a single place field that conflates origin and findspot. EDH separates them better. Only TEI EpiDoc keeps the full biography.
聚合器对应:EDCS 用一个 place 字段,把"刻作处"与"出土处"混在一起。EDH 分得稍好。只有 TEI EpiDoc 保留了完整的传记。
EpiDoc treats uncertainty as a first-class citizen. Almost every dating, attribution, and reading attribute can carry confidence markers, so a downstream analysis can either filter on them or propagate them.
EpiDoc 把"不确定性"作为一等公民对待。几乎每一个定年、归属、释读属性都可携带置信度标记,下游分析可据此过滤或传播。
| Attribute属性 | What it marks标的对象 | Example示例 |
|---|---|---|
cert="low" | low confidence in this value对此值置信度低 | <persName cert="low">Apollinaris</persName> |
precision="low" | value is approximate, not exact值为近似而非精确 | <depth precision="low">0.44</depth> |
evidence="…" | on what basis was this attributed以什么依据归属 | evidence="lettering" / "reign" / "context" |
resp="#jmr" | who is responsible for this judgement此判断由谁负责 | <date resp="#jmr">…</date> (J. M. Reynolds) |
notBefore / notAfter | date range bounds年代区间端点 | notBefore="0117" notAfter="0138" |
atLeast / atMost | measurement bounds测量值区间 | <height atLeast="0.003" atMost="0.006">0.003-0.006</height> |
match=".." | scope of the certainty置信度的作用域 | <certainty cert="low" locus="value" match=".."/> |
Why this drives method: the JDH 2021 paper's tempun toolkit propagates date uncertainty by Monte-Carlo-sampling each inscription's date range many times — only possible because notBefore and notAfter are separately encoded. If you collapse the date to a single number, you lose the basis for honest uncertainty propagation.
这如何驱动方法:JDH 2021 中的 tempun 工具包通过对每条铭文的年代区间做蒙特卡洛抽样来传播不确定性,这件事之所以可能,是因为 notBefore 与 notAfter 各自被独立编码。若把日期塌成一个数,就失去了"诚实地传播不确定性"的基础。
EpiDoc tags every proper name in the transcription, classifying them and linking to authority files. This makes "all inscriptions mentioning emperor Hadrian" a structured query rather than a string-search.
EpiDoc 把转写中的每一个专名都打上标签、分类、并链到权威文件。这让"所有提到哈德良元首的铭文"成为一项结构化查询,而非字符串搜索。
| Element元素 | What it marks标的对象 | Subtype examples子类型示例 |
|---|---|---|
<persName> | a person人名 | type="emperor" / "divine" / "attested" / "aphrodisian" |
<placeName> | a place地名 | type="ancientFindspot" / "modernFindspot" / "monuList" |
<orgName> | an organisation组织名 | e.g. a guild, college, civic body |
<name> | a single name (within a persName)一段单独的名字(嵌于 persName 内) | type="surname" / type="forename" |
<geogName> | a geographical region地理区域 | type="ancientRegion" key="Cyrene" |
<rs> | "referencing string" — generic semantic tag"指涉串",通用语义标签 | type="iconography" / "textType" / "execution" |
A real example from IAph 010001 (a Greek inscription mentioning the Roman dictator Caesar and the goddess Aphrodite):
真实示例(IAph 010001,一段希腊铭文,提到罗马独裁者凯撒与阿芙罗狄忒女神):
<persName type="emperor"> <name reg="Καῖσαρ"><supplied reason="lost" cert="low">Καῖσαρ</supplied></name> </persName> ... <persName type="aphrodisian"> <name reg="Γάϊος"><supplied reason="lost">Γάϊος</supplied></name> <name reg="Ἰούλιος"><supplied reason="lost">Ἰούλιος</supplied></name> <name reg="Ζωΐλος"><supplied reason="lost">Ζωΐλος</supplied></name> </persName> ... <persName type="divine"> <name reg="Ἀφροδίτη"><supplied reason="lost">Ἀφροδείτης</supplied></name> </persName>
The reg attribute normalises the surface form to a "regularised" form for searching. So reg="Ἀφροδίτη" on a stone that actually has Ἀφροδείτης means "this is the goddess Aphrodite, normalised." Now you can ask "all inscriptions mentioning Aphrodite" and get hits even where the spelling varies.
reg 属性把表面形归一化为标准形以便检索。如石面写 Ἀφροδείτης,但 reg="Ἀφροδίτη" 表示"这是阿芙罗狄忒女神的归一形"。于是可问"所有提到阿芙罗狄忒的铭文",即使拼法不同也能命中。
An ancient inscription is rarely transcribed identically by every editor who has read it. The apparatus criticus is the place where competing readings are recorded — Editor A reads x, Editor B reads y, the present editor follows A. EpiDoc represents this as a special <div type="apparatus"> block.
同一块铭文很少能让所有读过它的编辑给出完全一致的转写。"异文校勘记"就是用来记录互相竞争的释读,编辑 A 读 x,编辑 B 读 y,本版从 A。EpiDoc 用一个特殊的 <div type="apparatus"> 区块来呈现。
<div type="apparatus">
<p>Line 2: <app>
<lem resp="#jmr">Apolli<supplied reason="lost">naris</supplied></lem>
<rdg resp="#jbwp">Apolli<supplied reason="lost">nius</supplied></rdg>
</app>.</p>
<p>For the supplements, compare the partner inscription
<ref type="inscription" n="1581" href="010038">1.38</ref>.</p>
</div>
Reading this: in line 2, the present editor (Joyce M. Reynolds) reads "Apollinaris"; an alternative editor (J. B. Ward-Perkins) read "Apollinius." The <app> tag wraps the disputed passage; <lem> ("lemma") is the chosen reading; <rdg> ("reading") is each alternative. Each carries resp="…" linking to an editor's xml:id.
读法:第 2 行中,本版编辑(Joyce M. Reynolds)读为 "Apollinaris";另一编辑(J. B. Ward-Perkins)读为 "Apollinius"。<app> 标签包住有争议段落;<lem>(lemma 当选释读);<rdg>(reading 备选释读)。每一个都带 resp="…",链到某编辑的 xml:id。
What this enables: "show me all inscriptions where two editors disagree on a personal name." That query is impossible against EDCS or PHI (which collapse the apparatus into the main text or omit it entirely). It is a single XPath against EpiDoc XML.
由此能做什么:"列出所有'两位编辑对人名释读不一致'的铭文。"此查询在 EDCS 或 PHI 上无解(它们把校勘并入正文或干脆删除)。在 EpiDoc XML 上则只是一行 XPath。
<revisionDesc> — the digital record's biography<revisionDesc>,数字记录自身的传记Just as the stone has a history, the digital record about it has one too. Every meaningful change to a TEI file is logged with date, editor, and a short note. This is editorial transparency at its most basic: any reader can see who decided what, when.
石头有历史,关于它的数字记录也有。TEI 文件每一次有意义的修改,都附日期、编辑者、简短说明。这是最基本的编辑透明:任何读者都能看清"谁、何时、做了什么决定"。
<revisionDesc> <change when="2021-02-20" who="CMR">adapted</change> <change when="2013-08-22" who="TW">tagged words, names and placenames</change> <change when="2012-11-05" who="GB">moved text-constituted-from to profileDesc/creation</change> <change when="2012-10-16" who="GB">broke provenance events into multiple provenance elements</change> <change when="2010-08-18" who="GB">Converted from TEI P4 (EpiDoc DTD v. 6) to P5 (EpiDoc RNG schema v. 8)</change> <change when="2009-08-24" who="RV">Added Figures</change> <change when="2008-09-09" who="ZA">converted using CHET-C</change> </revisionDesc>
What you can read out of this: the IRT0001 record was originally created in TEI P4 (older), converted to P5 in 2010, had figures added in 2009, broke its provenance into separate elements in 2012, gained word/name/place tagging in 2013, and was last adapted in 2021. The digital record's history is itself a research object — encoding fashions changed; this file evolved with them.
能读出的信息:IRT0001 记录最初以 TEI P4(旧版)创建,2010 年转为 P5;2009 年增图;2012 年把 provenance 拆为独立元素;2013 年加上词/人名/地名标注;2021 年最后一次调整。数字记录自身的历史本身就是一项研究对象,编码风潮变迁,本文件随之演进。
For citation: when citing a TEI EpiDoc record, modern practice is to include the last revision date alongside the URI. The data may have changed since you first read it; the revision-history makes that explicit.
引用时:引用一份 TEI EpiDoc 记录,现代规范是把 最近修订日期 与 URI 一并列出。读到之后数据可能已变;revision-history 把这件事写明。
The Inscriptions of Roman Tripolitania (IRT) project: 1,618 records from Roman-period Libya. Each is a TEI XML file with this structure:
罗马时期 Tripolitania 铭文项目(IRT):1,618 条来自利比亚罗马时期的铭文。每条是一份 TEI XML,结构如下:
<TEI xml:id="IRT0001" xml:lang="en"> <teiHeader> <fileDesc> <titleStmt> <title>Reference to Apollo (?)</title> <editor xml:id="jmr">J. M. Reynolds</editor> </titleStmt> <publicationStmt> <publisher>Society for Libyan Studies</publisher> <idno type="filename">IRT0001</idno> <availability><p>CC BY 2.0 UK</p></availability> </publicationStmt> <sourceDesc><msDesc> <msIdentifier><repository>Findspot</repository></msIdentifier> ...physical description, layout, dimensions, dating... </msDesc></sourceDesc> </fileDesc> </teiHeader> <text><body> <div type="edition" xml:lang="la">...transcription with line numbers, brackets, certainty markers...</div> <div type="translation">...</div> <div type="commentary">...</div> <div type="bibliography">...</div> </body></text> </TEI>
Notice the four <div type=…> blocks: edition (transcription), translation, commentary, bibliography. This is the EpiDoc convention. Every regional project follows it. An aggregator like EDCS, by contrast, flattens these into separate columns, often losing the editorial-certainty markup.
注意四个 <div type=…> 区块:edition(转写)、translation(译文)、commentary(评注)、bibliography(参考文献)。这是 EpiDoc 的惯例。每个区域项目都遵循它。聚合器(如 EDCS)则把这些字段铺平为独立列,常常在过程中丢失"编辑确定性"标记。
The Aphrodisias Inscriptions project (IAph 2007 — Joyce M. Reynolds, KCL) goes one step further than most regional projects: every word in every Greek transcription carries a <w lemma="…"> wrapper that identifies its dictionary form. This makes morphological queries possible at scale — without re-parsing the text.
阿芙罗狄西亚铭文项目(IAph 2007,KCL,Joyce M. Reynolds 主编)比大多数区域项目更进一步:每个希腊文转写中的每个词都包在 <w lemma="…"> 元素里,标注其辞典形(lemma)。这使得"按词形查询"可以批量进行,不需要重新分词。
<ab><lb n="1"/> <w lemma="οὗτος"><supplied reason="lost">οὗτος</supplied></w> <w lemma="ὁ"><supplied reason="lost">ὁ</supplied></w> <w lemma="τόπος"><supplied reason="lost">τόπο</supplied><unclear reason="damage">ς</unclear></w> <w lemma="ἱερός">ἱερὸς</w> ...
Reading this: "οὗτος" is reconstructed (lost on the stone, supplied by the editor); "τόπος" is partly lost and partly damaged; "ἱερός" is intact. The lemma values let you ask "how often does ἱερός occur in IAph?" without re-tokenizing.
读法:"οὗτος" 是补足(石上已失,由编辑补;"τόπος" 部分缺失、部分残损;"ἱερός" 完整保留。lemma 标注让你能够直接问:"ἱερός 在 IAph 中出现多少次?",而不需要重新切词。
Trade-off: This depth costs editorial labour. IAph 2007 has 1,489 records — a large team's lifetime work. EDCS has 537,000 records, but no lemma tagging at all. Depth and breadth are inversely correlated.
取舍:这种深度需要编辑投入大量人工。IAph 2007 收 1,489 条,一支团队几代人的工作。EDCS 收 53.7 万条,但完全没有 lemma 标注。深度与广度成反比。
From the East Church at Apollonia (Cyrenaica), 6th century CE: a mosaic floor depicting Noah releasing the dove from the Ark, with a Greek caption. Read this record as one example of how EpiDoc captures genre + visual programme + Christian context.
来自昔兰尼加阿波罗尼亚的东教堂,公元 6 世纪:一幅马赛克地板,描绘诺亚从方舟中释放鸽子,旁附希腊文 caption。把这条记录当作一个范例:EpiDoc 如何同时记录"体裁 + 视觉方案 + 基督教语境"。
<idno type="filename">A.1</idno> <idno type="ircyr2012">A00300</idno>
Two parallel IDs: project filename + project's own 2012 numbering. No TM-ID — the inscription is too local for that registry.
两个并行 ID:项目文件名 + 项目 2012 年的编号。无 TM-ID,该铭文对那套登记来说过于在地。
<title> <rs type="textType">Caption</rs> for image of Noah </title>
textType="Caption" distinguishes this from a dedication, an epitaph, an acclamation, etc. Joinable with all other "captions" in the corpus.
textType="Caption" 把它与"题献""墓志""欢呼语"等区别开。可与语料中其他所有"caption"做联接。
<layout> <rs type="execution" key="tessellis">Mosaic</rs> lettering, in a panel representing <rs type="iconography">Noah</rs> as he releases the <rs type="iconography">dove</rs> from the <rs type="iconography">Ark</rs>. </layout>
<origDate notBefore="0501" notAfter="0600"
precision="low" evidence="mosaic">Sixth century CE</origDate>
Notice the evidence="mosaic" — this is the chain of reasoning made queryable. You could ask "all mosaic-dated Christian captions in IRCyr."
注意 evidence="mosaic":把"推理链"做成可查询。可以问"IRCyr 中所有以马赛克风格定年的基督教 caption"。
<keywords scheme="IRCyr">
<term><geogName type="ancientRegion" key="Cyrene">Cyrenaica</geogName></term>
<term><geogName type="modernCountry" key="LY">Libya</geogName></term>
<term><placeName type="modernFindspot"
ref="http://sws.geonames.org/81584">Marsa Suza</placeName></term>
<term>Christian</term>
</keywords>
The full record exists in your local ~/Documents/epidoc/cyrenaica/A.1.xml. Across all of IRCyr's 2,360 records, this same structural vocabulary applies — which makes corpus-level questions ("how many Christian inscriptions in 6th-century Cyrenaica are mosaic captions?") into single XPath queries.
完整记录就放在你本地的 ~/Documents/epidoc/cyrenaica/A.1.xml。IRCyr 全部 2,360 条记录都遵循同一套结构化词汇,因此"6 世纪昔兰尼加有多少基督教马赛克 caption"这种语料级问题,化作一行 XPath 查询。
A different genre: a Greek dedication scratched onto a Corinthian-ware kothon (a small ritual jug), found at Taucheira (Cyrenaica) in 1963–65. The IGCyr record (002200) is shorter than the Roman entablature in the previous case study — but the same structural skeleton is there.
另一个体裁:一段希腊文奉献铭文,刻于一只科林斯式 kothon(小祭祀壶)的器壁上,1963–65 年间在塔乌克拉(昔兰尼加)出土。IGCyr 记录(002200)比上一案例的罗马檐部短得多,但同一套结构骨架仍然在场。
<support><p>Two adjacent fragments of a
<objectType>kothon</objectType>,
<material>Corinthian ware</material>
(<dimensions>
<width atLeast="0.074">0.074-</width>
<height/><depth/>
</dimensions>).</p></support>
Width recorded as atLeast="0.074" — a one-sided bound, since the object is broken. Height and depth are empty elements: explicit "we don't know."
宽度记为 atLeast="0.074",单边下界,因物已残。高、深为 空 元素:显式表达"未知"。
<handNote>
<height atLeast="0.003" atMost="0.006">
0.003-0.006</height>;
epsilon with three equal bars,
straight iota,
rho without tail.
</handNote>
Letter shapes — three-bar epsilon, straight iota, tail-less rho — are the basis of the date estimate. Encoded as prose, but cite-able.
字母形态(三条平行 epsilon、笔直 iota、无尾 rho)是定年依据。以散文记录,但可被引用。
<layout> <rs type="execution" key="graffito">Scratched</rs> on wall, below frieze. </layout>
The execution field distinguishes graffiti from formal inscriptions, mosaics, paint, etc. How does scratched-on-pottery dedication culture differ from formally-cut stone dedication? — the kind of question this controlled value lets you frame.
execution 字段把涂刻区分于正式刻铭、马赛克、绘文等。"陶器上刮刻的奉献文化"与"石上正式刻凿的奉献"如何不同?",受控值让这类问题得以提出。
<provenance type="found" notBefore="1963" notAfter="1965"> Found between 1963 and 1965 at Taucheira: archaic votive deposit. </provenance> <provenance type="observed" when="1965" resp="Boardman"> Seen in 1965 by J. Boardman. </provenance> <provenance type="not-observed"> Not seen by IGCyr team. </provenance>
Three sequential events: found, observed by Boardman, not observed by the present editors. The negative event is itself a record — modern editorial honesty about not having had autopsy.
三个先后事件:出土、被 Boardman 观察、本版编辑 未 观察。"未观察"也是一项记录,当代编辑对"未亲自察看"的诚实。
The whole IGCyr corpus is licensed CC BY-NC 4.0; bulk download via Bologna's repository at doi.org/10.6092/UNIBO/IGCYRGVCYR.
整个 IGCyr 语料库采用 CC BY-NC 4.0;批量下载经博洛尼亚机构知识库 doi.org/10.6092/UNIBO/IGCYRGVCYR。
The Inscriptions of the Northern Black Sea Coast (IOSPE) project is a methodologically distinctive case: every editorial field in the TEI is repeated in both Russian and English (with xml:lang="ru" and xml:lang="en"). The project's intended user-base spans two scholarly communities with little overlap; the database serves both natively.
"北黑海沿岸铭文"项目(IOSPE)是方法论上的独特案例:TEI 中的每一个编辑字段都用俄语与英语并列两版(分别带 xml:lang="ru" 与 xml:lang="en")。项目预设的使用者跨越两个几乎不重叠的学术圈,数据库为二者同时原生服务。
<origDate notBefore="1451" notAfter="1452" evidence="эпиграфический_контекст" cert="low"> <seg xml:lang="ru">1451–1452 гг.</seg> <seg xml:lang="en">1451–1452 C.E.</seg> </origDate> <provenance type="observed" subtype="autopsy"> <seg xml:lang="ru">Non vidi.</seg> <seg xml:lang="en">Non vidi.</seg> </provenance>
What this lets you do: read the same record in either language, do bilingual NLP work, search Russian terminology against English equivalents. What it costs: editorial labour roughly doubled. Who benefits: Black-Sea archaeology, where Russian-language scholarship is the largest body.
由此能做什么:同一记录可任选语言阅读;可做双语 NLP;可把俄语术语与其英语对应版互查。代价:编辑工作量约翻倍。受益者:黑海考古学界,俄语文献是该领域最大的存量。
Why these matter despite their simplicity: for the Greek epigraphic world, PHI has been the de facto search engine for 30 years. Many printed editions cite the PHI ID rather than (or alongside) the IG / SEG number. PHI's data ingestion is also not deduplicated — the same stone may appear under multiple PHI numbers because it was published twice in different series.
为何"简朴"仍然要紧:就希腊铭文世界而言,PHI 是事实上的搜索引擎,已三十年。许多印本会用 PHI 编号引用,甚至代替(或与)IG / SEG 编号。PHI 的数据收录 不去重,同一块石头可能因被两套出版物各自著录,在 PHI 中出现多个编号。
EAGLE (Europeana network of Ancient Greek and Latin Epigraphy) was an EU-funded effort to harmonise Italian regional databases under a shared identifier scheme. The data infrastructure outlived the project; the federation today is a loose alliance.
EAGLE("欧洲古希腊语与拉丁语铭文网")是欧盟资助、试图把意大利各区域数据库纳入一套共享标识符方案下的努力。项目结束后,数据基础设施仍在运转;今天它是一个松散联盟。
| Database数据库 | Coverage覆盖 | Records条数 | Strong on强项 | Link |
|---|---|---|---|---|
| EDR | Italy + Italian islands (Latin) | ~80,000 | regional Italian metadata, photographs | edr-edr.it ↗ |
| EDB | Christian inscriptions of Rome | ~40,000 | Christian inscriptions, catacombs | edb.uniba.it ↗ |
| EDV | Vienna's epigraphic database | varies | Roman provinces of Pannonia + Noricum | edv ↗ |
| EAGLE Mediawiki | federation portal — vocabularies, ID resolver | n/a | cross-database ID resolver | eagle-network.eu ↗ |
| Ubi Erat Lupa | Roman stone monuments (image-first) | ~50,000+ | photographs, especially funerary stelai | lupa.at ↗ |
The EAGLE promise: a single resolver that, given any partner's ID, would return the records held in all the others. The reality: partial. Some pairs of partners cross-reference reliably, others don't — so any "EAGLE-aware" query has to be coded defensively.
EAGLE 的承诺:建一个解析器,给定任一合作方的 ID,即可返回其他成员所持有的对应记录。现实:部分实现。某些合作方间交叉引用稳定,另一些则不稳定,因此任何"EAGLE 感知"的查询都必须做防御性编码。
Papyrology and epigraphy are technically distinct (papyrus vs stone) but share the same TEI EpiDoc tooling, the same Leiden conventions, and a sharper identifier ecosystem. Their backbone is papyri.info, a single portal that aggregates DDbDP texts + HGV metadata + APIS images.
莎草纸学与铭文学技术上有别(介质是莎草纸而非石头),但共用 TEI EpiDoc 工具链、共用 Leiden 惯例,且 ID 生态更清晰。它们的中枢是 papyri.info,单一门户,聚合 DDbDP 文本、HGV 元数据、APIS 图像。
| Database数据库 | What it provides提供什么 | IDID | Records条数 |
|---|---|---|---|
| DDbDP | text editions of documentary papyri (TEI) | cde.85.247 | ~60,000 |
| HGV | papyrological metadata (date, place, text-type) | HGV 79368 | ~70,000 |
| APIS | papyrus images (American consortium) | institutional | ~60,000 |
| DCLP | literary papyri (Heidelberg + Duke) | TEI | ~10,000+ |
| Trismegistos | universal join key — TM-ID across all of the above | TM 79368 | ~800,000+ |
| papyri.info | single-portal aggregation of all of the above | n/a — federated | resolves all |
The lesson for inscriptions: papyrology has solved the federation problem the inscription world is still working on. A papyrus has one TM-ID; that one number resolves to text (DDbDP), metadata (HGV), and images (APIS), each maintained by a different institution. Inscription databases — even within EAGLE — don't have an equivalent universal resolver yet.
对铭文界的启示:莎草纸学已解决了铭文界仍在挣扎的"联邦化"问题。一份莎草纸只有 一个 TM-ID;这一个号码解析为文本(DDbDP)、元数据(HGV)、图像(APIS),各由不同机构维护。铭文数据库,即使在 EAGLE 内部,至今没有等同的"通用解析器"。
<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="en"> <teiHeader><fileDesc> <titleStmt><title>cde.85.247</title></titleStmt> <publicationStmt> <authority>Duke Collaboratory for Classics Computing (DC3)</authority> <idno type="filename">cde.85.247</idno> <idno type="ddb-hybrid">cde;85;247</idno> <idno type="HGV">140713</idno> <idno type="TM">140713</idno> <availability>CC BY 3.0</availability> </publicationStmt> </fileDesc></teiHeader> <text><body>...transcription...</body></text> </TEI>
Notice the four identifiers in one block: filename, ddb-hybrid, HGV, and TM. Each is the join key into a different system. Most papyri carry all four. Most stone inscriptions, by contrast, carry only one (the project-local one).
注意一段里 四个 标识符:filename、ddb-hybrid、HGV、TM。每一个都是通往另一个系统的"联接键"。绝大多数莎草纸都同时带全部四个。相比之下,绝大多数石质铭文只带一个(项目自定的那个)。
Practical implication: if you build an analysis pipeline that joins inscription text to geography, dating, or images, you will spend most of your engineering time figuring out which records correspond. With papyri, this is a JOIN ON tm. With inscriptions, it is custom code per database.
实际意味:若你写一个把"铭文文本 → 地理 → 定年 → 图像"联起来的分析管道,绝大多数工程时间会耗在搞清楚"哪条记录对应哪条"。莎草纸只需 JOIN ON tm。铭文则要为每个数据库写一段自定义代码。
DeepMind's Ithaca (Assael et al., Nature 2022) is a transformer-based model that proposes restorations, geographical attributions, and dates for damaged Greek inscriptions. Its training data — iPHI — is PHI's text dump, deduplicated and re-formatted into train/val/test splits.
DeepMind 的 Ithaca(Assael 等,Nature 2022)是一个基于 transformer 的模型,可为损坏的希腊文铭文给出修复、地理归属、定年的建议。它的训练数据,iPHI:是 PHI 文本导出版,经去重并重新切分为 训练 / 验证 / 测试 集。
The lesson: ML datasets are distilled versions of the original corpus. Ithaca's restorations are extraordinary, but the model never sees the stone, the squeeze, the photo, or the editorial commentary. When the model proposes a date, it is generalising over the patterns in the labels — including the biases in those labels.
启示:ML 数据集是原语料库的 蒸馏版。Ithaca 的修复能力非凡,但它从未看到石头、压模、照片或编辑评注。当模型给出年代时,它是在标签的模式上泛化,也包括那些标签中的 偏差。
Other ML datasets in the family: PHI-Latin (Latin-side mirror, in development), Latin BERT (trained on classical Latin corpora — not just inscriptions), and various single-task fine-tunings released as Hugging Face checkpoints.
家族内其他 ML 数据集:PHI-Latin(拉丁面镜像,开发中)、Latin BERT(在古典拉丁语料上训练,不限铭文)、以及若干以 Hugging Face checkpoint 形式发布的单任务微调模型。
If you work with inscriptions and you also use literary texts, you'll meet TEI in this other shape. The canonical-greekLit repository (Open Greek and Latin / Perseus) holds works of ancient Greek authors, each one a TEI XML file inside a folder structure that is itself a citation system.
若你做铭文研究、又用古典文学文本,会遇到 TEI 的另一形态。canonical-greekLit 仓库(Open Greek and Latin / Perseus 项目)收录古希腊作家作品,每一篇都是一份 TEI XML 文件,所在文件夹结构本身就是一套引用系统。
data/ tlg0057/ ← Galen (TLG author #57) __cts__.xml ← textgroup metadata tlg010/ ← work #10 (On the Natural Faculties) __cts__.xml ← work-level metadata tlg0057.tlg010.perseus-grc2.xml ← Greek text edition tlg0057.tlg010.perseus-eng2.xml ← English translation
urn:cts:greekLit:tlg0057.tlg010.perseus-grc2:1.5.3
↑ ↑ ↑ ↑ ↑
namespace author work edition passage
Reading: "in the canonical Greek literature collection, Galen's work tlg010, the Perseus 2 Greek edition, book 1, chapter 5, section 3." This URN can be plugged into Scaife Viewer to render that exact passage.
读法:"在 canonical-greekLit 集合中,盖伦作品 tlg010,Perseus 第 2 希腊文版,第 1 卷第 5 章第 3 节。"此 URN 可输入 Scaife Viewer,直接呈现该段。
How this differs from inscription EpiDoc: for literature, the structural skeleton is book / chapter / section, not edition / translation / commentary / apparatus. The same TEI namespace; different conventions about which sub-elements appear and how they nest. EpiDoc projects use <div type="edition">; literary projects use <div type="textpart" subtype="book" n="1">. Both are valid TEI.
与铭文 EpiDoc 的差别:文学作品的结构骨架是 卷/章/节,而非 edition/translation/commentary/apparatus。同一个 TEI 命名空间;不同惯例下子元素出场顺序与嵌套方式不同。EpiDoc 项目用 <div type="edition">;文学项目用 <div type="textpart" subtype="book" n="1">。两者都是有效 TEI。
Why this matters for inscription work: inscriptions sometimes quote, paraphrase, or refer to literary texts. With both sides in TEI + URNs, you can join "Pindar Ol. 13.5" mentioned in IRCyr to the actual passage in Pindar's TEI file. Without that infrastructure, you'd be doing it by hand.
这对铭文研究为何要紧:铭文有时引用、化用、提及文学文本。两端都在 TEI + URN 体系下,你可以把 IRCyr 中提到的 "Pindar Ol. 13.5" 联接到 Pindar TEI 文件中的实际段落。没有这套基础设施,就只能逐一人工处理。
A concrete walk-through of how one inscription becomes a row in the SDAM analysis-ready dataset, naming the database family at each step:
具体走一遍:一块铭文如何变成 SDAM 分析就绪数据集中的一行,每一步命名所属"家族":
| Step步 | What happens发生什么 | Format格式 | Family家族 |
|---|---|---|---|
| 1 | Editor (Mommsen, 1883) transcribes the stone in autopsy编辑(Mommsen,1883)实地察看后转写 | printed CIL X volume印本 CIL X 卷 | print authority印本权威 |
| 2 | EDCS team re-keys CIL into a relational recordEDCS 团队把 CIL 重新键入关系型记录 | HTML web record | F1 aggregator |
| 3 | SDAM ETL pipeline (EDCS_ETL) scrapes the EDCS web recordSDAM ETL(EDCS_ETL)抓取 EDCS 网页记录 | raw HTML → JSON | F1 + ETL |
| 4 | Pipeline runs regex cleaning to produce conservative + interpretive管道跑正则清洗,产出 conservative 与 interpretive | cleaned JSON fields | F1 + ETL |
| 5 | Geographic info joined to Pleiades URI for the place将地理信息联接到 Pleiades 的地名 URI | JSON-LD enriched | linked-data |
| 6 | Date range fed into tempun Monte Carlo sampling把年代区间送入 tempun 做蒙特卡洛抽样 | numpy arrays per record | SDAM-internal |
| 7 | All ~537,000 records concatenated into a parquet file~537,000 条全数并入一份 parquet | LIST_v0-3.parquet | F1 derivative |
What survives the chain: the cleaned text, the place coordinates, a date probability distribution, the EDCS-ID, partner_link to PHI/EDR/I.Sicily/OSU. What does not survive: the editor's confidence on individual letters; the apparatus criticus; the photograph URL (often dropped); the structural distinction between Latin and Greek halves of bilingual stones.
这条链上幸存的:清洗文本、地理坐标、年代概率分布、EDCS-ID、partner_link 指向 PHI/EDR/I.Sicily/OSU。不幸存的:编辑者对单个字母的置信度;异文校勘;照片 URL(常被丢);双语铭文中拉丁面与希腊面的结构区分。
For full mechanics, see Reference Edition § 4.5 (the actual code) or the Visual edition (the same chain visually).
TEI (Text Encoding Initiative) is the XML standard for textual scholarship in the humanities. EpiDoc is its epigraphy customisation — a smaller, sharper vocabulary for writing inscriptions that preserves editorial judgement as structured tags. Most regional EpiDoc projects, papyrology databases, and even some aggregator backends speak this language.
TEI(文本编码倡议)是人文学科文本研究的 XML 标准。EpiDoc 是其铭文学版本,一套更小、更锐利的词表,用以书写铭文,并把编辑判断保留为结构化标签。绝大多数区域 EpiDoc 项目、莎草纸数据库、甚至部分聚合器后台都说这门语言。
| EpiDoc tagEpiDoc 标签 | What it marks标的对象 | Leiden equivalentLeiden 等价物 |
|---|---|---|
<supplied reason="lost"> | letters supplied by the editor (lost on stone)编辑补足的字母(石上已失) | [ ] |
<supplied reason="omitted"> | letters not on stone but inferred石上无、但推得 | < > |
<unclear> | letters partially visible仅部分可见 | ạ (dotted) |
<gap reason="lost"> | unrecoverable lost section不可恢复的缺失段 | [- - -] |
<expan><abbr> | abbreviation + expansion缩略 + 展开 | D(is) M(anibus) |
<lb n="…"/> | line break with line number换行(带行号) | / 1 |
<w lemma="…"> | word with dictionary form词+辞典形(lemma) | (no equivalent) |
<persName> / <placeName> | named entity命名实体 | (no equivalent) |
<date notBefore="…" notAfter="…"> | structured date range结构化年代区间 | (prose only) |
Why this matters: EpiDoc adds three things that Leiden conventions alone cannot — queryable certainty (you can filter for "high-confidence readings only"), queryable lemmas (you can search by lexical form, not by surface string), and queryable named entities (you can ask "how often does Diocletian appear in IRT?").
为何要紧:EpiDoc 在 Leiden 惯例 之外多了三样东西,可查询的确定性(可只筛"高置信度的释读")、可查询的词形(按辞典形而非表面拼写搜索)、以及 可查询的命名实体(可问"Diocletian 在 IRT 中出现多少次?")。
Before computers, epigraphers signalled editorial judgement with a system of brackets and dots, agreed at the 1931 Leiden conference. Almost every printed transcription you read still uses them. PHI inherits them as plain-text glyphs; EpiDoc translates them into XML tags.
计算机出现前,铭文学者用 1931 年 Leiden 会议确立的一套括号与符号系统传达编辑判断。几乎你读到的每一份印本转写都还在用它。PHI 把它们当作纯文本字符继承下来;EpiDoc 则把它们翻译成 XML 标签。
| [abc] | letters lost, restored by editor已失,由编辑补 |
| [- - -] | unrecoverable gap (3 letters)不可恢复缺口(3 字母) |
| a̲b̲c̲ or ạḅc | letters partly visible仅部分可见 |
| (abc) | expansion of an abbreviation缩略展开 |
| ⟨abc⟩ | letters omitted on stone, supplied石上漏刻,补入 |
| {abc} | letters wrongly inscribed, deleted误刻,编辑去除 |
| ⟦abc⟧ | erased on stone石上被擦除 |
| «abc» | re-inscribed over an erasure擦除处重新刻 |
| vac. | vacant space (uncut)空格(未刻) |
[abc] → <supplied reason="lost">abc</supplied>
⟨abc⟩ → <supplied reason="omitted">abc</supplied>
{abc} → <surplus>abc</surplus>
⟦abc⟧ → <del rend="erasure">abc</del>
ạḅc → <unclear>abc</unclear>
[- - -] → <gap reason="lost" extent="3"
unit="character"/>
(abc) → <expan><abbr/><ex>abc</ex></expan>
vac. → <space unit="character"
extent="…"/>
A printed CIL volume page will read like a thicket of brackets. EpiDoc XML reads like a thicket of tags. The two convey the same editorial information; the second is queryable, the first is not.
一页印本 CIL 是一片"括号丛林"。一份 EpiDoc XML 是一片"标签丛林"。二者传达同样的编辑信息;后者可查询,前者不可。
A 15-element metadata standard used by libraries and museums (DSpace, ePrints). What OSU's Knowledge Bank uses for its inscription photographs. Names objects with title, creator, date, subject, language, identifier, etc. — but doesn't model an inscription's editorial structure.
图书馆与博物馆(DSpace、ePrints)通用的 15 字段元数据标准。OSU 知识库为其铭文照片所用。给对象配 title、creator、date、subject、language、identifier 等,但不建模铭文的编辑结构。
EDCS's bulk export format. One row per inscription, fixed columns. Easy to load with pandas; easy to chart; loses everything that doesn't fit a column. The SDAM ETL pipeline lives here.
EDCS 批量导出的格式。一行一条铭文,列固定。pandas 易读、易制图;放不进列里的东西都会丢失。SDAM ETL 管道就生活在这一层。
An aspirational layer where each field carries a URI naming what kind of thing it is. Pleiades is JSON-LD. EAGLE's vocabularies aspire to be. This is the world LOD-upgrade-prompts gesture toward.
一种"愿景"层:每个字段都带 URI 标明"它是什么类型"。Pleiades 用 JSON-LD。EAGLE 的词表"志在"如此。LOD 升级提案所指向的世界。
The trade-off: richer encoding (TEI EpiDoc → Linked Data) lets you do more sophisticated queries but requires more editorial labour and a larger learning curve. Flatter encoding (CSV → Dublin Core) is easier to load into pandas / a database / a model but loses fidelity to the editor's judgement. Most projects pick one position on this spectrum and stay there; ETL pipelines like SDAM's translate from one to another, accepting the loss.
取舍:越富的编码(TEI EpiDoc → Linked Data)越能做复杂查询,但要求更多编辑投入与更陡的学习曲线。越扁平的编码(CSV → Dublin Core)越容易加载进 pandas / 数据库 / 模型,但更难忠于编辑判断。大多数项目会在这条光谱上选一个位置守住;像 SDAM 这样的 ETL 管道则在两端之间转换,并承认转换中的损失。
Linked data treats every fact as a triple: subject — predicate — object. Instead of "the inscription EDCS-22000882 was found in Palermo," you write three things:
关联数据把每一条事实视为一个三元组:主语 — 谓语 — 宾语。"铭文 EDCS-22000882 出土于巴勒莫"被写成三件事:
<edcs:22000882> <dcterms:isFoundAt> <pleiades:462492> . ← findspot triple <pleiades:462492> <rdfs:label> "Panhormus" . ← human-readable label <pleiades:462492> <geo:lat> "38.111" . ← geo coordinate
Each subject and predicate is itself a URI. Anyone can add a triple about edcs:22000882 from a different system, and a query engine can join across all of them.
每个主语与谓语本身都是一个 URI。任何人可在另一系统中为 edcs:22000882 添加一条三元组;查询引擎可跨系统联接。
The serialisation. RDF can be written as Turtle, RDF/XML, N-triples, or JSON-LD. JSON-LD is most common today because it looks like ordinary JSON but resolves predicates to URIs.
序列化形式。RDF 可写成 Turtle、RDF/XML、N-triples 或 JSON-LD。JSON-LD 当今最常见,看着像普通 JSON,但谓语会解析为 URI。
{"@context": {"foaf": "http://…"},
"@id": "edcs:22000882",
"foaf:depicts": {"@id":"pleiades:462492"}
}
The query language. You write a graph pattern with variables; the server matches it.
查询语言。你写带变量的图谱模式,服务器返回匹配。
SELECT ?inscription ?place WHERE {
?inscription foaf:depicts ?place .
?place geo:in_region pleiades:Sicilia .
?inscription rdf:type epidoc:Bilingual .
}
Where this lives in epigraphy: Pleiades exposes a SPARQL endpoint and JSON-LD serialisation. EAGLE's vocabularies are linked-data resources. Some EpiDoc projects (I.Sicily, IRT) expose JSON-LD per record. Where it does not yet live: EDCS, EDH, PHI — these speak HTML/JSON, not the graph. The "ETL into linked data" project is real but ongoing.
它在铭文学中已落地之处:Pleiades 提供 SPARQL endpoint 与 JSON-LD 序列化;EAGLE 词表本身就是关联数据资源;部分 EpiDoc 项目(I.Sicily、IRT)按条记录提供 JSON-LD。尚未落地之处:EDCS、EDH、PHI,它们说 HTML/JSON,不说图谱。"ETL 到关联数据"的工程真实存在,但仍在推进中。
IIIF (International Image Interoperability Framework) is the standard most well-funded epigraphy projects use to publish images. It defines two APIs:
IIIF(国际图像互操作框架)是经费充足的铭文项目发布图像所用的标准。它定义两套 API:
A URL pattern that returns a region of an image at any zoom and rotation. So …/full/600,/0/default.jpg asks for "the full image, scaled to 600px wide, no rotation, default quality."
一种 URL 模式:返回图像在任意缩放与旋转下的某个区域。如 …/full/600,/0/default.jpg 表示"完整图像、缩放至 600 像素宽、不旋转、默认品质"。
.../iiif/2/ISic000470/ full/full/0/default.jpg .../iiif/2/ISic000470/ 2000,1500,400,400/full/0/default.jpg
A JSON-LD "manifest" that describes a multi-page object: title, sequence of canvases, each canvas pointing to one or more Image-API images, optionally with annotations.
一份 JSON-LD "manifest",描述多页对象:标题、画布序列;每个画布指向一或多张 Image-API 图像,可附标注。
{
"@context": "http://iiif.io/...",
"type": "Manifest",
"items": [{ "type": "Canvas", ... }]
}
Why this matters for inscription study: a IIIF manifest can pull a photograph from KCL, a squeeze scan from Heidelberg, and a 3D-derived rendering from a third institution — into a single Mirador or Universal Viewer reading. You write the manifest once; the user sees a unified object. The bibliographic problem of "where is this stone visible?" becomes the technical problem of "stitching manifests."
这对铭文研究为何要紧:一份 IIIF manifest 可同时拉取 KCL 的照片、海德堡的压模扫描、第三方机构的 3D 渲染,在 Mirador 或 Universal Viewer 中合为单一阅读体验。manifest 写一次,用户看到的是统一对象。"这块石头在哪里可见"的文献学问题,由此化为"manifest 缝合"的技术问题。
Who currently uses it: I.Sicily (full IIIF), AshLI, partial IRT, Vindolanda Tablets Online (multispectral). Who does not (yet): EDCS, EDH, PHI — these still serve plain JPGs at fixed URLs.
目前使用 IIIF 的:I.Sicily(完全 IIIF)、AshLI、部分 IRT、Vindolanda 多光谱版。尚未使用的:EDCS、EDH、PHI,仍以固定 URL 提供普通 JPG。
In a paper, you cite an inscription by stacking three layers: the print authority, the digital project, and the access metadata. Here is a working template plus a real example:
论文中引用铭文,叠加三层:印本权威、数字项目、访问元数据。下面是模板与真实示例:
[Print citation: corpus, volume, number]
= [Digital project name] [project ID]
([URL], accessed [date], rev. [last revision date]).
CIL X 7296 = IG XIV 297
= I.Sicily ISic000470
(http://sicily.classics.ox.ac.uk/inscription/ISic000470,
accessed 2025-04-15, rev. 2024-06-12)
= EDCS-22000882
(https://edcs.hist.uzh.ch/?edcs-id=22000882, accessed 2025-04-15)
= EDR140617
(http://www.edr-edr.it/edr_programmi/res_complex_comune.php?do=book&id_nr=edr140617).
Why all three layers: the print citation is permanent (the book exists in libraries); the digital project ID lets a reader find your exact record; the access date + revision date pin which state of the data you saw. If the digital record changes, your citation still resolves to "the version I saw."
为何要三层:印本引用是永久的(图书馆里有书);数字项目 ID 让读者能找到你看的那条记录;访问日期 + 最近修订日期固定了你所看到的数据状态。即使日后数字记录有变,你的引用仍能解析到"我当时看到的那一版"。
For computational analyses: additionally cite the dataset version. The SDAM convention is to cite the parquet release (e.g., LIST_v0-3.parquet) so any reader can re-derive your numbers from the same starting state.
计算性分析另外要引用 数据集版本。SDAM 的惯例是引用 parquet 发行版(如 LIST_v0-3.parquet),让读者能从同一起点重现你的数字。
Each project assigns its own ID. Some try to be a universal join key (TM, Pleiades for places); others are project-local (ISic, IRT, edcs-id). To find one stone in two databases you usually need a hand-curated cross-reference.
每个项目都自定义 ID。有些试图扮演通用联接键(TM;Pleiades 之于地名);另一些是项目内部用(ISic、IRT、edcs-id)。要在两个数据库中找到同一块石头,通常需要人工维护的交叉引用。
| ID systemID 系统 | Run by维护方 | Format格式 | Purpose用途 |
|---|---|---|---|
| TM | Trismegistos (Leuven) | TM 79368 | universal join across papyri + ostraca + inscriptions |
| Pleiades | ISAW | 462492 | universal place identifier (Mediterranean places) |
| EDCS-ID | Heidelberg | EDCS-22000882 | EDCS internal |
| EDH-HD | Heidelberg | HD046800 | EDH internal |
| PHI numeric | Packard Humanities | 175744 | PHI internal — opaque numeric |
| ISic000470 | Oxford / I.Sicily | seven-digit padded | I.Sicily filename + URI |
| IRT0001 / IRCyr A.1 | KCL | project prefix + number | regional EpiDoc filenames |
| HGV | Heidelberg papyrology | HGV 79368 | papyrological metadata |
| DDB-hybrid | Duke | cde;85;247 | papyrological text editions |
| handle / DOI | various | 1811/99008 | persistent identifier for institutional repositories (OSU KB, Bologna, etc.) |
The aspiration: TM-IDs as the universal join. The reality: TM coverage is extensive for papyri (~800k) but uneven for stone inscriptions; many inscriptions don't have a TM-ID at all. Pleiades is more reliable for places. The EAGLE ID resolver exists but is partial.
愿景:TM-ID 作为通用联接键。现实:TM 在莎草纸上覆盖广(约 80 万),在石质铭文上覆盖参差;很多铭文根本没有 TM-ID。Pleiades 在地名上更可靠。EAGLE ID 解析器存在,但不完整。
| Image strategy图像策略 | Used by由谁采用 | Strengths优点 | Limitations局限 |
|---|---|---|---|
| IIIF deep-zoom | I.Sicily, IRT (partial), AshLI, IIIF-aware projects | deep zoom + cross-repository composition + stable URI | requires IIIF server; rare for older projects |
| JPG URL field | EDCS, EDH, EDR, Ubi Erat Lupa | simple, easy to fetch in bulk | no zoom, often low resolution, link rot |
| Squeeze archive scans | EDH, some museum collections | captures depth + contour | monochrome, often only one face |
| Photograph repository (no transcription) | OSU KB / CEPS, Berlin DAI photo archive | highest-resolution single asset | not linked to a transcription database |
| 3D model | EM Athens collection (partial), some research projects | captures 3D form, supports surface analysis | tiny coverage; large file sizes |
| None | PHI, McCabe corpora, most ML datasets | n/a — pure text | no visual control over editor's reading |
What this means in practice: if your research question depends on visual evidence (lettering style, layout, paint traces), you need to identify which databases hold the image for any given inscription — and accept that for many inscriptions, no public image exists at all. Image presence is itself a piece of metadata most aggregator dumps do not preserve.
实际意味:若你的研究问题依赖视觉证据(字体、版面、留色),你需要先确认哪些数据库拥有特定铭文的图像,并接受一个事实:很多铭文根本没有公开图像。"是否有图"本身就是一种元数据,而绝大多数聚合器导出版并不保留它。
| Database数据库 | Licence授权 | Bulk download批量下载 | Commercial reuse商业再利用 |
|---|---|---|---|
| I.Sicily | CC BY 4.0 | GitHub repo | yes (with attribution) |
| IRT, IRCyr | CC BY 2.0 UK | GitHub / KCL | yes (with attribution) |
| IGCyr / GVCyr | CC BY-NC 4.0 | Bologna repo | no (NC clause) |
| IAph 2007 | CC BY 2.5 | insaph.kcl.ac.uk | yes (with attribution) |
| IOSPE | CC BY 4.0 | KCL | yes (with attribution) |
| EDH | CC BY-SA | SDAM ETL public dump | yes (with share-alike) |
| EDCS | site terms | SDAM ETL via partners; not officially open | grey area — check terms |
| PHI | site terms | not formally open | grey area |
| EDR / EDB | site terms / EAGLE terms | via EAGLE federation | grey area |
| DDbDP / HGV / APIS | CC BY 3.0 | papyri.info, GitHub | yes (with attribution) |
| iPHI / Ithaca | Apache 2.0 (code) + CC BY (data) | GitHub + HF dataset | yes |
| OSU KB images | institutional terms | per-image | often no — request permission |
Practical advice: for any reproducible publication, prefer projects with explicit CC licenses. The SDAM ETL pipelines navigate the grey areas by maintaining their own re-distributable derivatives, hosted on partners (sciencedata.dk) and citable independently of the original databases.
实操建议:就可复现出版而言,优先选用授权明确的 CC 项目。SDAM ETL 管道之所以能在"灰色地带"里通行,是因为它维护自有的可再分发衍生版,托管在合作平台(sciencedata.dk),并独立于原数据库进行引用。
Almost every digital inscription database — even the newest ML dataset — ultimately descends from 19th-century printed editions. Knowing the print line helps you read the digital fields back to the editor who first transcribed the stone.
几乎每一个数字铭文数据库,包括最新的 ML 数据集,都最终源自 19 世纪的印本。了解印本谱系,能让你把数字字段读回到当年第一位转写这块石头的编辑。
Relevant when reading any database: the publication or idno fields you find on a digital record will almost always include a CIL / IG / AE / SEG number. That is the digital project saying: "for the authoritative print transcription, see here." Any ambiguity in the digital text usually traces back to a disagreement among these printed authorities.
读任何数据库时都要记住:数字记录的 publication 或 idno 字段几乎总会包含一个 CIL / IG / AE / SEG 号码。这是数字项目在说:"关于这块石头的权威印本转写,见此。"数字文本中的任何模糊往往可追溯到这些印本权威之间的分歧。
Both are the same physical stone — CIL X 7296 = IG XIV 297. Same object, two corpora, two editorial styles, ~7 years apart in publication. → Full retention/loss analysis in the Case Study
两者是同一块石头,CIL X 7296 = IG XIV 297。同一对象,两套集成,两种编辑风格,相隔约 7 年出版。→ 完整保留/丢失分析在案例版
A real DDbDP record (cde.85.247 / TM 140713) — a tax receipt from the Fayum, dated by the regnal year of Antoninus Pius. Its body shows a different style of EpiDoc: dense Leiden mark-up, expansion tags, certainty markers, all packed into eight lines.
真实 DDbDP 记录(cde.85.247 / TM 140713)—— 来自法雍的一份税收回执,以安东尼努斯・庇乌斯在位年份定年。其本文展示了 EpiDoc 的另一风格:密集的 Leiden 标记、缩略展开、置信度标记,挤在八行之内。
<ab>
<lb n="1"/>ἔτους ἐβδ<supplied reason="lost">ό</supplied>μου
<unclear>Ἀ</unclear><supplied reason="lost">ν</supplied>
<unclear>τω</unclear>νείνου
<lb n="2"/><choice><reg>Καίσαρος</reg><orig>Καίσαξος</orig></choice>
τοῦ κυρίου Μεχεὶρ <num value="23"><hi rend="supraline">κγ</hi></num>
<lb n="3"/><expan>διέγρ<ex>αψε</ex></expan> Ἀρείῳ
<expan>ἐγλ<unclear>ή</unclear>μπ<ex>τορι</ex></expan>
<choice><reg><expan>χειρωναξίο<ex>υ</ex></expan></reg>
<orig><expan>χειροναξίο<ex>υ</ex></expan></orig></choice>
</ab>
What's happening line-by-line:
逐行解读:
<lb n="1"/> — line break, this is line 1.<lb n="1"/>,换行,本行为第 1 行。<supplied reason="lost">ό</supplied> — the editor inserts the omicron because the papyrus is broken there.<supplied reason="lost">ό</supplied>,编辑补 ό,因为这里莎草纸已残。<unclear>Ἀ</unclear> — the alpha is on the papyrus but only partially legible.<unclear>Ἀ</unclear>,α 在纸上但仅部分可读。<choice><reg>Καίσαρος</reg><orig>Καίσαξος</orig></choice> — the scribe wrote "Καίσαξος" (a slip), the editor regularises to "Καίσαρος."<choice><reg>Καίσαρος</reg><orig>Καίσαξος</orig></choice>,抄手误写为 "Καίσαξος",编辑规范为 "Καίσαρος"。<num value="23"><hi rend="supraline">κγ</hi></num> — Greek numeral κγ (= 23) with an overline (the diagnostic mark for "this is a number, not a word").<num value="23"><hi rend="supraline">κγ</hi></num>,希腊数字 κγ(= 23),上有横线("这是数字而非单词"的标记)。<expan>διέγρ<ex>αψε</ex></expan> — abbreviated as "διέγρ" on the papyrus; editor expands to "διέγραψε" (he/she paid).<expan>διέγρ<ex>αψε</ex></expan>,纸上缩写为 "διέγρ";编辑展开为 "διέγραψε"("已支付")。Three layered information types in eight lines: the surface text (what the scribe wrote), the editor's reading (what they think the scribe meant), and the editorial certainty per character (lost / unclear / present). Translate this to flat text and you collapse all three. This is exactly what an aggregator like EDCS does to its records.
八行内三层信息:表面字(抄手所写)、编辑的释读(认为抄手"应该"写的)、按字符的编辑置信度(缺/不清/在)。把它转换成纯文本就把三层都塌成一层。这正是 EDCS 等聚合器对其记录所做的。
Different inscription genres demand different fields. EpiDoc encodes "what kind of text" via <rs type="textType"> in the title and <keywords> in the textClass. Here are the five most common in Mediterranean epigraphy:
不同体裁的铭文需要不同字段。EpiDoc 通过标题中的 <rs type="textType"> 与 textClass 中的 <keywords> 来编码"是什么类型的文本"。地中海铭文学中最常见的五类:
| Genre体裁 | Hallmark特征 | Example database tags数据库标签示例 | Local sample本地样本 |
|---|---|---|---|
| Building inscription | commemorates construction or restoration纪念建造或修复 | textType="building" | IRCyr C24400 (entablature) |
| Dedication | offering to a god, honour, or person奉献给神、荣誉或人 | textType="dedication" + persName type="divine" | IGCyr 002200 (kothon graffito) |
| Epitaph | funerary; names the deceased墓志;命名逝者 | textType="epitaph" · EDH "tit. sep." | EDCS has thousandsEDCS 中数以千计 |
| Caption | labels an image (mosaic, fresco)为图像作题(马赛克、壁画) | textType="caption" + iconography tags | IRCyr A.1 (Noah caption) |
| Honorific | honours a public figure or benefactor为公共人物或恩主立铭 | textType="honorific" + persName "honoree" | IAph 010001 (J. Caesar reference) |
| Workshop sign | advertises craft services广告匠铺业务 | textType="workshop" (rare) | I.Sicily ISic000470 (CIL X 7296) |
| Curse tablet | defixio; private ritual咒符;私人仪式 | textType="curse" · TAM has many | — not in local sample —— 本地样本中无 — |
| Document | contract, receipt (mostly papyri)合同、回执(多为莎草纸) | DDbDP "document" | cde.85.247 (tax receipt) |
What this changes about the database choice: if you ask a question by genre, choose a database that tags genre. EDH has a strong genre vocabulary; EDCS has weaker but still has inscr_type; PHI has none — you'd have to infer genre from the text itself. Genre questions are also where TEI EpiDoc projects' textType + iconography combination really shines.
这如何影响"该用哪个库":若问题以体裁为单位,应选体裁标签强的库。EDH 体裁词表较强;EDCS 较弱但仍有 inscr_type;PHI 无,只能从文本推断。体裁类问题也最能展现 TEI EpiDoc 项目 textType + iconography 标签组合的价值。
Bilingual inscriptions are common around the Mediterranean: Greek + Latin, Punic + Latin, Hebrew + Greek. EpiDoc handles them with parallel <div type="edition" xml:lang="…"> blocks, one per language. Aggregators handle them less gracefully.
地中海地区双语铭文很常见:希腊文+拉丁文、布匿文+拉丁文、希伯来文+希腊文。EpiDoc 用并列的 <div type="edition" xml:lang="…"> 区块(每种语言一份)来处理。聚合器处理得不那么优雅。
| Database数据库 | Bilingual handling双语处理 | Loss丢失 |
|---|---|---|
| I.Sicily | Two <div> blocks, one per language; full editorial markup on each两个 <div> 块,每语种一个;各自完整编辑标记 | none |
| EDCS | Both halves concatenated into one inscription field; the // separator marks the boundary两半并入同一 inscription 字段;以 // 分隔 | structural separation lost |
| EDH | Single transcription field; primary language flagged in language单一转写字段;以 language 标主语种 | secondary language often demoted |
| PHI | Records the Greek face only; the Latin half goes to a separate database (or no record)只录希腊面;拉丁面去别处(或无记录) | Latin half lost |
| EDR | Both halves; Italian-language commentary explains the bilingualism两半皆录;意大利语 commento 解释双语现象 | structural distinction in prose only |
| OSU KB | Photograph; no transcription; bibliographic relation field cites both names照片;无转写;bibliographic relation 字段同时引两个名 | no text at all |
The case study you've already met (CIL X 7296 / IG XIV 297): a bilingual stonecutter's sign. I.Sicily holds it as one record with two <div> blocks; PHI splits it into two records (PHI 175744 + PHI 140601, neither linking to the other); EDCS holds both halves as one row with a // separator. The same physical stone is data-modelled four different ways.
你已熟悉的案例(CIL X 7296 / IG XIV 297):一块双语石匠铺招牌。I.Sicily 用一条记录、两个 <div> 块容纳;PHI 拆成两条记录(PHI 175744 + PHI 140601,互不链接);EDCS 把两半放在同一行,用 // 分隔。同一块石头,被四种不同方式建模。
Practical implication: if you query "how many bilingual inscriptions in EDCS?" you have to write a regex that detects the // separator. If you query "how many bilingual in PHI?" you can't — PHI's records are language-specific, so the two halves of one stone count as two separate records.
实际意味:若问"EDCS 中有多少双语铭文",得写正则识别 // 分隔符。若问"PHI 中有多少双语",则无法,PHI 记录按语种切分,同一块石头的两半被计为两条独立记录。
If you work with inscriptions, you will sooner or later run into these adjacent projects. They share the same XML schema and bracket conventions; the data flows between them with surprisingly little glue.
做铭文研究迟早会撞上这些邻居项目。它们共享同一套 XML 模式与括号惯例;数据在它们之间几乎不需要"胶水"就能流动。
A look at how the papyri.info portal stitches together DDbDP texts, HGV metadata, and APIS images. The same architectural pattern is what inscription federation projects (EAGLE) aspire to.
看 papyri.info 如何把 DDbDP 文本、HGV 元数据、APIS 图像缝合在一起。这正是铭文联邦项目(EAGLE)所向往的架构。
USER PAPYRI.INFO BACKEND
────── ──────────────────
┌────────────────────┐
┌────────────┐ HTTP GET │ Solr index │
│ Browser │ ───────────────────▶ │ (full-text + IDs) │
└────────────┘ └─────────┬──────────┘
│
resolves TM-ID against:
│
┌──────────────────────┼─────────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ DDbDP repo │ │ HGV repo │ │ APIS repo │
│ TEI XML │ │ TEI XML │ │ IIIF + DC │
│ (text) │ │ (metadata) │ │ (images) │
└──────────────┘ └──────────────┘ └──────────────┘
Github / Duke Heidelberg many universities
The trick: the user search and the data store are decoupled. Solr indexes everything (text + metadata + image presence) and resolves results by TM-ID. Whichever repository physically holds a given record, the portal links you to it. Editing an XML file in the GitHub repository feeds back into the portal within minutes.
关键点:用户检索与数据存储是解耦的。Solr 索引 一切(文本+元数据+图像存在性),按 TM-ID 解析结果。无论某条记录实际由哪家机构保存,门户都会把你链向它。在 GitHub 仓库编辑一份 XML 文件,几分钟后改动即在门户上可见。
What this required: (1) every record must have a TM-ID; (2) all three repositories must use the same identifier scheme; (3) the index must be rebuilt when source files change; (4) one institution must own the portal's uptime. Inscriptions don't have all four yet. EAGLE has aspired to (1) and (2); (3) and (4) are still in negotiation.
这要求:(1) 每条记录必须有 TM-ID;(2) 三个仓库必须共用同一标识符方案;(3) 源文件变动时索引必须重建;(4) 必须有一家机构对门户在线率负责。铭文界尚未同时具备这四点。EAGLE 已争取 (1)(2);(3)(4) 还在谈。
A real concern when working across projects: not all TEI is the same TEI, and not all EpiDoc is the same EpiDoc. The schema has gone through several major revisions. A query that works on one project's records may fail on another.
跨项目工作时的真实问题:不是所有 TEI 都是同一种 TEI,也不是所有 EpiDoc 都是同一种 EpiDoc。schema 已历经数次大修。在某项目记录上能用的查询,到另一项目可能失败。
| Version版本 | Year年份 | What changed变化 | Local sample本地样本 |
|---|---|---|---|
| TEI P4 | 2002 | DTD-based; original EpiDoc DTD基于 DTD;最初的 EpiDoc DTD | tei-epidoc.dtd in your folder |
| TEI P5 | 2007 | RNG; major restructuring; namespaces cleanRNG;大重构;命名空间清晰 | most current files |
| EpiDoc Guidelines 8 | 2010 | migration to TEI P5; many epigraphic projects converted around this time迁移至 TEI P5;许多铭文项目在此时前后转换 | IRT0001 revisionDesc records "2010-08-18 Converted from TEI P4 to P5" |
| EpiDoc Guidelines 9 | 2018 | refinements; explicit handling of multilingualism, certainty精细化;多语种、置信度的显式处理 | IGCyr files cite v8; IRT now cites v9.3 |
| EpiDoc Guidelines 10 | ~2024 | in development; better integration with linked-data vocabularies开发中;与关联数据词表更好整合 | not yet seen in your sample |
Practical impact: when you write a query that depends on a specific tag (e.g. <origDate notBefore="…">), it works on EpiDoc 8/9 but might fail on TEI P4 records (which use a different attribute). The honest move is to declare which schema version your query targets, and version-check before running.
实际影响:若查询依赖某具体标签(如 <origDate notBefore="…">),在 EpiDoc 8/9 上可用,但可能在 TEI P4 记录上失败(属性不同)。诚实做法是显式声明查询所针对的 schema 版本,并在运行前做版本检查。
A TEI XML file is hard to read directly. The ecosystem of tools that turn EpiDoc XML into HTML, into a reading interface, into a IIIF viewer, into a search index, into model training data — is what makes the data usable.
直接读 TEI XML 并不容易。把 EpiDoc XML 变成 HTML、阅读界面、IIIF 查看器、搜索索引、模型训练数据,这一整套工具生态,才让数据真正可用。
| Your question你要问的 | Start here从这里开始 | Why为什么 |
|---|---|---|
| "How often does X appear in Latin inscriptions?""X 在拉丁铭文中出现多频繁?" | EDCS | Largest Latin coverage; conservative + interpretive cleaning; partner_link bridges拉丁覆盖最广;含保守 + 解释清洗;有 partner_link |
| "Distribution of inscription types across Roman provinces""罗马行省内的铭文类型分布" | EDH | Province-aware metadata; Heidelberg keywords vocabulary行省感知的元数据;Heidelberg 关键词体系 |
| "Greek string search across all known Greek inscriptions""在所有已知希腊铭文上做希腊字串搜索" | PHI | Largest Greek text coverage; opaque numeric IDs; no images希腊文本覆盖最大;ID 不透明;无图 |
| "Material + dimensions + lemma analysis on a regional corpus""区域语料的材质 + 尺寸 + lemma 分析" | regional EpiDoc | TEI EpiDoc deep encoding; structured fields, citable per inscriptionTEI EpiDoc 深度编码;字段结构化、按条目可引 |
| "Comparison with Italian regional inscriptions""与意大利区域铭文做对比" | EDR | Italian-language metadata; partner of EAGLE federation意大利语元数据;EAGLE 联盟成员 |
| "Train a model on Greek epigraphic text""用希腊铭文文本训练模型" | iPHI / Ithaca | Pre-tokenized, deduplicated, train/val/test splits已切分 token、已去重,含训练 / 验证 / 测试集 |
| "Cross-reference an inscription with documentary papyri""把铭文与莎草纸文献交叉引用" | papyri.info + Trismegistos | TM-ID is the join key; APIS images on the same recordTM-ID 即联接键;APIS 图像同条记录可见 |
| "Squeeze archives + photographs for a specific stone""某块石头的压模档案 + 照片" | EDH (squeezes) + UbiErat Lupa (photos) + OSU KB | Squeeze archives are EDH's specialty; lupa.at is image-first; OSU CEPS for selected stones压模档案是 EDH 的专项;lupa.at 以图为本;OSU CEPS 选录若干石 |
Honest practice: for non-trivial questions, you will end up using two or three databases together. Plan the join before you plan the analysis.
诚实的做法:遇到非平凡的问题,你最终会同时用两到三个数据库。先把联接规划好,再去做分析。
| Database数据库 | Family家族 | Encoding编码 | ~Records条数 | Image图像 | Licence授权 |
|---|---|---|---|---|---|
| EDCS | 1 Aggregator | flat record | ~537,000 | 26.3 % | site |
| EDH | 1 Aggregator | structured | ~82,000 | photo + squeeze | CC BY-SA |
| Ubi Erat Lupa | 1 Aggregator (image) | image-first | ~50,000+ | core asset | site |
| I.Sicily | 2 Regional EpiDoc | TEI EpiDoc | ~3,500 | IIIF | CC BY 4.0 |
| IRT | 2 Regional EpiDoc | TEI EpiDoc | ~1,618 | JPG | CC BY 2.0 |
| IRCyr | 2 Regional EpiDoc | TEI EpiDoc | ~2,360 | JPG | CC BY 2.0 |
| IGCyr / GVCyr | 2 Regional EpiDoc | TEI EpiDoc | ~1,500 | selective | CC BY-NC 4.0 |
| IAph 2007 | 2 Regional EpiDoc | TEI EpiDoc + lemma | ~1,489 | JPG | CC BY 2.5 |
| IOSPE | 2 Regional EpiDoc | TEI EpiDoc bilingual | varies | selective | CC BY 4.0 |
| InsAph / Vindolanda | 2 Regional EpiDoc | TEI EpiDoc | ~700 | multispectral | CC BY-NC 4.0 |
| AshLI | 2 Regional EpiDoc | TEI EpiDoc | ~250 | IIIF | CC BY-SA |
| MAMA XI | 2 Regional EpiDoc | TEI EpiDoc | ~400 | selective | CC BY-NC-SA |
| PHI Greek | 3 Text-only | plain text | ~250,000 | none | site |
| McCabe corpora | 3 Text-only | plain text | tens of K | none | site |
| EDR | 4 EAGLE | structured / IT | ~80,000 | JPG | EAGLE |
| EDB | 4 EAGLE | structured | ~40,000 | JPG | EAGLE |
| EDV | 4 EAGLE | structured | varies | JPG | EAGLE |
| DDbDP | 5 Papyri | TEI EpiDoc | ~60,000 | linked to APIS | CC BY 3.0 |
| HGV | 5 Papyri | TEI EpiDoc (metadata) | ~70,000 | — (metadata only) | CC BY 3.0 |
| APIS | 5 Papyri | institutional | ~60,000 | core asset | varies |
| DCLP | 5 Papyri | TEI EpiDoc | ~10,000+ | linked | CC BY 3.0 |
| Trismegistos | 5 Papyri (resolver) | relational + RDF | ~800,000+ | — (joins only) | site / CC |
| papyri.info | 5 Papyri (federated) | TEI + JSON-LD | federated | via APIS | CC BY 3.0 |
| iPHI / Ithaca | 6 ML | tokenized text + labels | ~178,000 (post-dedup) | none | Apache 2.0 / CC BY |
| OSU KB / CEPS | institutional photo | Dublin Core | limited | single JPG / item | institutional |
| Pleiades | place gazetteer | JSON-LD | ~40,000 places | — (places) | CC BY 3.0 |
| EAGLE network | federation portal | resolver + vocabularies | n/a | n/a | varies |
Counts are approximate (most rows update over time). The "Family" column lets you read this table as a folded version of the previous slides.
条数为近似值(多数随时间更新)。"家族"一列让你能把此表读作前几张幻灯的折叠版。
Where next: the Database Literacy tutorial walks one stone across six landings; the Case Study shows seven concrete data-disparity issues; the Visual edition introduces the SDAM ETL pipelines that transform aggregator data into analysable parquet; the Reference covers all 37 SDAM repositories.
下一站:数据库识读 把一块石头沿六个落点走一遍;案例 展示七个具体的数据差异问题;视觉版 介绍 SDAM ETL 管道;参考版 覆盖 SDAM 全部 37 个仓库。
Format observations on this atlas were drawn directly from sample TEI XML files in the local epidoc folder (Aphrodisias, IRT, IRCyr, IGCyr, IOSPE, DDbDP) and from the SDAM ETL bulk dumps for EDH and EDCS.
本图中的格式观察直接取自本地 epidoc 文件夹中的 TEI XML 样本(Aphrodisias、IRT、IRCyr、IGCyr、IOSPE、DDbDP),以及 SDAM ETL 中 EDH 与 EDCS 的批量导出。