A single Sicilian inscription appears in six places on the web. Each version remembers something different. Each forgets something different. This tutorial walks the gap between the slab in Palermo and the row in your CSV.
同一块西西里铭文在网络上有 六 个落点。每一个版本记得的东西都不一样,遗忘的也不一样。本教程沿着这条裂缝走,从巴勒莫的那块石板,走到你 CSV 文件里的那一行。
Use ← → to navigate · click any underlined term for a definition · 点击任何带下划线的术语查看注释
使用 ← → 翻页 · 点击任何带下划线的术语查看注释
Every line in your CSV is the residue of a long chain. Stone → squeeze or photograph → printed transcription (CIL, IG, AE) → re-keyed into a database → cleaned by regex → exported as JSON → loaded into pandas. Each step carries a quiet loss. The closer you get to the stone, the fewer things you can do at scale; the farther you get from it, the more you can compute and the less you actually know.
CSV 中的每一行都是一条长链的残余。石头 → 压模或摄影 → 印本转写(CIL、IG、AE)→ 重新键入数据库 → 用正则清洗 → 导出为 JSON → 加载进 pandas。每一步都有一份默默的丢失。越靠近石头,可以批量做的事越少;越远离石头,能算的越多,可你实际知道的越少。
The thesis of this tutorial: "the digital corpus" is not one thing. It is a federation of partial views, each shaped by what its makers prioritised. Knowing each view's shape is the first move in honest computational epigraphy.
本教程的命题:所谓"数字化的语料库"不是一个东西。它是一个 由若干局部视角组成的联邦,每个视角都被其制作者的优先取向所塑形。看清每个视角的形状,就是诚实的"计算铭文学"的第一步。
A small marble plaque (~30 × 47 cm) once mounted at a Sicilian stonecutter's workshop, advertising its services in both Latin and Greek. Probably 1st century CE.
一块小型大理石板(约 30 × 47 厘米),曾立于一家西西里石匠铺前,用 拉丁语与希腊语 双语为店主招揽生意。年代大约公元 1 世纪。
Provenance: Palermo, Sicily. The plaque doubles as a sample of the workshop's lettering — a sign cut by the cutter, advertising the cutter.
出土:意大利西西里,巴勒莫。这块招牌同时也是店家手艺的样品,由石匠亲手刻下,给路人示范他能刻什么。
Tituli heic ordinantur et sculpuntur, aidibus sacreis, qum operum publicorum. στῆλαι ἐνθάδε τυποῦνται καὶ χαράσσονται ναοῖς ἱεροῖς σὺν ἐνεργείαις δημοσίαις.
"Inscriptions are designed and cut here, [for] sacred buildings, with [those] of public works." Same content, twice, for two reading communities.
"此处可设计、雕刻铭文,[供]神庙[使用],并兼接公共工程之刻字。"内容一样,写两遍,给两批读者。
Why this stone? It is unusually well-attested — it appears in five online databases plus a university photograph repository — and unusually well-suited to the comparison, since both halves of the bilingual text foreground the act of cutting itself.
为何选这块?它在网上的存在感 异常密集:五个数据库加一座大学照片库都收录了它,且双语两端都把"刻字"这一动作摆在最前面,恰好适合用来谈"数字版本之间的差异"。
An inscription database is not a digital copy of a stone. It is a columnar table: rows are inscriptions, columns are decisions about what to record. Every column reflects an editorial bet — "we believe province / language / dating window / material / findspot are queryable; so we will fix a controlled vocabulary for them and ignore the rest."
铭文数据库不是石头的数字副本。它是一张 列状表格:行是铭文,列是关于"该记录什么"的决定。每一列都包含一种编辑式赌注,"我们认为省份/语言/年代窗口/材质/出土地是可查询的字段;那就给它们规定受控词表,其余的就忽略。"
Autopsy — physically examining the stone — sits behind every "good" record. But autopsy doesn't scale. So databases trade depth for breadth: they record what they can encode, then move on.
实地察看(亲自去看石头)是每一条"好记录"背后的源头。但实地察看不能批量。于是数据库选择以广度换深度:能 编码 的就记下来,剩下的就略过。
This is a 200-year-old trade-off. Mommsen's CIL (1853–) was already trading editorial depth (autopsy + apparatus + bibliography) for systematic coverage of the Latin world. Modern databases inherit the trade-off and adjust the ratio. → Full lineage这是一项 200 年的取舍。Mommsen 的 CIL(1853–)已在以"编辑深度(实地察看 + 校勘 + 参考书目)"换"对拉丁世界的系统覆盖"。现代数据库继承了同一取舍,只调整比例。→ 完整谱系
Click each card to open that database's record in a new tab. Notice that even the URLs disagree: each project has chosen its own ID system, and there is no master index that resolves them all.
点击每张卡片在新标签页中打开该数据库的记录。注意连 URL 都互不相同:每个项目都自定义了一套 ID 系统,目前没有一个总索引能把它们全部串起来。
Plus a meaningful absence: this stone is not in the EDH, the most photo-rich Roman-provincial database. Why? Because EDH's geographical scope stops at the Roman provinces; Italy itself is technically out-of-scope. The richest database is also the wrongest one to consult here.
外加一个有意义的缺席:这块石头 不在 EDH 当中,而 EDH 是图像最丰富的罗马行省数据库。为什么?因为 EDH 的地理范围只收"行省",意大利本土不在其中。图像最丰富的那个库,恰恰是查这块石头时最不该用的库。
photo field with concatenated JPG URLs (no IIIF, no zoom)inscription, conservative_cleaning, interpretive_cleaningDisparity: I.Sicily knows the dimensions of the slab and lets you zoom into a single letter. EDCS records 36× more inscriptions but doesn't know how big any of them is. Both can be useful — for different questions.
差异:I.Sicily 知道这块板有多大,并允许你缩放到单个字母。EDCS 收录了多 36 倍的铭文,却不知道任何一块石头的尺寸。两者各有所长,取决于你问的是什么问题。
SEG 50.1016 entry)This is normal, not pathological. PHI ingests records from many publication series (IG, SEG, AE, journal articles). When the same stone is published in two different series, PHI ingests it twice — under two PHI IDs. There is no automatic dedup. If you query "how many Greek inscriptions mention τυποῦνται" you will double-count this one.
这是常态,不是事故。PHI 从多套出版物(IG、SEG、AE、期刊论文)抓取记录。当同一块石头被两套出版物各自著录,PHI 就把它抓两次,给两个 PHI ID。没有自动去重。如果你查"希腊文铭文中提到 τυποῦνται 的有多少件",这一块会被 计两次。
PHI is the canonical Greek epigraphy text database. Strong on findable text, weak on everything else. Counts done on PHI alone — without joining to a richer record — are unreliable.
PHI 是希腊铭文的标准文本数据库。它的强项是 可检索的文本,弱项是 除文本以外的一切。仅在 PHI 上做计数,不与更丰富的记录联接,结果是不可靠的。
Lesson: Some "records" of an inscription are not records of inscriptions at all — they are records of photographs, kept by libraries on photo-management terms (Dublin Core, DSpace, ARK identifiers). The bibliography lives outside the photograph; the photograph lives outside the database. Joining them up requires you, the human reader, to do the work.
教训:一些"铭文记录"其实根本不是铭文记录,它们是 照片记录,被图书馆按照"图像管理"的逻辑保存(Dublin Core、DSpace、ARK 标识符)。文献列表在照片之外,照片在数据库之外。把它们串起来的工作,要由你,真人读者,自己来做。
The Epigraphic Database Heidelberg (EDH) is the most photo-rich Latin epigraphy database, with high-resolution images and squeeze archives. But its scope is the Roman provinces — the imperial territory ruled from Rome. Italy itself, including Sicily, is technically considered not a province but the heartland, and so is normally excluded from EDH's records.
海德堡铭文数据库(EDH)是图像最丰富的拉丁铭文数据库,有高分辨率照片和压模档案。但它的 范围 是 罗马行省:即罗马所统治的帝国领土。意大利本土(包括西西里)按惯例 不算行省,而是被视为本土核心,因此通常不被 EDH 收录。
Verified empirically. A search of the EDH bulk dump (EDH_attrs_cleaned_2023-04-26.json, 81,883 records) for the strings CIL 10, 7296, CIL X 7296, ISic000470, IG XIV 297, IG 14, 297 yields zero hits. The stone is not there.
经实证核查。在 EDH 批量导出(EDH_attrs_cleaned_2023-04-26.json,共 81,883 条记录)中搜索 CIL 10, 7296、CIL X 7296、ISic000470、IG XIV 297、IG 14, 297,零结果。这块石头确实不在其中。
This is the first lesson of database literacy: the absence of a record is itself information. If you query "where are inscriptions about stonecutters?" against EDH alone, you will conclude — wrongly — that there were none in the Italian heartland. The conclusion is an artefact of EDH's scope, not of antiquity.
这是数据库识读的第一课:记录的缺席本身就是信息。如果你只在 EDH 上查"提到石匠的铭文都在哪里",你会得出,错误的,结论:"意大利本土没有这样的铭文。"这一结论不是古代的事实,而是 EDH 范围的副作用。
| Field字段 | I.Sicily | EDCS | PHI ×2 | EDR | OSU KB | EDH |
|---|---|---|---|---|---|---|
| Latin transcription拉丁转写 | ✓ | ✓ | — | ✓ | — | — |
| Greek transcription希腊转写 | ✓ | ✓ | ✓ | ✓ | — | — |
| Diplomatic text外交本 | ✓ | conservative | — | partial | — | — |
| Interpretive text解释本 | ✓ | ✓ | ✓ | ✓ | — | — |
| Photograph照片 | ✓ IIIF | JPG URL | — | JPG | JPG (only) | — |
| Dimensions尺寸 | H/W/D + letter h. | — | — | partial | — | — |
| Material材质 | marble | lapis | — | marmo | — | — |
| Findspot lat/long出土地经纬度 | via Pleiades | 38.1115, 13.3516 | — | via place | — | — |
| Date range年代范围 | −100 / +100 | −100 / +100 | verbal | numeric | verbal | — |
| Bibliography参考文献 | structured | in publication field | header only | structured | free-text | — |
| Cross-IDs (partner_link)跨库 ID | via TM, Pleiades | explicit field | — | in commento | — | — |
| Open licence开放授权 | CC BY 4.0 | site terms | site terms | site terms | CC BY | — |
Read down a column: that's what one database knows. Read across a row: that's how unevenly one fact is recorded across databases. Notice that no single database has every column filled. Compositional knowledge requires querying many at once.
沿一列竖读:单一数据库知道什么。沿一行横读:同一事实在不同数据库中被记录得多么不均。注意 没有任何一个数据库把每一格都填满。要拼出"组合性的知识",必须同时查询多个。
→ For the family taxonomy behind these columns:对应六大家族分类: Atlas · Six families · Atlas · Comprehensive 28-row table
"EDCS and EDH provide images" is true at the database level. But within those databases, image coverage is sparse. We can quantify this directly from the bulk data dumps.
"EDCS 和 EDH 提供图像",在数据库层面这是对的。但在数据库内部,图像覆盖率非常稀疏。我们可以直接从批量导出数据中量化这一点。
Three out of four EDCS records have no image at all. For 396,000 inscriptions, the database is simply text — no visual control on the editor's reading.
EDCS 中四分之三的记录 没有 任何图像。对于 39.6 万条铭文,数据库就是纯文本,编辑者的释读没有任何视觉佐证。
EDCS records the inscription's text three times in three different fields. Each is a different distance from the stone's actual surface. Reading them side by side teaches you what "the text" means in epigraphy: it depends on which version you took.
EDCS 把铭文文本以三种不同方式存了三遍,分别置于三个字段中。每一种距离石头表面的远近不同。三种并排读,能学会"铭文文本"在铭文学中究竟意味着什么,取决于你拿了哪一版。
Question for the class: if you ran a regex search for the exact string "heic" across the EDCS interpretive-cleaning column, you would miss this inscription. Is that an EDCS bug, or a feature?
课堂提问:如果你在 EDCS 的"解释本"列上用正则搜 "heic" 这个原拼写,会漏掉这条记录。这是 EDCS 的错,还是它的特性?
Each database invented its own ID system. To find this stone elsewhere, you have to know all of them. EDCS makes a partial attempt: in its partner_link field it writes out the inscription's identifiers in five other systems.
每个数据库都自定义了一套 ID 系统。要在别处找到这块石头,你必须 同时知道 所有这些 ID。EDCS 部分尝试解决这个问题:在它的 partner_link 字段中把铭文在其他五种系统中的标识符列出来。
https://edcs.hist.uzh.ch/en/ N140617; → EDR 140617 cil10_p758; → CIL 10 page 758 (the printed CIL volume) P140601; → PHI 140601 is000470; → I.Sicily ISic000470 t2016.031.04; → Tyche 2016 article de-b0810; → DAI Berlin photo archive oh_99008; → OSU Knowledge Bank handle 1811/99008 bu_CIL&volume=10&insc=7296 → Berlin CIL backend ref
What this field does not do:
这个字段 没 做的事:
Trismegistos aspires to be the universal join key — a single TM-ID that resolves across all databases — but coverage is uneven and not every record carries one. EAGLE was meant to harmonise; in practice each federation member kept their conventions.
Trismegistos 试图扮演"通用联接键"的角色,一个能在所有数据库间解析的 TM-ID,但覆盖参差不齐,并非每条记录都有。EAGLE 联盟原本希望统一,实际上每个成员都保留了自己的旧惯例。
EDCS records this stone's publication field as a single string of equal-signs:
EDCS 把这块石头的 publication 字段记录为一连串以等号串联的引用:
CIL 10, 07296 = D 07680 = IG-14, 00297 = IGRRP-01, 00503 = ZPE-130-239 = ZPE-177-131 = SEG-50, 01016 = AE 1982, 00417 = AE 1999, 01543 = AE 2000, 00643 = AE 2011, +00437 = AE 2012, +00014 = ZPE-205-145 = Tyche-2016-54 = AE 2018, +00130 = AE 2018, +00756
That is the publication history of one stone, compressed. Reading top to bottom is reading forward through 130+ years of scholarship:
这是一块石头的出版史的压缩版。从上往下读,就是顺着 130 多年的学术史前进:
| Citation引用 | What it is是什么 | Date年代 | Distance from stone距石头 |
|---|---|---|---|
CIL 10, 7296 | CIL volume 10 — Mommsen's printed Latin transcription | 1883 | closest (autopsy) |
IG XIV, 297 | IG volume 14 — Kaibel's printed Greek transcription | 1890 | closest (autopsy) |
D 7680 | Dessau, ILS — selected reprint with light commentary | 1892–1916 | +1 step |
IGRRP I, 503 | Cagnat, Inscriptiones Graecae ad Res Romanas Pertinentes | 1911 | +1 step |
ZPE 130, 239 | Kruschwitz article — sprachliche Anomalien | 2000 | discussion only |
SEG 50, 1016 | Supplementum Epigraphicum Graecum — citation aggregator | 2003 | discussion only |
ZPE 177, 131 | Tribulato article — bilingual stonecutter sign | 2011 | discussion only |
AE 2018, 130 | L'Année Épigraphique — yearly citation roundup | 2018 | citation about citation |
Most modern citations refer to the inscription via previous editions, not via fresh autopsy. By 2018, an article citing AE 2018 §130 is two or three steps removed from anyone who actually saw the stone.
绝大多数现代引用都是 经由 前人的版本来指这块石头,而非通过新一次的实地察看。到了 2018 年,一篇引 AE 2018 §130 的论文,与任何"亲眼见过石头的人"已隔了两到三层。
No epigraphic database covers everything. Each carved out a domain — by region, by language, by script, by period. Where a stone happens to land determines who notices it.
没有任何一个铭文数据库覆盖一切。每个都依据地区、语言、文字、时段划出自己的范围。一块石头落在哪里,决定了谁会注意到它。
| Database数据库 | Region/scope范围 | Language语言 | Approx. records记录数 | Includes our stone?收本案? |
|---|---|---|---|---|
| EDH | Roman provinces (excl. Italy) | Latin (mostly) | ~82,000 | no — Italy is out of scope |
| EDCS | any Latin inscription | Latin (any) | ~537,000 | yes |
| PHI | any Greek inscription | Greek | ~250,000 | yes (twice) |
| EDR | Italy + Italian islands | Latin (Italian-language metadata) | ~80,000 | yes |
| I.Sicily | Sicily only | any (Greek + Latin + Punic + Hebrew) | ~3,500 | yes |
| OSU KB / CEPS | OSU's photographic archive | any (photo only) | n/a | yes (one photo) |
The implication for queries: a question like "how often do Sicilian inscriptions mention public works?" is technically about all inscriptions found on Sicily — but no single database has them all. EDH is silent. EDCS has Latin Sicilian inscriptions but loses the bilingual halves. PHI has the Greek halves but loses the dating metadata. I.Sicily has both halves of some stones but only ~3,500 records total. The honest answer requires federating all of them — and accepting that joins will be partial.
对查询的含义:"西西里铭文中提及公共工程的频率有多高?",严格说这是一个关于 所有 西西里出土铭文的问题,但没有哪个单一库收齐了全部。EDH 不记。EDCS 有拉丁面,但希腊面被剥落。PHI 有希腊面,但年代元数据丢失。I.Sicily 二者皆有,但只覆盖约 3,500 条。诚实的答案要求 联邦化 这些库,并接受联接必定有缺口。
conservative_cleaning + interpretive_cleaning丢失:修复、标记符;产出:conservative_cleaning + interpretive_cleaningWhere you stand on the ladder dictates what you can see. A philological question (is "Qum" a graphical archaism for "cum"?) needs Step 4. A geo-statistical question (where do bilingual workshops cluster?) needs Step 7. Stepping up the ladder for a question it isn't built for produces wrong answers without warning.
你站在阶梯的哪一级,决定了你能看见什么。语文学问题("Qum"是否"cum"的字形古风?)需要第 4 级。地理统计问题(双语作坊在哪聚集?)需要第 7 级。把不该上层的问题上推一级,会得到错答,而且不会有警告。
Once an inscription becomes a row in EDCS or a TEI XML in I.Sicily, the row begins to circulate independently. It gets quoted in journal articles, mapped onto interactive viewers, ingested by ML training pipelines, and counted in distant-reading studies. At each step, the row carries less and less of the chain that produced it.
一块铭文一旦在 EDCS 中变成一行、或在 I.Sicily 中变成一份 TEI XML,这一行就开始独立流通。它被期刊论文引用,被嵌入交互式查看器,被机器学习训练管道吞下,被"远距阅读"统计研究计数。每一次流通,行内携带的"产生这行的那条链"就少一点。
A test: the next time you read a sentence like "There are X bilingual inscriptions from Roman Sicily", ask: which database produced X? Which fields did it count? Was the duplication into PHI 175744 / 140601 deduplicated? Did the count include I.Sicily's small set, or only EDCS? If the article doesn't say, the number is detached.
一个测试:下次读到 "罗马时期西西里岛共有 X 件双语铭文" 这样的句子,请追问:X 来自哪个数据库?数了哪些字段?PHI 175744 / 140601 这种重复条目去重了吗?是否纳入 I.Sicily 的小集?还是只用 EDCS?如果文章没说,这个数字就是脱钩的。
These are not gaps that better column choices would fix. They are the irreducible cost of trading depth for breadth. The honest digital epigrapher names them, then proceeds.
这些不是"加几列就能补上"的缺口,而是"以广度换深度"的不可化约成本。诚实的数字铭文学者会先命名它们,再继续工作。
If you ask each of the six landings "what does this stone say in Latin?", you get six slightly different answers. Mostly the differences are typographic — punctuation, line breaks, whether brackets are visible — but a few are substantive.
如果你向六个落点分别问"这块石头的拉丁面写了什么?",你会得到六个稍有出入的答案。多数差异是排版性的,标点、换行、是否保留方括号,但少数是实质性的。
| Source来源 | Line 1 of Latin face拉丁面第一行 | Treats "heic" as如何处理 "heic" | Treats "Qum" as如何处理 "Qum" |
|---|---|---|---|
| I.Sicily (diplomatic) | Tituli h{e}ic | preserved with editorial markers | preserved with marker |
| I.Sicily (interpretive) | Tituli hic | normalised to "hic" | normalised to "cum" |
EDCS (raw inscription) | Tituli / h{e}ic | preserved | preserved with <c=Q> |
| EDCS (conservative) | Tituli heic | "heic" preserved | "Qum" preserved |
| EDCS (interpretive) | Tituli hic | normalised | normalised |
| EDR (Italian record) | Tituli heic | preserved | preserved as "Qum" |
| OSU KB (no transcription) | — (only photograph) — | n/a | n/a |
The lesson: when you see "the EDCS text of CIL X 7296" in a paper, ask which EDCS text. There are at least three. The choice between them is editorial — and so are the conclusions you can draw from a search regex.
结论:当你在论文里读到"EDCS 中 CIL X 7296 的文本"时,请追问 哪一版 EDCS 文本。至少有三版。三选一是编辑判断,你能从一条正则得出的结论,也是。
Heřmánková, Kaše & Sobotková (JDH 2021) — the methodological paper behind the SDAM ETL pipelines — articulates three commitments that respond directly to the problems on the previous slides:
Heřmánková、Kaše 与 Sobotková(JDH 2021),SDAM ETL 管道背后的方法论论文,提出三项承诺,正面回应上述问题:
Every figure must be re-derivable from the same code and data. So readers can examine where database choices show up in conclusions.
每张图必须能由同一份代码与数据重新生成。让读者看清"数据库选择"在结论中如何显形。
Dating uncertainty must be propagated, not hidden. (Hence tempun Monte Carlo simulation over date ranges, instead of point estimates.)
日期不确定性必须向下传播,不能掩盖。(因此用 tempun 蒙特卡洛模拟来跑年代区间,而非点估计。)
Disagreement between EDH and EDCS is not noise to be averaged out — it's scholarly evidence about how the database was made.
EDH 与 EDCS 之间的差异不是要平均掉的噪声,它是关于 数据库是如何被造出来的 的学术证据。
Read the three Stance cards in Paper Edition § 7 for the original framing. The case study in Case Study Edition walks the same problem in seven concrete issues. The Databases Atlas classifies all the projects discussed here into six families.
原文表述见 论文版 § 7 的三张"立场"卡。案例版 把同一组问题以七个具体的"问题号"逐一展开。数据库地图 把这里讨论的所有项目归入六大家族。
Pick any inscription from EDH. Now try to find the same inscription in EDCS, PHI, EDR. How many of the four hold a record? How many hold an image?
从 EDH 中任挑一条铭文。然后试着在 EDCS、PHI、EDR 中分别找到这同一块。四个库中有几个收录?几个有图?
In EDCS's bulk dump, run regex search "heic" against (a) the raw inscription field, then (b) the interpretive_cleaning field. How many hits each? Why the difference?
在 EDCS 批量数据上分别对 (a) 原始 inscription 字段、(b) interpretive_cleaning 字段做正则搜索 "heic"。各得几个命中?差异从何而来?
Write a short paragraph (≤200 words) on what is lost by collapsing CIL X 7296's diplomatic, conservative, and interpretive transcriptions into one column for downstream analysis. Cite the EDCS field names you would prefer to keep separate, and why.
写一段 ≤200 字的短文,论 CIL X 7296 的"外交本/保守清洗/解释本"如果合并成一列做下游分析,会丢失什么。点名你希望保留为独立字段的 EDCS 列名,并说明理由。
Where to do this: Reference Edition § 4.5 (link) shows the actual SDAM Python loops you'd modify to run Exercise 2. The Case Study Issue 2 page has the side-by-side text fields for Exercise 3.
在哪做:参考版 § 4.5(链接)有真实的 SDAM Python 循环代码,可改写来跑练习 2。案例版 · 问题 2 已并排呈现练习 3 所需的文本字段。
Where next: the Case Study walks one inscription through seven concrete issues. The Visual edition introduces the SDAM ETL pipelines that produce the cleaned data. The Reference covers all 37 SDAM repositories. The Paper walks the JDH 2021 article that proposes the methodology.
下一站:案例版 把一块铭文沿着七个具体问题走一遍。视觉版 介绍 SDAM ETL 管道。参考版 覆盖 SDAM 全部 37 个仓库。论文版 沿着 JDH 2021 一文走它提出的方法论。
EDCS_merged_cleaned_attrs_2022-09-12.json · SDAM ETL public dump · 537,262 recordsEDH_attrs_cleaned_2023-04-26.json · SDAM ETL public dump · 81,883 recordsAll counts and field-level findings on the previous slides were computed directly against the bulk data dumps in the SDAM repository — not summarised from secondary sources.
前面页中的所有计数与字段级发现,均直接从 SDAM 仓库中的批量数据导出文件计算得来,并非援引二手资料。