SDAM ETL Case Study

One inscription, five databases.

A case study in why "treating inscriptions as data" is harder than it looks — using one bilingual marble plaque from Roman Sicily.

A small marble plaque, 15.5 × 24.5 × 3 cm, was first recorded around 1730 in the Museo Salnitriano in Palermo and now sits in the Museo Archeologico Regionale Antonino Salinas. It carries seven lines of Greek on the left and seven lines of archaic Latin on the right, separated by a deeply incised vertical rule. The text is — fittingly — an advertisement for a stonecutter's shop, in two languages, telling passers-by that "inscriptions are designed and carved here for sacred temples in connection with public works." The inscription literally advertises the production of inscriptions.

It is the perfect case study for what Heřmánková, Kaše & Sobotková (JDH 2021) are up against. This single object appears, with subtly different metadata, in every major Latin and Greek epigraphic database — except one. Walking through how each database represents it shows in concrete detail why building a clean, comparable, cross-database dataset is so much harder than the abstract ETL-pipeline diagrams suggest.

§ 1 The text

The inscription as carved on the stone. The Greek and Latin columns are not parallel translations but a loose "biversion" (two versions of one message), with awkward word-for-word renderings on both sides.

CIL X 7296 / IG XIV 297 / ISic000470 — bilingual stonecutter's sign from Palermo, photograph
The marble plaque, 15.5 × 24.5 × 3 cm, currently housed at the Museo Archeologico Salinas, Palermo. CIL X 7296 = IG XIV 297 = ISic000470 = EDCS-22000882 = EDR140617 = PHI 175744 = PHI 140601. Greek column on the left, Latin column on the right, separated by a deeply incised vertical groove. This is what every database row in this case study traces back to.
col. I · Greek
στῆλαι ἐνθάδε τυποῦνται καὶ χαράσσονται ναοῖς ἱεροῖς σὺν ἐνεργείαις δημοσίαις.
col. II · Latin
tituli heic ordinantur et sculpuntur aidibus sacreis qum operum publicorum.
"Stelai / inscriptions are designed and carved here, for sacred temples, together with [those of] public works." (Translation by Jonathan Prag, I.Sicily.)
A note on the Latin. The Latin column is in archaic spelling: heic (later hic), aidibus sacreis (later aedibus sacris), qum (later cum). The archaisms led 19th-century editors to date the plaque to the late Republic; modern paleographic and spelling analysis (Wilson 1990) puts it in the Augustan to Julio-Claudian period, while Manni Piraino preferred the late 2nd century CE. The dating disagreement matters: see Issue 3.

§ 1.5 From paper to digital: 140 years of editions

Before any of the modern databases existed, this inscription had already been published twice in the great 19th-century print corpora. Each editor made different choices. Each digitization that followed inherited some of those choices and silently dropped others.

The two editors: Theodor Mommsen (1817–1903), founder of CIL and 1902 Nobel laureate in literature, edited CIL X (1883), the volume covering southern Italy and Sicily. Georg Kaibel (1849–1901) edited IG XIV (1890), which collected the Greek inscriptions of Sicily, Italy, and the West. They were near-contemporaries working in the same Berlin academic ecosystem; their two editions of this stone disagree in small but legible ways. The disagreements propagated.

The chain stretches at least eight steps long: physical stone (1st c. CE) → 18th-century transcription (Torremuzza, Ignarra) → CIL X 7296 (Mommsen, 1883) → IG XIV 297 (Kaibel, 1890) → 20th-century revisions (Wilson, Manni Piraino, Bivona) → five modern digital databases → SDAM's cleaning pipelines → JDH 2021 paper. Look at the two print editions side by side:

Original print pages (scans)

CIL X 7296 — Mommsen 1883, scanned page
CIL X 7296. Mommsen, 1883. Latin face, with the editor's commentary in Latin prose above.
IG XIV 297 — Kaibel 1890, scanned page
IG XIV 297. Kaibel, 1890. Greek face, plus Kaibel's apparatus and his quotation of Mommsen's CIL discussion.

These are the actual pages — what every database in this case study ultimately quotes. Compare them with the cleaned facsimiles below: where the editorial discussion above the inscription gets dropped in modern databases, this is the prose lost.

7296 originis incertae, sed vix urbanae; immo Siculam originem prodit quod bilinguis est, cum litterae sint optimae aetatis marmorarium commendantes. Panormi olim apud Iesuitas, nunc in museo publico.

  ϹΤΗΛΑΙ | TITVLI
  ΕΝΘΑΔΕ | HEIC
  ΤΥΠΟΥΝΤΑΙ ΚΑΙ | ORDINANTVR ET
  ΧΑΡΑϹϹΟΝΤΑΙ | SCVLPVNTVR
  5 ΝΑΟΙϹ ΙΕΡΟΙϹ | AIDIBVS SACREIS
  ϹΥΝ ΕΝΕΡΓΕΙΑΙϹ | CVM OPERVM [stone: QVM]
  ΔΗΜΟϹΙΑΙϹ | PVBLICORVM

Recognovi. Torremuzza Pal. (1762) n. 37 (inde Ignarra de palaestra Neap. p. 51) = Sic. II, 1. Inde C. I. Gr. n. 5544; Orelli 4222. Marmorarius hic utriusque linguae infantiam prae se fert; nam ἐνέργεια de opere publico non magis recte dicitur quam recte se habet genetivus post cum praepositionem, quamquam aedes sacrae et opera publica sollemni significatione usurpantur.

Mommsen, CIL X 7296 (Corpus Inscriptionum Latinarum, vol. X, Berlin, 1883)
CIL X 7296 — Mommsen, 1883
297 Panormi in museo universitatis TORR. Lapis, qui nunc servatur in museo publico, est originis incertae; 'Siculam originem prodit, quod bilinguis est' MOMMSEN.

  ϹΤΗΛΑΙ | TITVLI
  ΕΝΘΑΔΕ | HEIC
  ΤΥΠΟΥΝΤΑΙΚΑΙ | ORDINANTVRET
  ΧΑΡΑϹϹΟΝΤΑΙ | SCVLPVNTVR
  5 ΝΑΟΙϹΙΕΡΟΙϹ | AIDIBVSSACREIS
  ϹΥΝΕΝΕΡΓΕΙΑΙϹ | CVMOPERVM
  ΔΗΜΟϹΙΑΙϹ | PVBLICORVM

Στῆλαι ἐνϑάδε τυποῦνται καὶ χαράσσονται ναοῖς ἱεροῖς σὺν ἐνεργείαις δημοσίαις. [normalized Greek]

Descripsi. Torremuzza Pán. p. 19, 37 (cf. p. 231), Sic. II 7; inde Ignarra de palaestra Neapol. p. 51 et Franz 5554. Accurate Mommsen CIL X 7296. Marmorarius nec Graecus opinor nec Romanus homo cum ab utriusque linguae peritis intelligi cuperet neutris satisfecit.

Kaibel, IG XIV 297 (Inscriptiones Graecae, vol. XIV, Berlin, 1890)
IG XIV 297 — Kaibel, 1890

Three things the print editions kept

  • Argumentation. Both editors record their reasoning: Mommsen argues "Siculam originem prodit quod bilinguis est, cum litterae sint optimae aetatis" — the Sicilian origin is shown by the bilingualism, the dating by the high quality of the letterforms. None of the modern databases preserves this reasoning chain. They give you a place and a date; the print editions give you why.
  • Provenance history. Mommsen writes "Panormi olim apud Iesuitas, nunc in museo publico" — formerly with the Jesuits in Palermo, now in the public museum. EDR has only "Palermo?" with a question mark. The 1762 → 1883 → modern chain is in the print but lost from the digital.
  • Editorial judgment. Mommsen calls the stonecutter's bilingualism infantia ("infancy / incompetence"). Kaibel demurs: "nec Graecus opinor nec Romanus homo" ("a man neither Greek nor Roman, in my opinion"). Two scholarly judgments coexist on the same artifact. The modern databases collapse both into a single normalized "type: epitaph" or "type: advertisement."

Three things the print editions changed (and the digital inherited)

  • The line "QVM OPERVM" became "CVM OPERVM." Both Mommsen (1883) and Kaibel (1890) silently regularize the archaic QVM on the stone to CVM: in the CIL facsimile above, the sixth line of the Latin column prints CVM where the stone reads QVM (visible in the photograph in the previous section). EDCS, EDR, and PHI 140601 all inherit "CVM" from the print editions. Only PHI 175744 and I.Sicily preserve "QVM", and only I.Sicily encodes both forms with EpiDoc <choice> markup. The textual normalization that began on Mommsen's printing block in 1883 was still propagating in 2024.
  • Word-spacing got invented. The actual stone has no word dividers. Mommsen (CIL) introduces spaces in some lines (e.g. ΝΑΟΙϹ ΙΕΡΟΙϹ). Kaibel (IG) runs everything together (ΝΑΟΙϹΙΕΡΟΙϹ). The modern databases mostly follow Mommsen, but the 19th-century split is still visible in the transcription differences between PHI 175744 and PHI 140601.
  • The Greek got a "normalized" version. Below the columns, Kaibel supplies a single normalized Greek sentence with accents, breathings, and spacing: Στῆλαι ἐνθάδε τυποῦνται καὶ χαράσσονται ναοῖς ἱεροῖς σὺν ἐνεργείαις δημοσίαις. Mommsen does not. This is the line from which the modern databases (PHI especially) take their text. The "canonical" Greek text in PHI today is Kaibel's 1890 reading, four iterations away from the stone.

What the digital era discarded entirely

The Latin commentary paragraphs visible in both print snapshots (Mommsen's argument from the quality of the lettering, Kaibel's quotation of Mommsen's verdict, both editors' interpretations of the bilingual stonecutter) are gone. Apart from I.Sicily, none of the modern databases has a structured field for "editor's interpretive commentary." EDR records "Textus secundum (6)", meaning "text follows reference 6 [Manni Piraino 1973]." That's it. The argumentative scholarly culture of 19th-century epigraphy was replaced by a database row. I.Sicily's TEI commentary division (<div type="commentary"> in EpiDoc) is the only modern tool here that admits prose commentary back into the structured record, which is why I.Sicily looks so much richer than the others in this case study. (For the underlying TEI block structure that supports this, including msDesc, physDesc, history, apparatus, named entities, and certainty markers, see the Atlas · EpiDoc deep-dive series.)

The pattern, in one sentence. Each transcription discards what it cannot encode. Stones lose nothing visually but require a viewer; print editions lose visual evidence but encode argument; digital databases lose argument but enable scale. The JDH paper operates in the last layer, and its work is to compensate for that loss.

§ 2 Five database views

Five major epigraphic databases (and one notable absence) record this same inscription.

I.Sicily
ISic000470
Oxford / Prag. EpiDoc TEI XML, full physical description, autopsy 2017, marble pXRF analysis, 25+ bibliography entries, high-res tiled image, CC-BY 4.0, Zenodo DOI.
EDR · Roma
EDR140617
Sapienza Roma. Italian EAGLE federation member. Latin metadata fields, full text with editorial corrections, ~14 bibliography entries, photo gallery (separate page).
EDCS · Clauss/Slaby
EDCS-22000882
Manfred Clauss / Universität Zürich. The largest Latin corpus by count, but with the lightest editorial layer. Flat HTML record, no API. The corpus the JDH paper scrapes.
PHI Greek · 1
PH175744
Packard Humanities Institute. Indexed under IGLPalermo 139. Greek text, archaizing Latin transcribed as qum operum. Bibliography names three competing dates.
PHI Greek · 2
PH140601
PHI again. Indexed under IG XIV 297. Same physical inscription, but transcribed as cum operum (not qum) and listed as "undated". The same database lists this inscription twice with different transcriptions.
EDH · Heidelberg
(absent)
The peer-reviewed corpus the JDH paper treats as its highest-quality source. Does not contain this inscription (likely outside its provincial scope, since EDH historically focused on the Latin-Western provinces and limes). Anyone analyzing only EDH would miss the bilingual Sicilian advertising stele entirely.

§ 3 Side-by-side comparison

Same physical object. Five databases. Pick any field to see how they disagree.

Field | I.Sicily | EDR | EDCS | PHI 175744 | PHI 140601
Database ID | ISic000470 | EDR140617 | 22000882 | PH175744 | PH140601
Indexed under | Native ID | Native ID + TM | Native ID | IGLPalermo 139 | IG XIV 297
Region | Italy / Sicily / Palermo | Sic? / Panhormus? / Palermo? | unknown | Sikelia — Prov. unkn. [Palermo] | Sikelia — Prov. unkn. [Palermo]
Inventory no. | Salinas, inv. 3574 | Salinas, inv. 8822 | not recorded | not recorded | not recorded
Width (cm) | 24.5 | 14.5 | - | - | -
Date range | 1–200 CE | 100 BCE – 100 CE | - | late 2nd c. AD; or late Repub.; or 1st c. AD | undated
Latin line 12 | qum operum (reg.: cum) | qum (:cum) operum | - | qum operum | cum operum
Material analysis | marble · 6 candidate quarries (pXRF) | marmor | - | - | -
Translation | English (Prag) | - | - | - | -
Image | tiled TIF 3680 × 5520 px + JPG | photo via gallery | scattered, when present | none | none
Bibliography (#) | 25+ | 14 | few | 3 | 1
Commentary | long, includes Punic-speaker debate | "text follows Manni Piraino 1973" | - | - | -
License | CC-BY 4.0 | CC-BY-NC-SA 4.0 | unspecified | unspecified | unspecified
Persistent DOI | 10.5281/zenodo.4337543 | - | - | - | -
Note: "-" marks a cell with no value in the source table.

§ 4 The seven issues

Issue 1

ID proliferation: one stone, eleven names.

No single canonical identifier exists for this inscription. It is referenced by at least eleven distinct schemes — five born-digital database IDs, three print-corpus references, two epigraphic-bulletin references, and one persistent DOI. Worse, PHI alone uses two different IDs because its database is organized by which printed edition the text comes from, so the same physical inscription gets one PHI number per edition that published it.

I.Sicily ISic000470; EDR 140617; EDCS 22000882; PHI 175744; PHI 140601; TM 491798; DOI zenodo.4337543; CIL X 7296; IG XIV 297; ILS 7680; IGR 1.503; CIG 5554; IGLPalermo 139; ILMusPalermo 74; IGMusPalermo 139; AE 2000.643 / 2005.671 / 2011.437; SEG 26.1854 / 39.1017 / 44.1699 / 50.1016 / 61.753
Consequence for ETL: automated deduplication across datasets must rely on a multi-way crosswalk. The Trismegistos ID (TM 491798) and I.Sicily's TEI publicationStmt (which lists EDR, EDCS, and both PHI numbers) are the only places this crosswalk is recorded explicitly. Any dataset that doesn't ingest one of those will silently double-count this inscription. (For the broader identifier-systems map across all inscription databases, see the Atlas · Identifier systems table.)
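The crosswalk dependency can be illustrated with a minimal sketch. The identifiers come from this case study, but the record layout (an `id` plus a `same_as` list) and the `cluster` helper are hypothetical and are not taken from any SDAM pipeline.

```python
from collections import defaultdict

# Hypothetical record layout: each record names the external IDs it declares
# as equivalent. The IDs themselves are the ones quoted in this case study.
records = [
    {"id": "ISic000470",
     "same_as": ["EDR140617", "EDCS-22000882", "PH175744", "PH140601", "TM 491798"]},
    {"id": "EDR140617", "same_as": ["TM 491798"]},
    {"id": "EDCS-22000882", "same_as": []},
    {"id": "PH175744", "same_as": []},
    {"id": "PH140601", "same_as": []},
]

def cluster(records):
    """Group records that are linked, directly or transitively, by declared IDs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for rec in records:
        find(rec["id"])
        for other in rec["same_as"]:
            union(rec["id"], other)

    groups = defaultdict(set)
    for rec in records:
        groups[find(rec["id"])].add(rec["id"])
    return list(groups.values())

with_crosswalk = cluster(records)
without_isicily = cluster([r for r in records if r["id"] != "ISic000470"])
```

With the I.Sicily record present, all five records collapse into one cluster. Without it, the remaining four records stay separate, which is the silent double-counting described above.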
Issue 2

The same museum, two different inventory numbers.

I.Sicily records the inscription as Salinas Museum inventory 3574 (with a former Museo Salnitriano number 51). EDR records it as Salinas inventory 8822. They cannot both be right. Either the museum changed inventory numbers and one database didn't update, or one of them transcribed wrong. Externally there is no way to tell which.

I.Sicily
Inv. 3574 · former Museo Salnitriano no. 51
Salinas Museum, Palermo
EDR
Inv. 8822
Salinas Museum, Palermo
Consequence: without a museum-level authority record (which neither database links to), there is no third-party source of truth. Anyone aggregating museum-inventory data across Sicilian collections has to pick one or flag both as uncertain.
Issue 3

Three different date ranges, four databases, four answers.

The dating disagreement here would single-handedly distort the JDH paper's "epigraphic habit" curve at this inscription's location:

I.Sicily
notBefore 0001, notAfter 0200 CE
"Augustan or Julio-Claudian, by letter forms"
EDR
100 BCE – 100 CE
no further reasoning given
PHI 175744
"late 2nd c. AD" (or late Republican / 1st c. AD)
three competing dates listed
PHI 140601
undated
no scholar named
Consequence for the JDH paper: under midpoint dating, this single inscription would land at year 100 CE in I.Sicily, year 0 in EDR, year 175 in PHI 175744 (if the late date is taken), and contribute nothing in PHI 140601. Under tempun's probabilistic dating, it would contribute a 200-year-wide uniform mass, but only if the database it came from has a date range. PHI 140601's "undated" record drops out of every temporal aggregation entirely.
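The arithmetic behind those figures can be sketched as follows. This is an illustration only: `midpoint` and `uniform_mass` are hypothetical helpers, not tempun's API, and reading "late 2nd c. AD" as 150–200 is an assumption made for the example.

```python
def midpoint(start, end):
    """Midpoint of a date range, with BCE years as negative numbers."""
    return (start + end) / 2

def uniform_mass(start, end, bin_start, bin_end):
    """Share of a uniform distribution over [start, end) falling in [bin_start, bin_end)."""
    overlap = max(0, min(end, bin_end) - max(start, bin_start))
    return overlap / (end - start)

# Date ranges as quoted in the case study; None means "undated".
ranges = {
    "I.Sicily": (1, 200),
    "EDR": (-100, 100),
    "PHI 175744": (150, 200),  # assumption: "late 2nd c. AD" read as 150-200
    "PHI 140601": None,
}

midpoints = {db: midpoint(*r) if r else None for db, r in ranges.items()}

# How much of this inscription each source places in the 2nd century CE.
second_century = {
    db: uniform_mass(*r, 100, 200) if r else 0.0 for db, r in ranges.items()
}
```

Under these assumptions the same stone contributes roughly half of its weight to the 2nd century in I.Sicily, none in EDR, all of it in PHI 175744, and nothing anywhere in PHI 140601.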
Issue 4

The "qum / cum" problem: two transcriptions of the same line.

Latin column, line 12. PHI 175744 transcribes qum operum — preserving the archaic spelling that helps date the text. PHI 140601 transcribes cum operum — silently normalizing it to standard Latin. Same database, same inscription, different transcriptions.

PHI 175744
qum operum
archaizing form preserved
PHI 140601
cum operum
silently normalized
I.Sicily
<choice><orig>qum</orig><reg>cum</reg></choice> operum
EpiDoc records both
Consequence: text-mining for archaic Latin features (a real research question; see Kruschwitz 2000 in the bibliography) would find this inscription via PHI 175744 but miss it via PHI 140601, or via EDCS depending on which transcription it inherited. Only I.Sicily's EpiDoc <choice> markup makes both the original and the regularized form recoverable from a single record.
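A short sketch of what "recoverable from a single record" means in practice. The XML fragment is the one quoted above, wrapped in a bare <ab> element for parsing; real EpiDoc files use the TEI namespace and far richer markup, both omitted here as a simplifying assumption. The `render` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified fragment: no TEI namespace, a bare <ab> wrapper (assumption).
line = ET.fromstring(
    "<ab><choice><orig>qum</orig><reg>cum</reg></choice> operum</ab>"
)

def render(elem, prefer):
    """Flatten an element to text, keeping only the `prefer` child of each <choice>."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "choice":
            picked = child.find(prefer)
            if picked is not None:
                parts.append(render(picked, prefer))
        else:
            parts.append(render(child, prefer))
        parts.append(child.tail or "")  # text following the child element
    return "".join(parts)

diplomatic = render(line, "orig")
regularized = render(line, "reg")
```

A search for archaic spellings would run against `diplomatic`, while a lemmatizer or a cross-database comparison would use `regularized`. A flat-text database has to pick one of the two at ingest time.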
Issue 5

Image presentation: from a tiled TIF to nothing at all.

The image situation across these databases is wildly asymmetric:

  • I.Sicily — high-resolution tiled TIF (3680 × 5520 px) plus a print JPG, both encoded in the EpiDoc <facsimile> element with explicit attribution and license (CC-BY 4.0).
  • EDR — typically has photos but on a separate gallery page; no anchored regions, no IIIF.
  • EDCS — image presence is inconsistent across records; when present, it's a flat JPG with no metadata.
  • PHI (both records) — no images at all. PHI is a text-only corpus by design.
Consequence: paleographic, material, and conservation analyses depend on photos, but a macro-historical analysis built on top of EDH+EDCS (the JDH paper) operates almost entirely without them. The inscription's visual evidence, including the letter forms that are key to dating it, is invisible to the data layer.
Issue 6

The text↔image anchoring problem.

Even where images do exist, the relationship between the transcribed text and the photograph is rarely explicit. EpiDoc supports <facsimile> with <zone> elements that can pin each line (or even each character) to pixel coordinates on the photo. Almost no databases use this capability fully.

I.Sicily's TEI declares letter-height measurements per line (line 1 = 22 mm, line 2 = 20 mm, line 3 = 8 mm, lines 4–7 = 10 mm) but does not tag pixel zones on the photograph. The reconstruction below is built directly from those measurements and from the actual photograph of the stone, and shows what zone-anchoring would look like if any of the five databases published it.

[Figure: stylised SVG facsimile after the I.Sicily photograph (CC-BY), with each line's letter height from the TEI marked on the right edge of the plaque.]

Line 1: ϹΤΗΛΑΙ · TITVLI (22 mm)
Line 2: ΕΝΘΑΔΕ · HEIC (20 mm)
Line 3: ΤΥΠΟΥΝΤΑΙ ΚΑΙ · ORDINANTVR ET (8 mm)
Line 4: ΧΑΡΑϹϹΟΝΤΑΙ · SCVLPVNTVR (10 mm)
Line 5: ΝΑΟΙϹ ΙΕΡΟΙϹ · AIDIBVS SACREIS (10 mm)
Line 6: ϹΥΝ ΕΝΕΡΓΕΙΑΙϹ · QVM OPERVM (10 mm)
Line 7: ΔΗΜΟϹΙΑΙϹ · PVBLICORVM (10 mm)

The actual photograph is hosted by I.Sicily under CC-BY 4.0; this SVG is a stylized reconstruction matching the published letter-height measurements.

Notice three things the SVG demonstration makes concrete:

  • The dramatic letter-height jump between lines 2 and 3 (20 mm → 8 mm) is itself a paleographic feature. It signals that the cutter laid out the headline ("STELAI · TITULI / ENTHADE · HEIC") in display capitals, then continued the body in a markedly smaller script. A purely text-based dataset row has no way to encode this, but a zone-anchored facsimile preserves it for free.
  • The dividing groove visible down the middle is itself a typographic decision — the cutter physically separates the two languages with a deeply incised vertical line, treating them as parallel columns rather than running text. None of the five databases encodes "deeply incised vertical column divider" as a structured field.
  • The orange ferruginous staining visible in the photograph (and reproduced as small spots in this SVG) is metadata about conservation history, not the inscription itself. It belongs to material analysis. Only I.Sicily's TEI <objectDesc>/<condition> can express such things; the other databases have nowhere to put them.
Consequence: without text-image anchoring, an editor's correction ("the third letter on line 5 is actually a sigma not an epsilon") cannot be verified from the data layer; the dramatic letter-height shift between lines 2 and 3 cannot be quantified; the dividing groove cannot be searched for; and the orange staining cannot be tracked across conservation campaigns. Visual epigraphic features become metadata-only attributes, severed from the evidence that generated them. This severance is exactly what makes "inscriptions as data" feel reductive to traditional epigraphers, and exactly the gap a fully encoded EpiDoc <facsimile> with <zone> elements would close, were it adopted across the corpus rather than at one project alone.
Issue 7

The EDH gap.

The Heidelberg corpus — which the JDH paper repeatedly invokes as its highest-quality dataset — does not include this inscription. The I.Sicily TEI explicitly leaves the <idno type="EDH"/> field empty. Sicily and southern Italy have historically been outside EDH's scope (it focused on the Latin western provinces and the limes). The inscription is in EDCS, but EDCS-only analyses inherit all of that database's editorial roughness.

Consequence: any claim of the form "Latin inscriptions show pattern X across the Empire" that is computed only on EDH is missing huge parts of the southern Mediterranean and all of Sicily's bilingual epigraphic culture. The JDH paper's careful EDCSx subset (EDCS filtered to EDH-comparable quality) excludes this inscription too, because it would only enter EDCSx if EDH had it. The inscription is therefore present in the raw EDCS curve but absent from the most rigorous comparable curve in the paper.

§ 5 A merge simulator

If you fed all five database records into the SDAM LI_ETL deduplication pipeline, here's what would happen.

I.Sicily
date: 1–200 CE
inv: 3574
width: 24.5
EDR
date: -100–100
inv: 8822
width: 14.5
EDCS
date: ?
inv: —
width: —
PHI 175744
date: 100–200
line12: qum
PHI 140601
date: undated
line12: cum
CONFLICT · date: 5 candidates spanning 100 BCE → 200 CE
CONFLICT · inventory: 3574 ≠ 8822
CONFLICT · width: 24.5 ≠ 14.5 cm
CONFLICT · transcription line 12: qum vs cum
WARNING · 2 PHI records likely refer to same inscription (TM 491798)
NOTICE · EDH absent — out of scope
Naive merge resolution (pick highest-quality source per field): take I.Sicily everywhere it has data, fall back to EDR otherwise. Net result: this single inscription cannot be merged into a flat row without losing scholarly content from the other four records: bibliography, alternate transcriptions, alternate dating, alternate inventory.
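The simulated output above can be approximated with a small sketch. The field values are the ones shown in the simulator cards; the `merge` function and the priority list are hypothetical and do not reproduce the actual LI_ETL code.

```python
# Field values as shown in the simulator cards; missing keys mean "no value".
records = {
    "I.Sicily":   {"date": (1, 200), "inv": "3574", "width_cm": 24.5},
    "EDR":        {"date": (-100, 100), "inv": "8822", "width_cm": 14.5},
    "EDCS":       {},
    "PHI 175744": {"date": (100, 200), "line12": "qum"},
    "PHI 140601": {"line12": "cum"},
}

# Assumed quality ranking for the naive "best source wins" rule.
priority = ["I.Sicily", "EDR", "EDCS", "PHI 175744", "PHI 140601"]

def merge(records, priority):
    """Return (flat merged row, conflicts); conflicts keep every competing value."""
    fields = sorted({field for rec in records.values() for field in rec})
    merged, conflicts = {}, {}
    for field in fields:
        values = {db: rec[field] for db, rec in records.items() if field in rec}
        if len(set(values.values())) > 1:
            conflicts[field] = values
        for db in priority:
            if db in values:
                merged[field] = (values[db], db)  # value plus its provenance
                break
    return merged, conflicts

merged, conflicts = merge(records, priority)
```

Every field in the flat `merged` row also appears in `conflicts`, so the flat row is only safe to publish alongside the conflict record that says what was discarded.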

§ 6 Why this hinders the JDH paper's analyses

The Heřmánková–Kaše–Sobotková paper is one of the most rigorous attempts to do macro-history with this kind of data. This single inscription shows where every single one of its careful methodological moves still has to absorb cost.

  • The "epigraphic habit" curve (Fig 1): one inscription with a 200-year date range contributes a flat plateau of 1/200 = 0.005 inscriptions per year, spread over those 200 years, to the empire-wide aggregate. Across hundreds of thousands of inscriptions this averages out, but only if the date ranges are consistent. Here, four different answers from four databases would each shift the local curve.
  • Provincial distribution (Figs 4–6): EDR's "Sic?" with a question mark cannot be aggregated cleanly with EDH's confident "Sicilia" attribution. A clean province-rank-order plot must either drop "Sic?" (losing data) or treat it as confident "Sicilia" (overcounting).
  • Type distribution (Fig 2): I.Sicily classifies this inscription as function.advertisement (EAGLE term 128). EDR doesn't have an explicit type for "advertisement" — it falls under cetera (other). PHI doesn't classify by function at all. The same physical object would be counted as "advertisement" in one analysis, "other" in another, "no type" in a third.
  • The bilingual question: I.Sicily encodes this as biversion.duplicating Latin + Greek with a structured taxonomy (textLang mainLang="la" otherLangs="grc"). EDCS has latina-graeca as a flat string, EDR has latina-graeca as a flat string, PHI catalogs it under "Greek" because it's in PHI Greek. Aggregating "how many Latin inscriptions are bilingual?" is therefore a labyrinth.
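The choice described in the provincial-distribution point (drop uncertain attributions, trust them, or keep them flagged) can be made explicit in code. The sample rows below are invented for illustration and are not drawn from any database; the three policy names are likewise hypothetical.

```python
from collections import Counter

# Invented sample of (province, certain) pairs. A label such as EDR's "Sic?"
# would be represented here as ("Sicilia", False).
sample = [
    ("Sicilia", True),
    ("Sicilia", False),
    ("Sicilia", False),
    ("Sardinia", True),
]

def count_provinces(rows, policy):
    """Count attributions under one of three policies: 'drop', 'trust', 'flag'."""
    counts = Counter()
    for province, certain in rows:
        if certain or policy == "trust":
            counts[province] += 1
        elif policy == "flag":
            counts[province + "?"] += 1  # keep uncertain rows in their own bucket
        elif policy != "drop":
            raise ValueError(f"unknown policy: {policy}")
    return dict(counts)

dropped = count_provinces(sample, "drop")
trusted = count_provinces(sample, "trust")
flagged = count_provinces(sample, "flag")
```

The three policies give three different rank orders from the same rows, which is why the policy needs to be stated next to any province-level plot.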
The takeaway. The JDH paper's central methodological commitment (make the dataset construction itself transparent and reproducible) is exactly the right response to the situation this case study illustrates. There is no "the correct dataset" hidden behind the messiness; there are only choices, and good macro-history makes those choices visible. The case of ISic000470 shows what happens to a single object when those choices are not made explicit: it lives in five databases, with at least four conflicting dates, two conflicting inventory numbers, two conflicting widths, and two conflicting transcriptions of one of its lines.

要点。 JDH 论文最核心的方法论承诺,把数据集的构建过程本身做到透明、可复现,正是对本案例情形的正确回应。不存在一个隐藏在杂乱背后的"正确数据集",只有抉择;而好的宏观史把这些抉择呈现出来。ISic000470 的故事说明,如果不把抉择讲清楚,单单一件文物就会出现在五个数据库里、至少有四个冲突的定年、两个冲突的馆藏号、两个冲突的尺寸,以及一行文字的两种冲突转写。
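
One hedged way to keep such a choice visible, taking the province case from the list above: parse the uncertainty marker into its own field instead of silently dropping or trusting it, and turn the inclusion rule into an explicit parameter. The `Province` class, the `PROVINCE_NAMES` lookup and the function names are hypothetical; only the "Sic?" and "Sicilia" labels come from the records discussed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Province:
    raw: str        # value exactly as the source database gives it
    name: str       # harmonised province name
    certain: bool   # False when the source flags the attribution with "?"


# Hypothetical lookup from source abbreviations to one spelling.
PROVINCE_NAMES = {"Sic": "Sicilia", "Sicilia": "Sicilia"}


def parse_province(raw):
    """Split a raw province label into a harmonised name and a certainty flag."""
    text = raw.strip()
    certain = not text.endswith("?")
    key = text.rstrip("?").strip()
    return Province(raw=raw, name=PROVINCE_NAMES.get(key, key), certain=certain)


def count_by_province(values, include_uncertain):
    """Count records per province; the caller must state the uncertainty policy."""
    counts = {}
    for province in map(parse_province, values):
        if province.certain or include_uncertain:
            counts[province.name] = counts.get(province.name, 0) + 1
    return counts


if __name__ == "__main__":
    labels = ["Sic?", "Sicilia"]
    print(count_by_province(labels, include_uncertain=True))
    print(count_by_province(labels, include_uncertain=False))
```

Because `include_uncertain` has no default, every province plot built on this helper has to say which of the two options (lose data or overcount) it took, and the raw label stays available for anyone who wants to decide differently.
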

§ 7 What I.Sicily models well

I.Sicily 提供的范例

It is fair to single out one record as a benchmark. I.Sicily's TEI for ISic000470 demonstrates, in a single open-access XML file, what comprehensive data-aware epigraphy looks like:

单独把一条记录作为基准并无不公。I.Sicily 为 ISic000470 提供的那份 TEI 文件,集中展现了"数据意识"完备的铭文学应具备什么:

  • Identifier crosswalk — every external ID (EDR, EDCS, PHI ×2, TM, DOI, all the print refs) is recorded in publicationStmt.
  • 编号对照表:在 publicationStmt 中记录每一个外部编号(EDR、EDCS、两个 PHI、TM、DOI、所有印本参引)。
  • Linked authority files — Pleiades for the ancient place (Panhormus = pleiades 462410), GeoNames for the modern (Palermo = 2523920), Eagle Network vocabularies for inscription type and material, ORCID for editors.
  • 权威文件链接:古地名链接 Pleiades(Panhormus = 462410)、现代地名链接 GeoNames(Palermo = 2523920)、铭文类型与材质链接 Eagle Network 词表、编辑者链接 ORCID。
  • Original alongside cleaned: <choice><orig>qum</orig><reg>cum</reg></choice> is the EpiDoc convention for keeping both forms recoverable from the same record.
  • 原值与清洗值并存:<choice><orig>qum</orig><reg>cum</reg></choice>,EpiDoc 的这一约定让原始形式与规范形式都能从同一条记录中恢复。
  • Material analysis as scholarly contribution — pXRF candidate quarry list with the specific scholar (Alessia Coccato) credited.
  • 把材质分析当学术成果:pXRF 候选采石场列表,并具名(Alessia Coccato)。
  • Versioned provenance: <revisionDesc> records every edit to the record, dated and signed by ORCID.
  • 带版本号的修订记录:<revisionDesc> 记录对该条目的每一次修改,带日期与 ORCID 签名。
  • License + DOI — explicit CC-BY 4.0, with a Zenodo DOI for citing this specific record.
  • 许可 + DOI:明确 CC-BY 4.0;附 Zenodo DOI,可引用本条目特定版本。
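
To show what "both forms recoverable" means in practice, here is a minimal sketch that reads a TEI fragment and renders either the original or the regularised spelling. The fragment is a simplified stand-in: only the `<choice>` element mirrors the record, and the surrounding words `verbum` and `alterum` are placeholders, not the real transcription. The `render` helper is hypothetical and handles only `<choice>` with `<orig>` and `<reg>` children.

```python
import xml.etree.ElementTree as ET

# TEI namespace in ElementTree's {uri}tag notation.
TEI = "{http://www.tei-c.org/ns/1.0}"

# Simplified stand-in fragment; "verbum" and "alterum" are placeholder words.
SAMPLE = (
    '<ab xmlns="http://www.tei-c.org/ns/1.0">verbum '
    "<choice><orig>qum</orig><reg>cum</reg></choice>"
    " alterum</ab>"
)


def render(elem, prefer):
    """Flatten elem to plain text, keeping only the `prefer` child
    ("orig" or "reg") of every <choice> element."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == TEI + "choice":
            picked = child.find(TEI + prefer)
            if picked is not None:
                parts.append(render(picked, prefer))
        else:
            parts.append(render(child, prefer))
        # Text that follows the child element belongs to the parent.
        parts.append(child.tail or "")
    return "".join(parts)


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(render(root, "orig"))
    print(render(root, "reg"))
```

The same file thus serves a palaeographer who needs `qum` and a text-mining pipeline that needs `cum`, without either form being overwritten during cleaning.
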

If every Latin and Greek epigraphic database recorded inscriptions to this standard, the JDH paper's data-construction story would be much shorter — and macro-historical aggregation across databases would be tractable instead of artisanal.

如果所有拉丁文与希腊文铭文数据库都按这一标准记录铭文,JDH 论文的"数据构建"叙事会大大缩短,跨库宏观汇总也将由"手工活"变为"可计算"。
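
A rough illustration of why the identifier crosswalk makes aggregation tractable: once the external IDs are recorded against one canonical identifier, records from different databases can be grouped by a dictionary lookup instead of by hand. The IDs are the ones listed in this case study's sources; the dictionary layout, the function names and the unmatched `999999` record are assumptions made for the example, not the structure of the I.Sicily file.

```python
# Crosswalk keyed by the canonical I.Sicily identifier (layout is hypothetical).
CROSSWALK = {
    "ISic000470": {
        "EDR": ["140617"],
        "EDCS": ["EDCS-22000882"],
        "PHI": ["175744", "140601"],
    }
}


def reverse_index(crosswalk):
    """Map every (database, external id) pair back to its canonical id."""
    index = {}
    for canonical_id, externals in crosswalk.items():
        for database, ids in externals.items():
            for external_id in ids:
                index[(database, external_id)] = canonical_id
    return index


def group_records(records, index):
    """Group (database, external id) pairs by canonical id.

    Records with no crosswalk entry are collected under the key None,
    so they stay visible instead of being dropped.
    """
    groups = {}
    for database, external_id in records:
        canonical_id = index.get((database, external_id))
        groups.setdefault(canonical_id, []).append((database, external_id))
    return groups


if __name__ == "__main__":
    index = reverse_index(CROSSWALK)
    harvested = [
        ("EDR", "140617"),
        ("PHI", "175744"),
        ("PHI", "140601"),
        ("EDR", "999999"),  # made-up id with no crosswalk entry
    ]
    for canonical_id, members in group_records(harvested, index).items():
        print(canonical_id, members)
```

Keeping unmatched records under `None` is a deliberate choice in this sketch: the residue that the crosswalk cannot resolve is exactly the part that still needs manual work.
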

§ 8 Sources for this case study

本案例的出处

The five database records used

使用的五条数据库记录

  1. I.Sicily ISic000470 — primary record, with linked TEI EpiDoc XML.
  2. ISic000470 raw TEI XML — the source of every detail in this walkthrough.
  3. EDR 140617
  4. EDCS-22000882
  5. PHI 175744 (IGLPalermo 139)
  6. PHI 140601 (IG XIV 297)

Print editions cross-referenced

交叉参考的印本

  1. Mommsen, T. (1883). CIL X, no. 7296.
  2. Kaibel, G. (1890). IG XIV, no. 297.
  3. Manni Piraino, M. T. (1973). IGMusPalermo, no. 139.
  4. Bivona, L. (1970). ILMusPalermo, no. 74.
  5. Wilson, R. J. A. (1990). Sicily under the Roman Empire, p. 314 fig. 266.
  6. Tribulato, O. (2011 / 2012). On the bilingualism of this inscription.
  7. Consani, C. (2021). On the inscription as the work of a Latin speaker translating literally into Greek.

The methodological frame

方法论框架

  1. Heřmánková, P., Kaše, V., & Sobotková, A. (2021). Inscriptions as data: digital epigraphy in macro-historical perspective. Journal of Digital History, 1(1). doi.org/10.1515/jdh-2021-1004 — see the Paper Edition for a section-by-section walkthrough.
  2. Kaše, V., Sobotková, A., & Heřmánková, P. (2023). Modeling Temporal Uncertainty in Historical Datasets. CHR 2023. CEUR-WS

Companion editions

配套版本

  1. Paper Edition: JDH 2021 walkthrough. JDH 2021 论文逐节导读。
  2. Reference Edition: deep technical companion to all 37 SDAM repositories. SDAM 37 个仓库的深度技术配套。
  3. Visual Edition: 19-slide interactive intro to ETL. ETL 的 19 张交互式幻灯片导览。
  4. Landing page: choose your starting point. 选择入口。