From stone to data, in depth.
A complete technical companion to the SDAM project's open ETL pipelines, temporal-uncertainty toolkit, helper packages, and analysis ecosystem — covering all 37 repositories.
This is the Reference Edition. If you want intuition first, the Visual Edition explains the same material as a 19-slide slideshow with metaphors, button-driven interactives, and code hidden behind toggles. The two editions cross-reference each other throughout — switch any time using the toggle in the top bar. Click any underlined term for an inline glossary popover with related links.
The article behind the project
Heřmánková, P., Kaše, V., & Sobotková, A. (2021). Inscriptions as data: digital epigraphy in macro-historical perspective. Journal of Digital History, 1(1). doi.org/10.1515/jdh-2021-1004. Companion notebook: sdam-au/digital_epigraphy. For a section-by-section walkthrough of the article itself, see the Paper Edition.
§ 1 The vocabulary · Visual: E·T·L
Five terms recur across every SDAM repository. Pin them down once and the rest is straightforward.
Source. The original location of the data: an external database (EDH, EDCS, PHI), a public folder of files, or a website. SDAM consumes them; it does not own them.
Extract. Fetching everything the source has, in whatever shape it provides, with no semantic changes. Outputs raw .json or raw .tsv.
Transform. Standardizing fields, normalizing text, enriching with linked data (place gazetteers, authority IDs), deduplicating. Outputs cleaned .json or .parquet.
Load. Publishing the cleaned dataset to two destinations: sciencedata.dk (working storage and public-read folder) and Zenodo (DOI-stamped, citable, frozen archive).
Analysis. What gets built on top of the cleaned data — a substantive research question pursued in its own repository, citing the cleaned dataset by DOI.
§ 2 The full SDAM repository map
All 37 public repositories in the sdam-au organization are grouped by role in a sortable, filterable table with five columns: repository, description, theme, stars (★), and last update. Each repository name links to its GitHub page.
§ 3 The ETL pipelines · Visual: pipelines
SDAM operates six independent ETL pipelines: three for Latin inscriptions, one for Greek inscriptions, one for Greek literary texts, one for Bulgarian burial-mound archaeology. All follow the same architectural pattern.
EDH_ETL sdam-au/EDH_ETL
Latin inscriptions · API + XML · Visual: vending-machine demo
The first and most thoroughly documented SDAM pipeline. Targets the Epigraphic Database Heidelberg, accessing it through both its public JSON API and its EpiDoc XML downloads. The two sources are complementary: the API returns simplified, query-friendly records; the XML preserves the full editorial encoding, including human-readable dating expressions like "around the middle of the 4th century CE" that the API has flattened into not_before/not_after integers.
The numbered scripts (in execution order)
| Script | Language | Action | Output |
|---|---|---|---|
| 1_0 | Python | Extract geo-dictionary from API | EDH_geo_dict_[ts].json |
| 1_1 | Python | Walk the API one inscription at a time (≈12 min) | EDH_onebyone_[ts].json |
| 1_2 | Python | Parse the EpiDoc XML dumps | EDH_xml_data_[ts].json |
| 1_3 | Python | Merge geo + API + XML by inscription ID | EDH_merged_[ts].json |
| 1_4 | R | Standardize attributes (dates, places, types) | EDH_attrs_cleaned_[ts].json |
| 1_5 | R | Clean inscription text | EDH_text_cleaned_[ts].json |
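A minimal sketch of what a merge step like 1_3 has to do, assuming each source has already been loaded into a dictionary keyed by inscription ID. The IDs, field names, and values below are invented for illustration, and the rule that earlier sources win on conflicting keys is an assumption, not the documented behaviour of the EDH script.

```python
# Hypothetical miniature inputs; the real files hold many more fields.
geo = {"HD000001": {"province": "Latium et Campania"}}
api = {"HD000001": {"not_before": 71, "not_after": 130},
       "HD000002": {"not_before": 51, "not_after": 200}}
xml = {"HD000002": {"origdate_text": "51 AD - 200 AD"}}

def merge_by_id(*sources):
    """Union per-ID dictionaries; later sources only add keys that are still missing."""
    merged = {}
    for source in sources:
        for inscription_id, fields in source.items():
            record = merged.setdefault(inscription_id, {"id": inscription_id})
            for key, value in fields.items():
                record.setdefault(key, value)  # never overwrite an earlier source
    return merged

merged = merge_by_id(geo, api, xml)
print(merged["HD000002"])
```

Keeping every ID that appears in any source (a union rather than an intersection) means a record missing from one channel is still carried forward, just with fewer fields.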
The actual API-walking code
    import json
    import time

    import requests

    API = "https://edh.ub.uni-heidelberg.de/data/api/inscriptions/search"
    records, page = [], 1
    while True:
        # Request the next block of 200 records
        r = requests.get(API, params={"limit": 200, "offset": (page - 1) * 200})
        chunk = r.json().get("items", [])
        if not chunk:  # an empty page means the corpus is exhausted
            break
        records.extend(chunk)
        page += 1
        time.sleep(0.2)  # pause between requests
    with open("EDH_onebyone.json", "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False)
Final dataset (2022 v2): 81,883 cleaned Latin inscriptions, 69 attributes. DOI 10.5281/zenodo.7303886. Public mirror at https://sciencedata.dk/public/b6b6afdb969d378b70929e86e58ad975/.
EDCS_ETL sdam-au/EDCS_ETL
Latin inscriptions · scraper · Visual: scraper grid
Targets the Epigraphik-Datenbank Clauss/Slaby. EDCS exposes no API, only an HTML search interface. Extraction is therefore a scraping operation conducted province by province, using Lat Epig 2.0 (Macquarie University). Lat Epig wraps the EDCS interface in Docker and outputs one TSV per Roman province.
End-to-end procedure
- Install Docker; clone mqAncientHistory/Lat-Epig; switch to the scrapeprovinces branch.
- Run bash dockerScraperAll.sh. Takes ~4–5 hours, produces full_scrape_[YYYY-MM-DD]/.
- Move the output into EDCS_ETL/data/2022_09_allProvinces/.
- Run the two R Markdown transform notebooks: 1_1 (merge province TSVs and clean attributes) and 1_2 (clean inscription text).
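The merge in notebook 1_1 is done in R; the Python sketch below only illustrates the idea of stacking one TSV per province into a single table. The file names, column names, and rows are invented stand-ins, and tagging each row with its source file is an assumption added for traceability.

```python
import csv
import io

# Two tiny stand-ins for per-province TSV files (contents are invented).
province_files = {
    "Dalmatia.tsv": "EDCS-ID\tplace\tinscription\nEDCS-1\tSalona\tD M\n",
    "Noricum.tsv": "EDCS-ID\tplace\tinscription\nEDCS-2\tVirunum\tI O M\n",
}

rows = []
for filename, content in province_files.items():
    reader = csv.DictReader(io.StringIO(content), delimiter="\t")
    for row in reader:
        row["source_file"] = filename  # remember which scrape produced the row
        rows.append(row)

print(len(rows), "rows from", len(province_files), "provinces")
```

With real files, `io.StringIO(content)` would be replaced by an opened file handle; the loop itself stays the same.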
Final dataset (2022 v2): 537,286 cleaned Latin inscriptions. DOI 10.5281/zenodo.7072337.
An honest negative result. The notebook 1_5_r_EDCS_text_lemmatization_UDpipe.Rmd attempts UDPipe lemmatization. The README states: "upon closer inspection, the results of such lemmatization were unsatisfactory." The script is kept as a record of what was tried, not as something to use.
LI_ETL sdam-au/LI_ETL
Merged Latin · ML · Visual: merge demo
The most intellectually interesting pipeline. Combines cleaned EDH and EDCS to produce LIST (Latin Inscriptions in Space and Time, the union) and LIRE (Latin Inscriptions of the Roman Empire, a spatio-temporally restricted subset). LIRE is what most published macro-historical analyses use.
The two hard problems
Deduplication. Many inscriptions appear in both EDH and EDCS under different IDs. LI matches them by combining CIL/AE references, geographic proximity, date-range overlap, and text similarity. Where matched, attributes are merged column-by-column with EDH preferred for shared fields.
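One way to picture the four matching signals is as a simple agreement count. This is a sketch under stated assumptions, not the LI_ETL implementation: the record layout, the 5 km and 0.8 thresholds, and the equal weighting of signals are all hypothetical, and the two sample records are invented.

```python
from difflib import SequenceMatcher
from math import asin, cos, radians, sin, sqrt

def km_between(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres (haversine formula)."""
    p1, p2 = radians(lat1), radians(lat2)
    a = sin((p2 - p1) / 2) ** 2 + cos(p1) * cos(p2) * sin(radians(lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def match_score(a, b):
    """Count how many of the four signals agree (0 to 4). Thresholds are illustrative."""
    shared_refs = bool(set(a["refs"]) & set(b["refs"]))
    close = km_between(a["lat"], a["lon"], b["lat"], b["lon"]) < 5
    dates_overlap = a["not_before"] <= b["not_after"] and b["not_before"] <= a["not_after"]
    similar_text = SequenceMatcher(None, a["text"], b["text"]).ratio() > 0.8
    return sum([shared_refs, close, dates_overlap, similar_text])

# Invented example records standing in for one EDH row and one EDCS row.
edh = {"refs": ["CIL 03, 01234"], "lat": 43.54, "lon": 16.48,
       "not_before": 101, "not_after": 200, "text": "Dis Manibus Iuliae vixit annos XX"}
edcs = {"refs": ["CIL 03, 01234", "AE 1999, 0001"], "lat": 43.54, "lon": 16.49,
        "not_before": 151, "not_after": 250, "text": "Dis Manibus Iuliae vixit annos XX"}

print(match_score(edh, edcs))
```

A real matcher would likely weight a shared CIL/AE reference far more heavily than proximity alone; the flat count here is only meant to show how independent signals can be combined.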
Type harmonization. EDH uses an English taxonomy in type_of_inscription_clean; EDCS uses Latin in inscr_type with a different category structure. LI trains a classifier on overlap inscriptions and predicts EDH-style labels for EDCS-only records, recording confidence in type_of_inscription_prob.
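The source does not say which classifier LI trains, so the sketch below swaps in the simplest possible stand-in: a frequency table learned from overlap records, with the vote share playing the role of type_of_inscription_prob. The label pairs are invented, and `predict` is a hypothetical helper.

```python
from collections import Counter, defaultdict

# Overlap records as (EDCS label, EDH label) pairs; the pairs are invented.
overlap = [
    ("tituli sepulcrales", "epitaph"),
    ("tituli sepulcrales", "epitaph"),
    ("tituli sepulcrales", "votive inscription"),
    ("tituli sacri", "votive inscription"),
]

# "Training": count which EDH label each EDCS label co-occurs with.
votes = defaultdict(Counter)
for edcs_label, edh_label in overlap:
    votes[edcs_label][edh_label] += 1

def predict(edcs_label):
    """Return (EDH-style label, share of overlap votes), or (None, 0.0) if unseen."""
    counts = votes.get(edcs_label)
    if not counts:
        return None, 0.0
    label, n = counts.most_common(1)[0]
    return label, n / sum(counts.values())

print(predict("tituli sepulcrales"))
```

Storing the probability next to the predicted label lets downstream analyses filter out low-confidence predictions instead of treating every label as equally reliable.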
One-line load (geopandas)
    import geopandas as gpd

    LIRE = gpd.read_parquet("https://zenodo.org/record/7577788/files/LIRE_v2-1.parquet?download=1")
    LIST = gpd.read_parquet("https://zenodo.org/record/7870085/files/LIST_v0-4.parquet?download=1")
GI_ETL sdam-au/GI_ETL
Greek inscriptions · CSV → JSON · Visual: all corpora
The Greek-inscriptions counterpart to EDH_ETL. Targets the PHI Greek Inscriptions dataset from the Packard Humanities Institute, delivered as a zip of CSV files, and enriches it with metadata from Trismegistos, an interdisciplinary metadata platform for the ancient world.
Note step 1_4: this pipeline integrates tempun-style probabilistic dating directly in the ETL stage, generating a random_dates column ready for downstream Monte Carlo aggregation.
LAGT (Lemmatized Ancient Greek Texts) sdam-au/LAGT
Greek literary texts · lemmatized
SDAM's pipeline for ancient Greek literary texts (as distinct from inscriptions). Aggregates and lemmatizes works from four upstream corpora: Perseus Digital Library, First 1000 Years of Greek, the GLAUx corpus, and EXPRECCE (early Christian texts).
Version 4.1 covers 1,958 works from 475+ authors, totaling 35,809,325 tokens, spanning the 8th century BCE through the 6th century CE.
mounds_ETL sdam-au/mounds_ETL
Archaeology · Bulgaria
The same ETL discipline applied to non-textual archaeology. Targets a dataset of burial mounds from Thracian Bulgaria: a survey-archaeology corpus very different in shape from inscription corpora, but treatable with the same approach.
§ 4 Temporal uncertainty: tempun · Visual
Most ancient artifacts are dated to a range, not a year. tempun treats the range as a probability distribution, draws Monte Carlo samples, and aggregates across thousands of records.
The method paper
Kaše, V., Sobotková, A., & Heřmánková, P. (2023). Modeling Temporal Uncertainty in Historical Datasets. Proceedings of CHR 2023, 413–25. CEUR-WS PDF
The intuition
If 1,000 inscriptions are each dated to "100–200 CE", midpoint dating puts all 1,000 at year 150 — a fake spike. Distributing each inscription's probability uniformly across its range gives a flat plateau from 100 to 200, which is more honest. tempun generalizes this, drawing N random dates per record (default 1,000) and aggregating.
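The spike-versus-plateau contrast can be reproduced with the standard library alone. The sketch below does not use the tempun API; it assumes uniform sampling over each range, treats the end year as exclusive, uses 25-year bins, and draws 100 dates per record (rather than 1,000) purely to keep the example fast. `bin_start` is a hypothetical helper.

```python
import random
from collections import Counter

random.seed(0)  # reproducible draws for the example

records = [(100, 200)] * 1000  # 1,000 inscriptions, each dated 100-200 CE
N = 100                        # random dates drawn per record

def bin_start(year, width=25):
    """Start year of the bin that contains `year`."""
    return (year // width) * width

# Midpoint dating: every record lands in the bin that holds year 150.
midpoint_counts = Counter(bin_start((start + end) // 2) for start, end in records)

# Monte Carlo dating: spread each record over N uniformly drawn years.
sample_counts = Counter()
for start, end in records:
    for _ in range(N):
        sample_counts[bin_start(random.randrange(start, end))] += 1

# Divide by N so each bin reads as an expected number of inscriptions.
expected = {b: c / N for b, c in sorted(sample_counts.items())}

print("midpoint: ", dict(midpoint_counts))
print("simulated:", expected)
```

The midpoint tally puts all 1,000 inscriptions in one bin, while the simulated tally spreads them almost evenly over the four bins between 100 and 200, which is the plateau described above.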
tempun_package
The reusable Python library, MIT-licensed. Functions for: drawing random dates, computing aoristic sums, building probability-mass histograms, combining samples across records.
tempun_demo
Companion notebook for the 2023 CHR paper. Demonstrates Monte Carlo on real and synthetic data, compared with naive midpoint.
tempun-web-interface
A web frontend that lets users upload a CSV with date columns and receive a tempun histogram in the browser, no Python install required.
tempun_in_R
A bridge for invoking the Python tempun from inside R. Useful when the rest of your analysis is in R.
§ 4.5 The two languages: Jupyter (Python) and R
Every SDAM ETL pipeline mixes two computational environments. Knowing which is which makes the rest of the documentation suddenly readable.
What is a Jupyter Notebook?
A Jupyter notebook is a document that interleaves code, narrative prose, output (text, tables, images, charts), and citations in a single browser-based interface. Each notebook is a .ipynb file (just JSON under the hood). You execute one cell at a time, see its output inline, then move on; intermediate variables persist so you can experiment.
For ancient-history data work, the Jupyter format has three useful properties:
- The narrative is part of the artifact. A SDAM extraction notebook explains in prose why it pages the API at 200 records per request, then shows the code that does it, then shows the output. A traditional script would only have the code.
- Outputs are first-class. Charts and tables are part of the saved file, not an afterthought. The seven figures in the JDH paper exist inside the companion notebook Digital_epigraphy.ipynb; re-running the notebook regenerates them.
- Notebooks are diffable. Because the file is JSON, version control (git) tracks who changed which cell, when, and why, making peer review possible at the code level, not just the prose level. SDAM's EDH_ETL/scripts/ shows this explicitly.
SDAM names notebooks by execution order: 1_0_py_…, 1_1_py_…, 1_2_py_… for Python ones, 1_4_r_…, 1_5_r_… for R ones. The number is the canonical order; the suffix tells you which language.
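Because the naming convention is regular, execution order and language can be recovered from the file name alone. The pattern and the three file names below are hypothetical examples that follow the convention; they are not taken from a repository listing.

```python
import re

# <major>_<minor>_<language tag>_... as described in the naming convention.
NAME = re.compile(r"^(\d+)_(\d+)_(py|r)_")

names = ["1_4_r_example.Rmd", "1_0_py_example.ipynb", "1_1_py_example.ipynb"]

def order_and_language(name):
    """Return ((major, minor), language) parsed from a notebook file name."""
    major, minor, lang = NAME.match(name).groups()
    return (int(major), int(minor)), {"py": "Python", "r": "R"}[lang]

run_order = sorted(names, key=lambda n: order_and_language(n)[0])
for name in run_order:
    print(name, "->", order_and_language(name)[1])
```

Sorting on the parsed integers rather than on the raw string keeps the order correct even if a step number ever reaches two digits.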
What is R?
R is a programming language built specifically for statistics and data analysis. It started in 1993 at Auckland and is now maintained by an international foundation. In the digital-humanities world, R is the lingua franca for two clusters of work: traditional statistics (regression, hypothesis tests, time series) and tabular data wrangling via the popular tidyverse family of packages.
SDAM uses R specifically for the cleaning steps (notebooks ending in _r_…). The reason is pragmatic: R's tidyverse offers a particularly clean syntax for the kind of row-by-row coercion that dominates Transform — collapsing synonyms, splitting multi-valued cells, applying regex to thousands of inscription texts. Below is the actual cleaning logic for the EpiDoc text, from EDH 1_5:
    library(stringr)  # str_remove_all(), str_replace_all(), str_squish()

    # Produce two text variants from one EpiDoc-marked inscription:
    #   interpretive: drop brackets/dots, keep restorations
    #   conservative: only what is on the stone
    EDH$clean_text_interpretive <- EDH$inscription_raw |>
      str_remove_all("\\[|\\]") |>    # [...] = restored letters, keep them
      str_remove_all("\\(|\\)") |>    # (...) = expanded abbreviation, keep expansion
      str_remove_all("\\.") |>        # . = missing letter of known length
      str_replace_all("/", " ") |>    # / = line break, preserve as space
      str_squish()                    # collapse whitespace

    EDH$clean_text_conservative <- EDH$inscription_raw |>
      str_remove_all("\\[[^\\]]*\\]") |>  # [...] = restored letters, DROP
      str_remove_all("\\(|\\)") |>        # keep expansions still
      str_remove_all("\\.") |>
      str_replace_all("/", " ") |>
      str_squish()
Each call to str_remove_all() is a scholarly decision. Square brackets "[…]" mark letters that were once on the stone but are now lost and have been restored by the editor: for an interpretive reading you want them; for a paleographic count you don't. SDAM ships both variants in the same dataset row (clean_text_interpretive + clean_text_conservative), which is the "keep the original alongside cleaned" rule made operational.
Coding for Extraction: API walking, in real Python
The actual code that walks the EDH API and produces EDH_onebyone_[ts].json is short — about 25 lines. Here it is, lightly annotated, from EDH 1_1:
    import json
    import time
    from datetime import date

    import requests

    API = "https://edh.ub.uni-heidelberg.de/data/api/inscriptions/search"
    records = []
    page = 1
    PAGE_SIZE = 200  # EDH caps a single response: paginate or lose data

    while True:
        response = requests.get(API, params={
            "limit": PAGE_SIZE,
            "offset": (page - 1) * PAGE_SIZE,
        })
        response.raise_for_status()  # fail fast if the server hiccups
        chunk = response.json().get("items", [])
        if not chunk:  # empty page = we have walked the whole corpus
            break
        records.extend(chunk)
        page += 1
        time.sleep(0.2)  # be a polite citizen of academic infrastructure

    with open(f"EDH_onebyone_{date.today()}.json", "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    # page has already advanced past the last non-empty page, hence page - 1
    print(f"Saved {len(records):,} inscriptions across {page - 1} pages.")
Three things in that snippet are not obvious to non-coders:
- The while True loop is the entire ETL "Extract" stage. It's not magic: it keeps calling the API with bigger offset values until the API hands back nothing. The loop fetches about 410 pages for EDH (81,883 records ÷ 200 per page), plus one final request that returns the empty page.
- The time.sleep(0.2) is a courtesy. It pauses the script for 200 ms between calls so that EDH's server doesn't mistake the harvest for a denial-of-service attack. Academic APIs aren't always rate-limited; treating them as if they were is good citizenship.
- ensure_ascii=False, indent=2 matters. Without ensure_ascii=False, Greek characters (στῆλαι) would be escaped to \uXXXX sequences; the file would still work, but humans couldn't read it. indent=2 puts each field on its own line. Together, these choices are what make the JSON files browsable.
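Two of these points can be checked in a few lines with the standard library. The sketch assumes nothing beyond the figures quoted above (81,883 records, 200 per page) and Python's built-in json module.

```python
import json
import math

# Pages needed to cover the corpus: full pages plus one partial page.
pages = math.ceil(81_883 / 200)
print(pages)

word = "στῆλαι"
escaped = json.dumps(word)                       # default: non-ASCII becomes \uXXXX escapes
readable = json.dumps(word, ensure_ascii=False)  # Greek stays as typed
print(escaped)
print(readable)

# Both forms decode to the same string; only human readability differs.
same_data = json.loads(escaped) == json.loads(readable) == word
print(same_data)
```

The escaped and unescaped files are interchangeable for software, which is why this setting is a choice about human readers rather than about correctness.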
Coding for cleaning: regular expressions on inscription text
Most of the cleaning work is regular-expression substitutions ("regex"). A regex is a tiny pattern language for "find these characters and replace them with those." It's how every database normalizes EpiDoc markup — and it's where most editorial loss happens.
    import re

    raw = "D(is) [M(anibus)] / Iuliae [- - -] / vix(it) ann(os) XX"

    # Step 1: strip parentheses but keep their content (expanded abbreviation)
    no_parens = re.sub(r"[()]", "", raw)
    # 'Dis [Manibus] / Iuliae [- - -] / vixit annos XX'

    # Step 2: drop lacuna markers such as [- - -], which contain no letters
    no_lacunae = re.sub(r"\[[\s-]*\]", "", no_parens)

    # Step 3: handle brackets (the choice point)
    #   interpretive: keep what's inside
    #   conservative: drop the bracketed text entirely
    interpretive = re.sub(r"[\[\]]", "", no_lacunae)
    conservative = re.sub(r"\[[^\]]*\]", "", no_lacunae)

    # Step 4: turn / line breaks into spaces, then collapse whitespace
    interpretive = re.sub(r"\s+", " ", interpretive.replace("/", " ")).strip()
    conservative = re.sub(r"\s+", " ", conservative.replace("/", " ")).strip()

    print("interpretive:", interpretive)  # Dis Manibus Iuliae vixit annos XX
    print("conservative:", conservative)  # Dis Iuliae vixit annos XX  ("Manibus" is gone, because it was restored)
The two final lines tell the whole story. Interpretive reads "To the divine spirits of Iulia, who lived 20 years." Conservative reads "To the divine of Iulia, who lived 20 years", because Manibus was a restoration, not on the stone. The two versions differ only in the bracket pattern: [\[\]] deletes the bracket characters and keeps their contents, while \[[^\]]*\] deletes the brackets together with everything between them. SDAM ships both. EDCS ships neither, only its own single transcription, with no record of which form it chose.
From print to digital: what survives the pipeline?
Now connect the code mechanics back to the print editions. The case study shows the same inscription transcribed by Mommsen in 1883 (CIL X 7296) and Kaibel in 1890 (IG XIV 297). What does an SDAM cleaning pipeline ingest from those — and what does it lose? See the side-by-side print SVGs in the case study.
| Outcome | Content |
|---|---|
| Survives | The text itself · place-name · catalog cross-references (CIL/IG numbers in publicationStmt) · approximate date |
| Mostly lost | Editor's argumentation prose · provenance history ("formerly with the Jesuits") · per-edition typographic decisions · the dating reasoning chain |
| Entirely lost | Editor's interpretive judgment · scholarly disagreement between editors · the difference between Mommsen's and Kaibel's word-spacing choices · the marginalia of generations of readers |
The JDH 2021 paper is therefore not the first abstraction of this inscription. It's the seventh or eighth — and the only one that documents its compromises in code. Reproducibility is not nostalgia for "the original;" it's making the loss legible.
§ 5 Helper packages · Visual
Two libraries hide the ugly plumbing — sciencedata.dk authentication and EDH-specific data wrangling — so analysis notebooks stay short.
sddk (Python) sdam-au/sddk_py
One-line access to sciencedata.dk, the Danish national research-data service that hosts every SDAM cleaned dataset. Wraps WebDAV authentication and IO for JSON/CSV/parquet/pickle. MIT-licensed, on PyPI.
    # pip install sddk
    import sddk

    s = sddk.cloudSession("sdam_au")
    EDH = s.read_file("SDAM_root/SDAM_data/EDH/public/EDH_text_cleaned_2022_11_03.json", "json")
Reading from public folders does not require sddk — direct HTTPS works with pandas.read_json(). Authentication is only needed to write or to access non-public team folders.
sdam (R, on CRAN) sdam-au/sdam
The R counterpart, with significantly more domain logic baked in. Available from CRAN with install.packages("sdam"). Ships an EDH dataset built in (84,701 inscriptions), live API access via get.edh(), and a probability of existence function (prex()) that bins date ranges across periodization schemes.
    library(sdam)
    data(EDH)
    length(EDH)   # [1] 84701
    iud <- get.edh(search="inscriptions", province="Iud")
    prex(x=iud, vars=c("not_before", "not_after"), cp="bin5")
§ 6 Analysis projects · Visual: gallery
SDAM publishes the analyses, not just the data. Each substantive research question lives in its own repository, citing cleaned datasets by DOI.
digital_epigraphy
The flagship. Companion to Heřmánková, Kaše & Sobotková 2021 (JDH). Uses both EDH and EDCS to revisit MacMullen's "epigraphic habit" thesis at full corpus scale.
formulae
Quantitative analysis of recurring epigraphic formulae — phrases like D(is) M(anibus), vix(it) ann(os), p(osuit). Tracks geographic and chronological diffusion as a window onto Roman cultural networks.
NLP_inscriptions
Materials from a Connected Past 2021 presentation summarizing SDAM's NLP experiments on inscription corpora — tokenization, vector embeddings, clustering of formulae.
epigraphic_roads
"Quantitative analysis of inscriptions, detecting road networks in the ancient Mediterranean from inscriptions." Uses inscription geo-distribution as a proxy signal for Roman road infrastructure.
social_diversity
"Division of Labor and Occupational Specialization and Diversification in the Ancient Roman Cities." Mines occupational titles in inscriptions to measure economic complexity across the urban network of the empire.
landscape_prominence + landscape-travel
Two R packages for landscape archaeology. landscape_prominence computes which Roman sites are visible from where. landscape-travel calculates pedestrian travel times across dryland between Mediterranean settlements.
coins · PIA · ASCNET · perachora-medit-arch · OCR · epigraphic_cleaning
Project- and method-specific repos. coins applies SDAM's data discipline to numismatics. perachora-medit-arch is the digital supplement to a Mediterranean Archaeology Journal article. OCR and epigraphic_cleaning are older text-prep experiments whose lessons have been folded into EDH and EDCS transform stages.
§ 7a Extract mechanics · Visual
"Extract" looks different in each pipeline because each source's affordances differ. Unifying principle: don't transform yet.
Three extraction patterns
API walking (EDH). The source offers a programmatic endpoint. Iterate with pagination, save raw JSON. Cheapest and most reliable — when available.
Web scraping (EDCS). Source offers only an HTML interface. Use a headless browser to walk pages. Slow, fragile, but sometimes the only option.
Bulk file ingestion (GI / LAGT). Source releases a zip or git repository of structured files (CSV, XML, TEI). Download once, parse en bloc. Cleanest of the three when available.
§ 7b Transform mechanics · Visual
Transform notebooks share a discipline: each one takes a single named input file, produces a single named output file, never reaches across multiple steps.
What gets standardized
- Dates. Coerce to integers. Distinguish missing from "uncertain". Preserve original prose.
- Places. Resolve free-text against the Pleiades gazetteer.
- Categorical fields. Lower-case, trim, collapse synonyms.
- Inscription text. Produce both interpretive (restorations included) and conservative (only what's on the stone) variants.
The "keep the original" rule
Every cleaned attribute is stored alongside its original under a suffixed name (e.g. type_of_inscription + type_of_inscription_clean). Cleaning is reversible.
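The rule and two of the standardization steps can be sketched together. The SDAM cleaning notebooks are written in R; this Python version is only an illustration, and the synonym table, the trailing "?" as an uncertainty marker, and the not_before_clean / not_before_uncertain column names are all assumptions introduced here.

```python
def clean_label(value):
    """Lower-case, trim, and collapse synonyms (the synonym table is illustrative)."""
    synonyms = {"epitaph?": "epitaph", "funerary": "epitaph"}
    label = value.strip().lower()
    return synonyms.get(label, label)

def clean_year(value):
    """Return (year or None, uncertain flag); None means the source gave no date."""
    if value is None or str(value).strip() == "":
        return None, False
    text = str(value).strip()
    uncertain = text.endswith("?")
    return int(text.rstrip("?")), uncertain

# One invented record: original values stay untouched, cleaned values get new keys.
record = {"type_of_inscription": "  Epitaph? ", "not_before": "150?"}
record["type_of_inscription_clean"] = clean_label(record["type_of_inscription"])
record["not_before_clean"], record["not_before_uncertain"] = clean_year(record["not_before"])

print(record)
```

Because nothing is overwritten, anyone who disagrees with a cleaning decision can start again from the original column without re-running the Extract stage.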
§ 7c Load mechanics · Visual
Two destinations, two purposes.
sciencedata.dk (working)
Each pipeline writes intermediate and final files to a SDAM-owned folder on sciencedata.dk. Public folders are mirrored to a hashed URL anyone can read with no authentication.
Zenodo (permanent)
For each major release, two artifacts are deposited with separate DOIs: one for the dataset, one for the scripts (a tarball of the GitHub release).
| Dataset 2022 | 10.5281/zenodo.7303886 |
| Scripts 2022 | 10.5281/zenodo.7303867 |
§ 8 Quick start · Visual
Use the data without cloning anything.
Latin inscriptions (EDH only)
    import pandas as pd

    EDH = pd.read_json(
        "https://sciencedata.dk/public/b6b6afdb969d378b70929e86e58ad975/EDH_text_cleaned_2022_11_03.json"
    )  # 81,883 inscriptions ready
Merged Latin (LIRE)
    import geopandas as gpd

    LIRE = gpd.read_parquet("https://zenodo.org/record/7577788/files/LIRE_v2-1.parquet?download=1")
Greek texts (LAGT)
    import pandas as pd

    LAGT = pd.read_parquet("https://zenodo.org/records/13889714/files/LAGT_v4-1.parquet?download=1")
R route
install.packages("sdam")
library(sdam)
data(EDH)
§ 9 Citations & DOIs
Method papers
- Heřmánková, P., Kaše, V., & Sobotková, A. (2021). Inscriptions as data. Journal of Digital History, 1(1). doi.org/10.1515/jdh-2021-1004
- Kaše, V., Sobotková, A., & Heřmánková, P. (2023). Modeling Temporal Uncertainty in Historical Datasets. CHR 2023, 413–25. CEUR-WS
Datasets
- EDH 2022 v2 — 10.5281/zenodo.7303886
- EDCS 2022 v2 — 10.5281/zenodo.7072337
- LIRE — zenodo.org/record/7577788
- LIST — zenodo.org/record/7870085
- LAGT v4.1 — zenodo.org/records/13889714
Source databases
- EDH — edh.ub.uni-heidelberg.de
- EDCS — manfredclauss.de
- PHI Greek Inscriptions — inscriptions.packhum.org
- Pleiades — pleiades.stoa.org
- Trismegistos — trismegistos.org