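Text annotation can also be done in an ordinary text editor, so why use MARKUS? For its default named-entity markup, MARKUS draws on authoritative scholarly datasets from mainland China, Taiwan, Korea, and Buddhist studies (Figure 1). I explain the advantages of this approach below. In addition, the keyword markup module offers a range of functions: users can enter keyword lists, generate "keyword in context" (KWIC) lists, or use regular expressions to mark up texts in any language; they can also select terms from any uploaded text for similarity testing in order to detect related keywords. For large text corpora, the batch markup function can tag names, keywords, or regular expressions in dozens or hundreds of files at once, provided the files have been uploaded to the MARKUS file manager. With the allied text comparison tool COMPARATIVUS, readers can detect overlaps between two or more texts, select meaningful overlapping passages from a table or from the text, and send the selected passages back as tags to the relevant files in MARKUS. The default comparison settings have been optimized for Chinese texts but can be modified. For example, when locating and saving quotations from a particular text, the passages sent back by COMPARATIVUS are tagged by default with a standard tag type (comparativus), but the tag name can be edited in MARKUS to distinguish quotations from different texts.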
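To make the idea of KWIC lists and regular-expression markup concrete, here is a minimal Python sketch. It is not MARKUS code: the function names `kwic` and `tag_matches`, the `<span type="...">` output format, and the sample sentence are all assumptions made for illustration, and MARKUS itself performs these steps through its web interface.

```python
import re

def kwic(text, keyword, width=5):
    """Return (left, keyword, right) context triples for each occurrence."""
    rows = []
    for m in re.finditer(re.escape(keyword), text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        rows.append((left, m.group(), right))
    return rows

def tag_matches(text, pattern, tag_type):
    """Wrap every regex match in a hypothetical <span type="..."> tag."""
    def wrap(m):
        return f'<span type="{tag_type}">{m.group()}</span>'
    return re.sub(pattern, wrap, text)

# Invented sample text, used only to exercise the two functions.
sample = "貞觀元年,太宗謂侍臣曰。貞觀二年,太宗問魏徵曰。"

# Keyword-in-context list for one keyword, four characters on each side.
rows = kwic(sample, "太宗", width=4)

# Regular-expression markup: tag reign-year expressions as dates.
tagged = tag_matches(sample, r"貞觀[元一二三四五六七八九十]+年", "date")

for left, key, right in rows:
    print(f"{left} [{key}] {right}")
print(tagged)
```

Batch markup could be imitated by applying `tag_matches` to every file in a folder; the sketch stays with a single string to keep the assumptions visible.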
Creating, Linking, and Analyzing Chinese and Korean Datasets: Digital Text Annotation in MARKUS and COMPARATIVUS
Hilde De Weerdt
Abstract:
MARKUS, a multilingual digital text annotation and analysis platform, allows historians and other researchers to construct datasets from primary sources available to them in full-text digital format. Originally designed for those working with pre-twentieth-century Chinese texts, MARKUS has developed into a multifunctional annotation platform that is particularly suited for the automated annotation, referencing, and visualization of named entities in modern and literary Chinese and premodern Korean texts, but many of its additional annotation features can be used to analyze and read texts in any language, as long as the electronic documents are encoded in the most common standard for language encoding, Unicode. Below I discuss the main goals and methodological features of MARKUS and the allied text comparison utility COMPARATIVUS. I will illustrate these with some examples of how MARKUS has been used in Chinese and Korean historical research.
Keywords:
MARKUS; COMPARATIVUS; Automated Annotation; Linking Datasets; Chinese and Korean Datasets
Hilde De Weerdt, “Creating, Linking, and Analyzing Chinese and Korean Datasets: Digital Text Annotation in MARKUS and COMPARATIVUS,” Journal of Chinese History, vol. 4, no. 2, 2020, pp. 519-527.
We thank the author, Professor Hilde De Weerdt (魏希德), for authorizing this journal to publish the Chinese version.
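MARKUS has been funded by: the European Research Council (under the European Union's Seventh Framework Programme [FP7/2007-2013], ERC grant agreement n°283525; the initial development of MARKUS was carried out by Hilde De Weerdt and Brent Ho (何浩洋), and that of COMPARATIVUS by De Weerdt, Gelein, and Ho), http://chinese-empires.eu; the US National Endowment for the Humanities; the UK JISC Digging into Data Challenge (the machine learning module was developed by 苗圣法), http://did-acte.org/; and the Leiden University Asian Modernities and Traditions Large Grant (K-MARKUS was developed by De Weerdt, Gelein, Ho, 胡静, 金把路, 金炫, and others).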
Paul Vierthaler (李友仁), Mees Gelein, “A BLAST-based, Language-agnostic Text Reuse Algorithm with a MARKUS Implementation and Sequence Alignment Optimized for Large Chinese Corpora,” Journal of Cultural Analytics, March 18, 2019, DOI: 10.31235/osf.io/7xpqe.
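[2] In these examples, the data extracted from the annotations (exported from MARKUS in a range of file formats, or imported into other tools and platforms) is only an added benefit. Structural and semantic annotation can also extend the historical archive. For example, together with Gabe van Beijeren I prepared a digital edition of the Zhenguan zhengyao 《贞观政要》 (The Essentials of Governance from the Reign of Constancy) with a full translation. Compared with conventional print or even digital editions, our digital edition offers a very different way of reading. Readers can observe subtle differences between manuscript and print editions, rearrange the text in various ways, and, on the basis of the MARKUS annotations, filter passages by chronological order, by speaker, or by the persons who appear in them. Readers can also follow links to reference works to look up further information on any relevant term. More practically, the tags can also be used in editing. Lists of official titles, place names, personal names, book titles, or key concepts can easily be exported to standardize translations or to create an index.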
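The export of tag lists described above can be pictured with a short Python sketch. It assumes, purely for illustration, that annotations are stored as `<span type="...">` elements and that the tag types are called `fullName` and `placeName`; the actual MARKUS export formats may differ, and the sample passage is invented.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical inline markup: <span type="TYPE">TEXT</span>
TAG_RE = re.compile(r'<span type="(?P<type>[^"]+)">(?P<text>[^<]+)</span>')

def collect_tags(marked_up):
    """Count each (tag type, tagged text) pair in a marked-up passage."""
    counts = Counter()
    for m in TAG_RE.finditer(marked_up):
        counts[(m.group("type"), m.group("text"))] += 1
    return counts

def to_csv(counts):
    """Serialize the counts as CSV text, e.g. as a starting point for an index."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["type", "text", "count"])
    for (tag_type, text), n in sorted(counts.items()):
        writer.writerow([tag_type, text, n])
    return buf.getvalue()

# Invented passage with invented tags, for demonstration only.
passage = (
    '<span type="fullName">魏徵</span>對<span type="fullName">太宗</span>曰,'
    '<span type="placeName">長安</span>。<span type="fullName">魏徵</span>又曰。'
)

counts = collect_tags(passage)
# A deduplicated list of personal names, e.g. for standardizing translations.
names = sorted(text for (tag_type, text) in counts if tag_type == "fullName")
print(to_csv(counts))
```

Filtering passages by speaker or person would follow the same pattern: keep only the passages whose collected tags include the name of interest.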
Hilde De Weerdt, Gabe van Beijeren, Mees Gelein, “Reading The Essentials of Governance from the Reign of Constancy Revealed Digitally,” 2020, https://chinese-empires.eu/zgzy.
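[3] Chu Mingkin, “The Secret of Long Tenure: A Study of Zheng Gangzhong’s Letters to Qin Hui’s Associates,” T’oung Pao, vol. 102, nos. 1-3, 2016, pp. 121-160; 朱铭坚 (Chu Mingkin), 《金元之际的士人网络与讯息沟通——以〈中州启札〉内与吕逊的书信为中心》 [Literati networks and information exchange in the Jin-Yuan transition: the letters to Lü Xun in the Zhongzhou qizha], 《北大史学》, no. 20, 2016. The texts, data, and interactive reading platform are available at the following address: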
[5] Lik Hang Tsui, Hongsu Wang, “Harvesting Big Biographical Data for Chinese History: The China Biographical Database (CBDB),” Journal of Chinese History, vol. 4, no. 2, 2020, pp. 505-511, DOI: 10.1017/jch.2020.21.
[7] Huimin Bhiksu, Aming Tu, Marcus Bingenheimer, Jen-Jou Hung, et al., Buddhist Studies Authority Database Project, 2008, http://authority.dila.edu.tw/; Peter Bol, Lex Berman, et al., China Historical GIS, 2001, https://www.fas.harvard.edu/~chgis.
[8] Hilde De Weerdt, “The Uses of Digital Philology in Tang-Song History - Part 2,” MARKUS Forum: Research Blogs, January 14, 2017,
[10] Peter Bol, “The Visualization and Analysis of Historical Space,” Journal of Chinese History, vol. 4, no. 2, 2020, pp. 511-519, DOI: 10.1017/jch.2020.22.
[11] Hilde De Weerdt, “The Uses of Digital Philology in Tang-Song History - Part 1,” MARKUS Forum: Research Blogs, January 14, 2017,
[12] Donald Sturgeon, “Digitizing Premodern Text with the Chinese Text Project,” Journal of Chinese History, vol. 4, no. 2, 2020, pp. 486-498, DOI: 10.1017/jch.2020.19.
[13] Hilde De Weerdt, Brent Ho, Information, Territory, and Networks: The Crisis and Maintenance of Empire in Song China, accompanying data and visualization site, 2015.