- Automated collation of literary and historical texts
- Data Scopes
- Text as a Graph
- Culturally Aware AI
- Global Apple Pie
Collation, an important step in scholarly editing, involves comparing two or more versions of a work. The aim of this project is twofold: to reflect on the collation process, where we deal with questions such as “how do we define ‘textual variation’?” and “what are the methodological ramifications of automating this important step in text research?”, and to elaborate these and similar research questions in theoretical contributions and in (prototype) collation tools. These tools include CollateX, which was developed in 2011 and is used by many international edition projects.
In addition, we are working on the further development of the advanced collation software HyperCollate, in which certain information in the markup is used to include variation within a single text witness (such as deletions or additions) in the text comparison. In this way, researchers gain a better and more detailed understanding of the textual variation. An important part of this project is investigating how collation software can process longer texts, and how to ensure that the order in which text witnesses are entered does not affect the collation result. These are two complex problems that researchers have been grappling with for some time, and solving them will be an important step for the field.
Finally, the project deals with the challenge: how do you visualize such a detailed collation result?
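The core idea of collation — aligning two witnesses token by token and recording where they agree and where they vary — can be illustrated with a small sketch. This is our own toy example using Python's standard-library sequence matcher, not the alignment algorithms actually used by CollateX or HyperCollate, which are considerably more sophisticated:

```python
from difflib import SequenceMatcher

def collate_pair(witness_a, witness_b):
    """Align two witnesses token by token and report variation.

    A toy illustration of pairwise collation; real collation tools
    use far more sophisticated alignment algorithms.
    """
    a, b = witness_a.split(), witness_b.split()
    table = []
    for op, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        table.append((op, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return table

variants = collate_pair(
    "the quick brown fox jumps over the dog",
    "the quick red fox leaps over the dog",
)
for op, tok_a, tok_b in variants:
    print(f"{op:8} A: {tok_a!r:20} B: {tok_b!r}")
```

Even this toy version surfaces the questions the project reflects on: is `brown`/`red` one variant or two, and what happens once a witness itself contains deletions and additions?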
Contact: Elli Bleeker & Ronald Haentjens Dekker
Many large digital text collections and computational tools that are available online today allow humanities scholars to address a wide array of research questions. This often involves many data transformations: gathering and selecting documents, extracting and modelling the relevant data in them, cleaning and normalising this data, and linking dispersed information both within the resource under study and to external resources. All these transformations change the nature of the data and require interpretation to make informed choices. We are developing Data Scopes as an instrument to make this transformation process transparent and explicitly linked to the research questions.
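A transformation chain like the one described above can be made transparent by logging every step alongside the research question it serves. The following is a minimal sketch of that idea — our own illustration, not the actual Data Scopes software (class and field names are invented):

```python
import json

class DataScope:
    """Minimal sketch of recording each data transformation so the path
    from raw data to research result stays transparent and inspectable."""

    def __init__(self, research_question):
        self.research_question = research_question
        self.data = None
        self.log = []

    def step(self, description, func):
        """Apply one transformation and record what it did to the data."""
        self.data = func(self.data)
        self.log.append({"step": description, "records": len(self.data)})
        return self

scope = DataScope("How are wages reported in job advertisements?")
scope.data = ["  Gevraagd: DIENSTBODE, f 120  ", "Te koop: piano", ""]
scope.step("normalise whitespace and case", lambda d: [s.strip().lower() for s in d])
scope.step("drop empty records", lambda d: [s for s in d if s])
scope.step("select job advertisements", lambda d: [s for s in d if "gevraagd" in s])
print(json.dumps(scope.log, indent=2))
```

The log makes explicit which interpretive choices (e.g. how a "job advertisement" was operationalised) shaped the final dataset.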
The Data Scopes instrument has several components:
With computational humanities research, the research process has changed, but how is this reflected in how we publish our research? A traditional historical narrative focuses on how connecting the evidence from dispersed documents gives us new perspectives and insights, but leaves out the many data processing steps that were taken. One question Data Scopes aims to address is how we can publish digital research in multiple layers connecting the narrative with the data and processing.
Each corpus has its own characteristics, which determine what kinds of information we can extract from it and how we can organise these into layers of annotations:
- How do the characteristics of a corpus translate to the types of information that can be used to organise and give access to the corpus?
- How do we translate potential research questions into relevant annotation layers and data queries for a corpus?
- How do we give insight into what kinds of layers of annotations and research questions are possible given the data?
Different amounts of data require different ways of accessing and processing them.
- How do the needs for data elaboration change with different amounts of data?
- Which methods are effective when we have dozens of documents, or if we have thousands or millions of documents? How do we translate qualitative methods for small amounts of data to quantitative methods for larger amounts of data? How do we translate methods that rely on statistics derived from hundreds of thousands of documents to a corpus that only has a few thousand documents?
- How can we connect micro and macro perspectives, at what scale do we switch perspectives, and why? Where are the transitions in relevant aspects and patterns?
- How can we iteratively zoom in and out to switch between micro and macro perspectives and between qualitative and quantitative methods?
To develop and test this instrument, we apply it to a number of projects, including REPUBLIC (a corpus of political historical documents like meeting notes, decisions, Ordonnances), Migrant: Mobilities and Connection (which includes institutions, policy archives, card indexes, registers and (distributed) databases), and Impact of Fiction (focusing on book reviews and discussions).
Contact: Marijn Koolen & Rik Hoekstra
Computers have become indispensable for the storage, dissemination, and representation of literary and historical sources. What is more, computers can be used not only to assist our research practice, but also as research instruments in and by themselves. This means we have to represent texts in a way that is computationally processable. This entails thinking of text in radically different ways: text can be represented as a string of characters, as a tree, or as a graph. The project TAG, short for Text-As-Graph, examines the editing of literary and historical texts in a hypergraph data model. On a conceptual level, the graph model is interesting because it means letting go of the idea of a hierarchy and thus thinking without constraint about how we want to model our text as well as our knowledge of it. On an implementation level, we examine how the graph can best be used by the scholarly editor for it to become a proper research instrument.
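The difference between a tree and a graph model becomes concrete with overlapping hierarchies, such as verse lines and sentences that do not nest. The sketch below is our own illustration of the general idea of text as a hypergraph — token nodes plus markup hyperedges — not TAG's actual data model:

```python
# A minimal sketch of text as a hypergraph: the text is a sequence of
# token nodes, and each markup element is a hyperedge over a set of
# tokens. (Our own illustration, not TAG's actual data model.)
tokens = ["The", "sun", "sets.", "Night", "falls", "swiftly."]

# Two hierarchies that no single tree can hold at once:
hyperedges = {
    ("line", 1): {0, 1, 2, 3},        # verse line 1: "The sun sets. Night"
    ("line", 2): {4, 5},              # verse line 2: "falls swiftly."
    ("sentence", 1): {0, 1, 2},       # "The sun sets."
    ("sentence", 2): {3, 4, 5},       # "Night falls swiftly." crosses the line break
}

def markup_for(token_index):
    """All markup elements covering a given token."""
    return [name for name, span in hyperedges.items() if token_index in span]

# Token 3 ("Night") belongs to line 1 *and* sentence 2 - an overlap a
# strict XML tree would have to break up with milestone elements.
print(markup_for(3))
```

Because neither hierarchy is privileged, the editor can query the text through whichever perspective suits the research question.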
Contact: Elli Bleeker & Ronald Haentjens Dekker
Odeuropa will apply state-of-the-art AI techniques to cultural heritage text and image datasets spanning four centuries of European history, to identify and trace how ‘smell’ was expressed in different languages, with what places it was associated, what kinds of events and practices it characterised, and to what emotions it was linked. This multi-modal information will be curated, following semantic web standards, stored in the ‘European Olfactory Knowledge Graph’ (EOKG), and then drawn on to create new ‘storylines’ informed by cultural history research. The storyline resources will be prepared in different formats for different audiences: as an online ‘Encyclopaedia of European Smell Heritage’, as ‘interactive notebook’ demonstrators, and in the form of toolkits and training documentation describing best practices in olfactory museology. We will develop new, evidence-based methodologies to quantify the impact of multisensory visitor engagement, and use this data to support the implementation of policy recommendations for the recognition, promotion, presentation and safeguarding of our olfactory heritage.
DHLab involvement: Marieke van Erp is involved as project manager and co-developer of smell reference language and semantic web technology.
The AI:CULT project addresses the gap between AI and our digital cultural heritage. Cultural heritage data is rarely objective data. The very reasons for certain heritage data to be preserved, its interpretation throughout time, and the way heritage data is accessed after digitisation are all subject to strong biases. The inherent richness, subjectivity and polyvocal nature of cultural heritage data limits and often even rules out the responsible use of AI. How do we model that “Seventeenth Century” and “The Golden Age” refer to the same era, yet are not fully synonymous and carry different semantic payloads? Current state-of-the-art AI cannot deal with these subtleties in a way that does justice to the important role of the heritage institute as a trusted source of information. Thus, the heritage sector is at risk of being left out of the current global success of AI. AI:CULT will allow heritage institutes to use AI in ways that align with their role in society: transparent, inclusive, and keeping the user in control.
The project addresses two case studies with societal parties tasked with providing access to national heritage, and who have voiced their vested interest in using AI for their workflows: the National Library (KB) and the Institute for Sound and Vision (NISV): (i) automatically analysing and enriching object-level descriptions and (ii) creating data stories and narratives from raw collection data. Both institutions acknowledge that the straightforward application of AI reflects biases present in the training data. In the AI:CULT project, bias detection and filtering methods will be developed and tested directly in the heritage institutions’ daily practice.
DHLab involvement: Marieke van Erp is one of the work package leaders and Ryan Brate is a PhD student on this project.
Funded by The Dutch Research Council (NWO).
Global Apple Pie investigates the relationship between sugar imports and exports, recipes, and opinions on health.
This project is a collaboration across HuC institutes with Ulbe Bosma (IISG) and Rebeca Ibañez-Martín (Meertens Institute).
Involvement DHLab: Marieke van Erp
The main aim of this Action is to promote synergies across Europe between linguists, computer scientists, terminologists, and other stakeholders in industry and society, in order to investigate and extend the area of linguistic data science. We understand linguistic data science as a subfield of the emerging “data science”, which focuses on the systematic analysis and study of the structure and properties of data at a large scale, along with methods and techniques to extract new knowledge and insights from it. Linguistic data science is a specific case, which is concerned with providing a formal basis to the analysis, representation, integration and exploitation of language data (syntax, morphology, lexicon, etc.). In fact, the specificities of linguistic data are an aspect largely unexplored so far in a big data context.
In order to support the study of linguistic data science in the most efficient and productive way, the construction of a mature holistic ecosystem of multilingual and semantically interoperable linguistic data is required at Web scale. Such an ecosystem, unavailable today, is needed to foster the systematic cross-lingual discovery, exploration, exploitation, extension, curation and quality control of linguistic data. We argue that linked data (LD) technologies, in combination with natural language processing (NLP) techniques and multilingual language resources (LRs) (bilingual dictionaries, multilingual corpora, terminologies, etc.), have the potential to enable such an ecosystem that will allow for transparent information flow across linguistic data sources in multiple languages, by addressing the semantic interoperability problem.
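The cross-lingual discovery that such an ecosystem enables can be sketched in miniature: lexical entries from different languages are published as subject–predicate–object triples and linked through shared concepts, so a query can cross language boundaries. The URIs and property names below are invented for illustration; real linguistic linked data would use vocabularies such as OntoLex-Lemon and an actual triple store:

```python
# Toy linked-data sketch: bilingual lexical entries as triples,
# linked through a shared concept. (Identifiers are invented for
# illustration; not a real vocabulary.)
triples = {
    ("lex:en_house", "writtenRep", "house"),
    ("lex:en_house", "language", "en"),
    ("lex:nl_huis", "writtenRep", "huis"),
    ("lex:nl_huis", "language", "nl"),
    ("lex:en_house", "evokes", "concept:dwelling"),
    ("lex:nl_huis", "evokes", "concept:dwelling"),
}

def translations(word, target_lang):
    """Find entries in target_lang evoking the same concept as `word`."""
    entries = {s for s, p, o in triples if p == "writtenRep" and o == word}
    concepts = {o for s, p, o in triples if p == "evokes" and s in entries}
    linked = {s for s, p, o in triples if p == "evokes" and o in concepts}
    in_lang = {s for s, p, o in triples if p == "language" and o == target_lang}
    return {o for s, p, o in triples
            if p == "writtenRep" and s in (linked & in_lang)}

print(translations("house", "nl"))
```

Because the link runs through the concept rather than through a direct translation pair, adding a third language requires only new entries, not new pairwise mappings — the semantic interoperability the Action argues for.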
- Communicating the environmental impact of plant based recipes
- SABIO: The SociAl BIas Observatory
- Mining Wages in Nineteenth Century Newspaper Job Advertisements
- CLARIAH Amsterdam Time Machine
Plant-based diets are becoming popular across Europe but are not yet mainstream. If we are to shift diets across Europe, the recipes on offer must be appealing to consumers, and there must be evidence that dietary changes will make a difference. This research produces a tool that calculates the calories of plant-based recipes, as well as their biodiversity, economic, and climate benefits. This will increase food-climate awareness in consumers, offering them a means to investigate tradeoffs and integrate sustainable healthy food into different European food cultures. Our targeted outputs will inform consumers, food professionals and policy makers. Outputs include:
- Functional environmental NLP tool for recipe analysis
- Database of the GHGE (greenhouse gas emissions), cost, biodiversity, water use, and land use of European plant-based recipes (EUPBR)
- Academic publication: analysis of the sustainability of EUPBR
- Consumer/chef guidance on how to adapt EUPBR to be more sustainable (workshop output)
- Summary of findings for policy makers/nutrition professionals (report/webinar)
Funded by: Alpro Foundation
Duration: January – December 2021
The SociAl BIas Observatory (SABIO) project is aimed at investigating bias in the digital collections of the members of the Dutch Digital Heritage Network. In this project, we investigate how collection managers and curators create and add metadata to collection objects, and how bias in these metadata can be detected using statistical models. We aim to create a knowledge graph on top of existing collection databases that makes prejudices and imbalances in the data explicit such that they can be addressed, as well as taken into account by users of the data.
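One simple way to make imbalances in collection metadata explicit is to compare how often a descriptor is used for one group of objects against its frequency in the collection as a whole, and flag large deviations for a curator to review. The sketch below is our own toy illustration of that idea, not SABIO's actual statistical models:

```python
from collections import Counter

# Toy metadata: each object has a provenance group and descriptor terms.
metadata = {
    "obj1": {"group": "A", "terms": ["exotic", "mask"]},
    "obj2": {"group": "A", "terms": ["exotic", "statue"]},
    "obj3": {"group": "B", "terms": ["portrait", "statue"]},
    "obj4": {"group": "B", "terms": ["portrait", "mask"]},
}

def overrepresented(group, threshold=1.5):
    """Terms whose relative frequency in `group` exceeds their
    collection-wide relative frequency by at least `threshold`."""
    in_group = Counter(t for m in metadata.values()
                       if m["group"] == group for t in m["terms"])
    overall = Counter(t for m in metadata.values() for t in m["terms"])
    n_group, n_all = sum(in_group.values()), sum(overall.values())
    ratios = {t: (c / n_group) / (overall[t] / n_all)
              for t, c in in_group.items()}
    return {t: round(r, 2) for t, r in ratios.items() if r >= threshold}

print(overrepresented("A"))
```

A knowledge graph layered over the collection database could then record such flagged terms so that users of the data see the imbalance rather than silently inheriting it.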
DHLab involvement: Marieke van Erp is the principal investigator on this project, Valentin Vogelmann is involved as a researcher.
Funded by the Dutch Digital Heritage Network.
This project investigates scene detection to enrich a historical press photo collection.
DHLab involvement: Melvin Wevers (main applicant)
Funded by NWO. More info: https://www.nwo.nl/en/research-and-results/research-projects/i/33/34433.html
Newspaper advertisements contain valuable information on many socio-economic historical developments. In Digital History research, advertisements are mostly used to study goods, products and consumer society. Advertisements, however, were not only used to sell, but also to ask. Job advertisements featured frequently in nineteenth-century newspapers. This project aims to computationally extract job advertisements from the digitized nineteenth-century newspapers provided by the Royal Dutch Library. The goal of this project is to aggregate the wages mentioned in these advertisements to gain better insight into the economic development of ‘keukenmeiden’ (kitchen maids) and ‘dienstbodes’ (domestic servants).
Involvement DHLab: Ruben Ros, Marieke van Erp
May 2017 – December 2019
Historically, some animals have been perceived as threats by humans. These species were believed to carry diseases or harm crops and farm animals. SERPENS and its ATHENA extension aimed to study the historical impact of pest and nuisance species on human practices and changes in the public perception of these animals.
Involvement DHLab: Marieke van Erp
October 2017 – March 2019
Although oral history and the study of ego documents both value individual perspectives on history and its meaning, these research fields tend to operate separately. EviDENce explores new ways of analysing and contextualising historical sources by applying event modelling and semantic web technologies.
Involvement DHLab: Marieke van Erp
April 2018 – January 2019
The Amsterdam Time Machine (ATM) is a research and development platform on the history of Amsterdam.
DHLab member involved: Marieke van Erp