CORAL
Project description
Statutory requirements pertaining to language model training data and the transparency of generated texts must be fulfilled when putting language models to practical use in professional contexts. The CORAL project addresses these challenges and researches methods for constructing and using language models under legal, technical and qualitative constraints. To this end, we are systematically investigating constrained training methods for large language models (LLMs) and retrieval-augmented text generation.
The research carried out in CORAL is based on data from exclusive sources provided by our project partners. These include the German National Library's digital holdings, web crawls from the Internet Archive and Common Crawl that go back several decades and encompass petabytes of data, long-running news crawls from Wortschatz Leipzig with a focus on European languages, and proprietary data from our associated partners in the finance sector.
We are examining to what extent these data can be lawfully used in obfuscated form to train language models, and to what extent they can be obfuscated while still yielding useful models. This will provide valuable insights for companies and public authorities whose data have so far been unavailable for AI applications due to similar constraints.
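To make the idea of obfuscated training data concrete, the following minimal sketch shows one naive approach: consistently pseudonymising known identifiers (names, e-mail addresses) before a corpus is used for training. It is purely illustrative; CORAL's actual obfuscation methods are the subject of the research and are not specified here, and the function and placeholder names are hypothetical.

```python
# Illustrative sketch only: one naive way to obfuscate a text corpus before
# training, by replacing known identifiers with stable placeholder tokens.
import re
from itertools import count

def pseudonymise(texts, identifiers):
    """Replace each known identifier with a consistent placeholder token."""
    counter = count(1)
    mapping = {ident: f"[ENTITY_{next(counter)}]" for ident in identifiers}
    # Also mask anything that looks like an e-mail address.
    email = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
    masked = []
    for text in texts:
        for ident, placeholder in mapping.items():
            text = text.replace(ident, placeholder)
        masked.append(email.sub("[EMAIL]", text))
    return masked, mapping

corpus = ["Erika Mustermann (erika@example.org) signed the contract."]
masked, mapping = pseudonymise(corpus, identifiers=["Erika Mustermann"])
print(masked[0])  # "[ENTITY_1] ([EMAIL]) signed the contract."
```

Keeping the identifier-to-placeholder mapping consistent across the corpus is what preserves some of the text's statistical usefulness for training while withholding the protected values themselves.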
As a rule, language models used in professional applications should only generate texts that serve their intended purpose, are not fabricated ("hallucinated"), substantiate their claims with references to sources, do not breach copyright and are not plagiarised. To date, these requirements cannot be enforced as constraints during language model training. Numerous cases reported in (social) media show that, despite measures such as model alignment or guardrail instructions in prompts, users regularly find ways to repurpose the models or to extract protected information such as training data. We address these problems by investigating new methods of retrieval-augmented generation (RAG), firstly to enrich generated texts with expert knowledge from various modalities and secondly to guarantee that training data remain protected.
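The following sketch illustrates the generic RAG pattern referred to above: retrieve source passages for a query and constrain the generator to answer only from them, citing the sources. It is not CORAL's method; the retrieval uses naive word overlap, and the passage identifiers and texts are invented examples. The resulting prompt would be passed to a language model of your choice.

```python
# Illustrative sketch only: the generic retrieval-augmented generation pattern.
# Retrieval here is naive word overlap; the passages and identifiers are made up.
def retrieve(query, passages, k=2):
    """Rank passages by word overlap with the query and return the top k."""
    q = set(query.lower().split())
    scored = sorted(
        passages,
        key=lambda p: len(q & set(p["text"].lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query, sources):
    """Build a prompt that restricts the model to the retrieved sources."""
    context = "\n".join(f"[{s['id']}] {s['text']}" for s in sources)
    return (
        "Answer the question using only the sources below and cite them "
        f"by their identifiers.\n\nSources:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

passages = [
    {"id": "DNB-1", "text": "The German National Library collects all German-language publications."},
    {"id": "CC-7", "text": "Common Crawl publishes periodic web crawls spanning petabytes of data."},
]
query = "What does the German National Library collect?"
prompt = build_prompt(query, retrieve(query, passages))
print(prompt)  # pass this prompt to a language model of your choice
```

Grounding answers in retrieved, citable passages is what allows generated claims to be substantiated by sources, while keeping the protected material out of the model's trained parameters.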
Project framework
Funding body
Federal Ministry of Education and Research (BMBF)
Partners
- Institute for Applied Computer Science (InfAI) at Leipzig University
- University of Kassel
- Anhalt University of Applied Sciences
- German National Library
Duration
1 October 2024 to 30 September 2027
Contact
Dr. Peter Leinen
p.leinen@dnb.de
Philippe Genet
p.genet@dnb.de
Last updated:
4 February 2025