Project description

Statutory requirements pertaining to language model training data and the transparency of generated texts must be fulfilled when language models are put to practical use in professional contexts. The CORAL project addresses these challenges by researching methods for constructing and using language models that are subject to legal, technical and qualitative constraints. To this end, we are systematically investigating constrained training methods for large language models (LLMs) and the retrieval-augmented generation of texts.

The research in CORAL is based on data obtained from exclusive sources provided by our project partners. These include the German National Library's digital holdings; web crawls from the Internet Archive and Common Crawl, which go back several decades and encompass petabytes of data; news crawls from Wortschatz Leipzig, which reach far back in time and focus on European languages; and proprietary data from our associates in the field of finance.

We are examining to what extent these data can be lawfully used in obfuscated form for training language models, and how strongly the data can be obfuscated while still allowing useful language models to be constructed. This will yield extensive knowledge for companies and authorities whose data have hitherto been unavailable for use in AI applications due to similar constraints.

As a rule, language models used in professional applications should only generate texts which serve their intended purpose, are not fictitious ("hallucinated"), contain claims substantiated by references to sources, do not breach copyright and are not plagiarised. Until now, it has not been possible to enforce these requirements adequately as constraints when training language models. Numerous cases reported in the news and on social media show that, despite measures such as language model alignment or guardrail instructions in prompts, users regularly find ways to use the models for other purposes or to make them disclose protected information such as training data. We are addressing these problems by investigating new methods of retrieval-augmented generation (RAG), firstly to enrich texts with expertise from various modalities and secondly to guarantee that training data are protected.

Project framework

Funding body

Federal Ministry of Education and Research (BMBF)

Partners

  • Institute for Applied Computer Science (InfAI) at Leipzig University
  • University of Kassel
  • Anhalt University of Applied Sciences
  • German National Library

Duration

1 October 2024 to 30 September 2027

Contact

Dr. Peter Leinen
p.leinen@dnb.de

Philippe Genet
p.genet@dnb.de

Last changes: 04.02.2025
