CORAL
Project description
Statutory requirements pertaining to language model training data and the transparency of generated texts must be fulfilled when putting language models to practical use in professional contexts. The CORAL project addresses these challenges and researches methods for constructing and using language models under legal, technical and qualitative constraints. To this end, we are systematically investigating constrained training methods for large language models (LLMs) and retrieval-augmented text generation.
The research carried out in CORAL is based on data from exclusive sources provided by our project partners. These include the German National Library's digital holdings, web crawls from the Internet Archive and Common Crawl that go back several decades and encompass petabytes of data, long-running news crawls from Wortschatz Leipzig with a focus on European languages, and proprietary data from our associated partners in the finance sector.
We are examining to what extent these data can be lawfully used in obfuscated form to train language models, and to what extent they can be obfuscated while still yielding useful models. This will provide valuable insights for companies and public authorities whose data have so far been unavailable for AI applications due to similar constraints.
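To make the idea of obfuscated training data concrete, the following minimal sketch shows one naive approach: consistently pseudonymising known identifiers (names, e-mail addresses) before a corpus is used for training. It is purely illustrative; CORAL's actual obfuscation methods are the subject of the research and are not specified here, and the function and placeholder names are hypothetical.

```python
# Illustrative sketch only: one naive way to obfuscate a text corpus before
# training, by replacing known identifiers with stable placeholder tokens.
import re
from itertools import count

def pseudonymise(texts, identifiers):
    """Replace each known identifier with a consistent placeholder token."""
    counter = count(1)
    mapping = {ident: f"[ENTITY_{next(counter)}]" for ident in identifiers}
    # Also mask anything that looks like an e-mail address.
    email = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
    masked = []
    for text in texts:
        for ident, placeholder in mapping.items():
            text = text.replace(ident, placeholder)
        masked.append(email.sub("[EMAIL]", text))
    return masked, mapping

corpus = ["Erika Mustermann (erika@example.org) signed the contract."]
masked, mapping = pseudonymise(corpus, identifiers=["Erika Mustermann"])
print(masked[0])  # "[ENTITY_1] ([EMAIL]) signed the contract."
```

Keeping the identifier-to-placeholder mapping consistent across the corpus is what preserves some of the text's statistical usefulness for training while withholding the protected values themselves.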
As a rule, language models used in professional applications should only generate texts that serve their intended purpose, are not fabricated ("hallucinated"), substantiate their claims with references to sources, do not breach copyright and are not plagiarised. To date, these requirements cannot be enforced as constraints during language model training. Numerous cases reported in (social) media show that, despite measures such as model alignment or guardrail instructions in prompts, users regularly find ways to repurpose the models or to extract protected information such as training data. We address these problems by investigating new methods of retrieval-augmented generation (RAG), firstly to enrich generated texts with expert knowledge from various modalities and secondly to guarantee that training data remain protected.
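The following sketch illustrates the generic RAG pattern referred to above: retrieve source passages for a query and constrain the generator to answer only from them, citing the sources. It is not CORAL's method; the retrieval uses naive word overlap, and the passage identifiers and texts are invented examples. The resulting prompt would be passed to a language model of your choice.

```python
# Illustrative sketch only: the generic retrieval-augmented generation pattern.
# Retrieval here is naive word overlap; the passages and identifiers are made up.
def retrieve(query, passages, k=2):
    """Rank passages by word overlap with the query and return the top k."""
    q = set(query.lower().split())
    scored = sorted(
        passages,
        key=lambda p: len(q & set(p["text"].lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query, sources):
    """Build a prompt that restricts the model to the retrieved sources."""
    context = "\n".join(f"[{s['id']}] {s['text']}" for s in sources)
    return (
        "Answer the question using only the sources below and cite them "
        f"by their identifiers.\n\nSources:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

passages = [
    {"id": "DNB-1", "text": "The German National Library collects all German-language publications."},
    {"id": "CC-7", "text": "Common Crawl publishes periodic web crawls spanning petabytes of data."},
]
query = "What does the German National Library collect?"
prompt = build_prompt(query, retrieve(query, passages))
print(prompt)  # pass this prompt to a language model of your choice
```

Grounding answers in retrieved, citable passages is what allows generated claims to be substantiated by sources, while keeping the protected material out of the model's trained parameters.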
Project framework
Funding body
Federal Ministry of Education and Research (BMBF)
Partners
- Institute for Applied Computer Science (InfAI) at Leipzig University
- University of Kassel
- Anhalt University of Applied Sciences
- German National Library
Duration
1 October 2024 to 30 September 2027
Contact
Dr. Peter Leinen
p.leinen@dnb.de
Philippe Genet
p.genet@dnb.de
Last updated:
4 February 2025