German Lexicon

ABSTRACT

The German historical corpus consists of 510 texts of varying length and genre. It contains 3,552,690 tokens (words in running text) and 369,730 types (unique words) in total. The texts date from 1350 to 1950, so the corpus contains material from both the Early New High German period (1350-1650) and the New High German period (since 1650), covering all subperiods.

The IR lexicon of historical German has been built with the LeXtractor tool developed by LMU. So far, 22,800 non-modern entries with attestations in the available corpus material have been created. The lexicon contains 20,700 different historical strings, which means that attestations can be found for approximately 1.1 different readings per string. In total, 36,800 readings have been manually marked as feasible, but 14,000 of them could not be verified in the corpus. Of all 36,800 processed readings, 31,700 are pattern-based and 5,100 are “irregular”. These 36,800 readings point to 19,200 lemmata.
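The relationship between these figures can be checked with a few lines of arithmetic. The sketch below is illustrative only: the variable names are invented here, and it assumes that the 22,800 attested entries are the 36,800 feasible readings minus the 14,000 that could not be verified, which the rounded figures are consistent with but which the text does not state outright.

```python
# Figures quoted in the paragraph above (rounded to the nearest hundred there).
entries_attested = 22_800      # non-modern entries with corpus attestations
historical_strings = 20_700    # distinct historical strings in the lexicon
readings_total = 36_800        # readings manually marked as feasible
readings_unverified = 14_000   # feasible readings not verified in the corpus
readings_pattern = 31_700      # pattern-based readings
readings_irregular = 5_100     # "irregular" readings

# Assumption: attested readings = feasible readings minus unverified ones.
readings_attested = readings_total - readings_unverified

# Average number of attested readings per historical string.
readings_per_string = entries_attested / historical_strings

print(readings_attested)                                       # 22800
print(round(readings_per_string, 2))                           # 1.1
print(readings_pattern + readings_irregular == readings_total) # True
```

Under that assumption the numbers agree: the attested readings match the entry count, and dividing them by the number of historical strings gives the stated average of roughly 1.1 readings per string.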

Parts of the historical corpus underlying the lexicon were built in collaboration with the Bavarian State Library.

The Core Named Entities Lexicon for German is a set of named entities (historical German locations, person names and organisations) which are likely to appear in a wide variety of texts, with extensions specific to text types targeted by IMPACT according to scope information provided by the ONB (Austrian National Library) and BSB (Bavarian State Library). It can be used as a lexicon for OCR and for query expansion in retrieval.

PUBLICATIONS

PRODUCED BY

Centrum für Informations- und Sprachverarbeitung (CIS), University of Munich

LICENSING

The license for the Core Named Entities Lexicon for German has not yet been decided.

For further information, please contact the LMU IMPACT Group.

DOWNLOAD