Datasets - IMPACT Centre of Competence

IMPACT Language Resources

Impact Centre of Competence 14 September, 2023

A collection of historical and named-entity lexica for Bulgarian, Czech, Dutch, English, French, German, Polish, Slovene, Spanish and Latin.

IMPACT Ground Truth and Image Dataset

Impact Centre of Competence 13 September, 2023

More than half a million representative text-based images compiled by a number of major European libraries.

Dataset of ICDAR 2019 Competition on Post-OCR Text Correction

Impact Centre of Competence 13 September, 2023

The corpus comprises 22M OCRed characters along with the corresponding Gold Standard (GS).

Natural History Museum Lepidoptera

Impact Centre of Competence 13 September, 2023

This dataset contains scans of index cards from the UK’s Natural History Museum Lepidoptera index.

REID2017

Impact Centre of Competence 13 September, 2023

Example and evaluation dataset used for the ICDAR2017 Competition on Recognition of Early Indian printed Documents.

HBR2013

Impact Centre of Competence 13 September, 2023

Example and evaluation dataset used for the ICDAR2013 Competition on Historical Book Recognition.

GT4HistOCR: Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin

Impact Centre of Competence 13 September, 2023

GT4HistOCR contains ground truth for research in Optical Character Recognition (OCR) technology applied to historical printings in German Fraktur and Early Modern Latin.

CIS OCR Workshop v1.0: OCR and postcorrection of early printings for digital humanities

Impact Centre of Competence 13 September, 2023

The 2-day CIS OCR Workshop on “OCR and postcorrection of early printings for digital humanities” was originally held at LMU, Munich, on 14/15 September 2015 (see http://www.cis.lmu.de/ocrworkshop).