A collection of historical and named-entity lexica for Bulgarian, Czech, Dutch, English, French, German, Polish, Slovene, Spanish and Latin.
IMPACT Ground Truth and Image Dataset
More than half a million representative text-based images compiled by a number of major European libraries.
Dataset of ICDAR 2019 Competition on Post-OCR Text Correction
The corpus comprises 22M OCRed characters together with the corresponding Gold Standard (GS).
BVMC Linked Open Data
The catalogue of the Biblioteca Virtual Miguel de Cervantes contains about 200,000 records, which were originally created in compliance with the MARC21 standard.