A collection of historical and named-entity lexica for Bulgarian, Czech, Dutch, English, French, German, Polish, Slovene, Spanish and Latin.
IMPACT Ground Truth and Image Dataset
More than half a million representative text-based images compiled by a number of major European libraries.
Dataset of ICDAR 2019 Competition on Post-OCR Text Correction
The corpus comprises 22M OCRed characters along with the corresponding Gold Standard (GS).