Description
More than half a million representative text-based images compiled by a number of major European libraries. Covering texts from as early as 1500 and containing material from newspapers, books, pamphlets and typewritten notes, the dataset is an invaluable resource for future research into imaging technology, OCR and language enrichment. It includes ca. 50,000 ground-truth (GT) files.
Dataset content type
Groundtruth
Images
Metadata
Dataset scope
Layout analysis
Postcorrection
OCR
Language
Bulgarian
Czech
Dutch
English
French
German
Polish
Slovene
Spanish
Size
ca. 500,000 images and ca. 50,000 GT files
Dataset License
CC - Attribution NonCommercial NoDerivatives or equivalent
CC - Attribution NonCommercial ShareAlike or equivalent
CC - Attribution ShareAlike or equivalent
Public domain