Paul Fogel, Technical Lead of the Mass Digitisation team at the California Digital Library (CDL), presented CDL’s experiences and challenges in extracting document text with OCR. Fogel emphasised the obstacles that bad OCR poses during the mass indexing and digitisation of cultural records. Marginalia, image-text misinterpretations and fonts, together with limited resources, the wide range of languages (400, with OCR dictionaries available for only 20 of them) and disciplines, and the project’s large indexing scale, make ranking results and using them extremely difficult. Finally, Fogel echoed Antonacopoulos and stressed the need for high-quality images to ensure the best indexing and query results.
View the presentation here:
[slideshare id=9857182&doc=pfogelimpactpresentation-111024083201-phpapp02]
and the video here:
http://www.vimeo.com/32006947