Jesse De Does from the INL gave a brief but rich presentation on the evaluation of lexicon-supported OCR and the project's recent improvements. The lexica are evaluated using the FineReader SDK 10. In short, the software runs OCR with a default built-in dictionary and, for each word or fuzzy set, produces a number of alternative readings and segmentations; it is then up to the user to manually select the most suitable or probable option. Lexica, however, may contain errors, and the fuzzy sets created by FineReader may be too small (we will never have all spelling variants or compounds). A number of measures are therefore taken to increase performance, even if only by small percentages: measuring word recall, cleaning the dictionaries, and incorporating historical dictionaries (see the sketch below).
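To give a sense of what measuring word recall involves, here is a minimal sketch in Python, assuming a plain-text lexicon with one entry per line and a whitespace-tokenised ground-truth file; the file names and tokenisation choices are hypothetical, not taken from the presentation.

```python
# Hypothetical sketch: measuring the word recall of a lexicon against
# ground-truth tokens from a test corpus. Nothing here is taken from the
# IMPACT project's code; file names and details are illustrative only.

def load_lexicon(path):
    """Read one lexicon entry per line into a set for fast lookup."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def word_recall(lexicon, ground_truth_tokens):
    """Fraction of ground-truth word tokens covered by the lexicon.

    Low recall means many correct words would be treated as OCR errors,
    which is why dictionary cleaning and adding historical spelling
    variants matter.
    """
    tokens = [t.lower() for t in ground_truth_tokens if t.isalpha()]
    if not tokens:
        return 0.0
    covered = sum(1 for t in tokens if t in lexicon)
    return covered / len(tokens)

if __name__ == "__main__":
    lexicon = load_lexicon("historical_lexicon.txt")  # hypothetical file
    with open("ground_truth.txt", encoding="utf-8") as f:
        tokens = f.read().split()
    print(f"word recall: {word_recall(lexicon, tokens):.1%}")
```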
The languages analysed and improved so far are Bulgarian (the only language analysed that is not written in a Latin-based script), Czech, English (initially, and mistakenly, thought to be a no-brainer), French (good improvements overall), German (with progress mainly in 16th-century material), Polish, Slovene and Spanish. The use of historical lexica has yielded overall improvements of 10% to 36%.
Finally, De Does mentioned experiments undertaken to evaluate the impact on information retrieval (IR). While a more complete evaluation is coming soon, performance for English and Spanish has been measured using lemmatisation based on modern lexica (e.g. the OED for English).
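To make the idea of lemmatisation-based retrieval concrete, here is a minimal sketch, assuming a form-to-lemma lexicon is available; the toy mapping and documents below are purely illustrative and not drawn from the experiments described.

```python
# Hypothetical sketch of lemma-based matching for IR: a lemmatisation
# lexicon maps word forms (including historical variants) to modern
# lemmas, so a query on a lemma retrieves all of its attested forms.
from collections import defaultdict

# form -> lemma, as a modern lexicon with lemmatisation might provide
FORM_TO_LEMMA = {
    "goe": "go", "goeth": "go", "went": "go", "going": "go",
    "hath": "have", "has": "have", "had": "have",
}

def build_index(documents):
    """Index documents by lemma rather than by surface form."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            lemma = FORM_TO_LEMMA.get(token, token)
            index[lemma].add(doc_id)
    return index

def search(index, query_term):
    """Look up a query term via its lemma."""
    lemma = FORM_TO_LEMMA.get(query_term.lower(), query_term.lower())
    return sorted(index.get(lemma, set()))

docs = {1: "he goeth forth", 2: "she hath gone", 3: "they went home"}
index = build_index(docs)
print(search(index, "go"))  # -> [1, 3]; "gone" is missing from the toy mapping
```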
View the presentation here:
[slideshare id=9875650&doc=jessededoes-111025105528-phpapp02]
and the video here:
http://www.vimeo.com/32505231