Claus Gravenhorst from CCS presented a case study by the Koninklijke Bibliotheek (KB) and Content Conversion Specialists GmbH (CCS) examining whether the new FineReader Engine and Dutch lexicon increase OCR accuracy and production efficiency.
CCS were interested in working with IMPACT as they were aware of the use of 9 languages and felt they could benefit from technological improvements in the area of OCR.
Claus reiterated that various pre- and post-processing steps, as well as image quality, can affect accuracy. He explained that the test material they had chosen was 17th-century Dutch newspapers from a DDD database. A typical page would have two colours and gothic fonts and
The test system used was docWorks, which was developed during the EU FP5 project METAe (in which ABBYY was involved). The system has previously been used for small-, mid- and large-scale projects. The workflow covered item tracking from the shelf, through scanning, and back to the shelf, including QA etc. This system was used to integrate the IMPACT tools. There was very little pre-processing as the focus was the OCR. Zones were classified and then passed to the OCR engine. At the end, analysis was carried out to understand the structure of the page.
The IMPACT tools used were ABBYY FineReader Engine 10 and external dictionaries, applied to the DDD material. The goal was to generate statistical data for character and word accuracy across all 4 test runs. An improvement was shown between FineReader 9 and FineReader 10, and the biggest improvement was shown when using the dictionaries. There was a 20.6% word accuracy improvement when using IMPACT tools. In layperson's terms this means that if you previously had to correct 100 words, with the IMPACT tools you would only have to correct about 80. Claus showed some screenshots of the docWorks text correction mode.
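The layperson's arithmetic above can be sketched in a few lines of Python. This is only an illustration, not the evaluation method used in the case study: it assumes the 20.6% figure can be read as a relative reduction in the number of words needing correction, which is how the 100-to-80 example interprets it. The function names are hypothetical, and the only figures taken from the talk are 100 words and 20.6%.

```python
def word_accuracy(total_words: int, wrong_words: int) -> float:
    """Share of words recognised correctly (0.0 to 1.0)."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    if not 0 <= wrong_words <= total_words:
        raise ValueError("wrong_words must be between 0 and total_words")
    return (total_words - wrong_words) / total_words


def remaining_corrections(baseline_corrections: int, relative_reduction: float) -> float:
    """Words still needing correction after a relative reduction.

    Assumption: relative_reduction is the fraction of previously wrong
    words that no longer need correcting (0.206 for the 20.6% figure).
    """
    if baseline_corrections < 0:
        raise ValueError("baseline_corrections must not be negative")
    if not 0.0 <= relative_reduction <= 1.0:
        raise ValueError("relative_reduction must be between 0 and 1")
    return baseline_corrections * (1.0 - relative_reduction)


if __name__ == "__main__":
    # 100 words to correct before, 20.6% fewer afterwards.
    print(round(remaining_corrections(100, 0.206), 1))  # 79.4, i.e. roughly 80
```

Under this reading, 100 corrections shrink to about 79, which the talk rounded to 80.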
To conclude, Claus explained that ABBYY OCR and historical dictionaries enable higher text accuracy and lower the correction effort.
View the presentation here:
and video here:
http://www.vimeo.com/31999737