Stefan Pletschacher (University of Salford) began by presenting an overview of the digitisation workflow and the issues at each stage. These stages are usually given as scanning (whatever goes wrong here is hard to recover later), image enhancement, layout analysis, OCR and post-processing. He explained that you should evaluate at each step but should also consider the workflow as a whole. To carry out performance evaluation you need to begin with images that are representative of those you will actually be processing; you then OCR them and compare the results against ground truth.
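To give a concrete feel for that comparison step, character accuracy is commonly computed as the edit distance between the OCR output and the ground-truth text, normalised by the length of the ground truth. The sketch below is purely illustrative (the function names are ours, not IMPACT's):

```python
# Minimal sketch of character-level OCR accuracy: edit distance between
# OCR output and ground-truth text, normalised by ground-truth length.
# Illustrative only; not IMPACT project code.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def character_accuracy(ocr_text: str, ground_truth: str) -> float:
    """1.0 means a perfect match; very poor output can go negative."""
    errors = edit_distance(ocr_text, ground_truth)
    return 1.0 - errors / max(len(ground_truth), 1)

print(character_accuracy("Tbe quick hrown fox", "The quick brown fox"))
# -> 0.894...  (2 character errors over 19 ground-truth characters)
```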
There then followed an explanation of the concept of ground truth: it is not just the final text but also includes other aspects, such as the images it maps to. Stefan explained that to produce good ground truth you really need several tools; you cannot, for example, use ALTO to capture certain aspects of character formation. The IMPACT ground truths have been produced using Aletheia, now a fairly mature tool, which allows creation of information on page borders, print space, layout regions, text lines, words, glyphs, Unicode text, reading order, layers etc. Ground truth is more than just text: it can also capture processing steps such as deskewing, dewarping, border removal and binarisation. He suggested that institutions think through their own scenarios so they can decide which aspects of OCR and which workflow matter to them.
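Aletheia records this layered ground truth in PAGE XML, where regions, text lines, words and glyphs are nested elements carrying both coordinates and Unicode text. As a rough illustration of how such a file can be inspected (the file name is hypothetical; the element names follow the PAGE schema, whose namespace is versioned, so we read it from the file itself):

```python
# Sketch of walking a PAGE XML ground-truth file such as Aletheia
# produces. File name is hypothetical; namespace is taken from the
# document because PAGE namespaces are versioned.
import xml.etree.ElementTree as ET

tree = ET.parse("newspaper_page_0042.xml")   # hypothetical example file
root = tree.getroot()
ns = {"pc": root.tag.split("}")[0].strip("{")}

for region in root.iter(f"{{{ns['pc']}}}TextRegion"):
    coords = region.find("pc:Coords", ns)
    if coords is not None:
        # Newer PAGE versions store the outline in a "points" attribute;
        # older versions use child <Point> elements instead.
        print("Region outline:", coords.get("points"))
    for line in region.findall("pc:TextLine", ns):
        text = line.find("pc:TextEquiv/pc:Unicode", ns)
        if text is not None:
            print("  Line text:", text.text)
```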
Stefan also gave an introduction to the IMPACT image repository, where all the images and metadata have been collected and shared. The repository has allowed central management of metadata, images and ground truth, and it is searchable, so you can filter on particular attributes of the images.
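To give a feel for that kind of attribute filtering, here is a small sketch over invented metadata records; the field names and values are ours for illustration, not the repository's actual schema or interface:

```python
# Illustrative attribute filtering over image metadata records.
# Field names and records are invented for this example.
records = [
    {"id": "img-0001", "library": "British Library", "language": "en",
     "year": 1823, "has_ground_truth": True},
    {"id": "img-0417", "library": "Koninklijke Bibliotheek", "language": "nl",
     "year": 1901, "has_ground_truth": False},
]

def filter_images(records, **criteria):
    """Keep records whose metadata matches every given criterion."""
    return [r for r in records
            if all(r.get(k) == v for k, v in criteria.items())]

print(filter_images(records, language="en", has_ground_truth=True))
# -> [{'id': 'img-0001', ...}]
```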
Stefan finished his talk with an overview of the datasets available: approximately 667,120 images, comprising institutional datasets from 10 libraries (602,313 images) and demonstrator sets (56,141 images).
[slideshare id=9857419&doc=stefanpletschacher-111024084535-phpapp02]