In the final months of the IMPACT project in 2012, the KB worked together with one of its current digitisation projects, Early Dutch Books Online (EDBO), to research various methods of improving OCR. EDBO is a joint effort of the KB and the university libraries of Leiden and Amsterdam to digitise 2 million pages of books from 1780 to 1800. After the digitisation process, the project decided to hire a number of students to manually correct certain books. This provided the ideal opportunity for IMPACT to tag along and have some of these students work with a number of IMPACT tools.
Through this pilot, we learned a lot about the tools: what they need as input, what they provide as output and how they should be used. The results gave us an idea of what we could achieve with each tool, but we all knew that we could only treat them as an indication. The differences between the tools and their methods were too great to compare them directly and base a decision on the outcome.
Goals
However, what we can say is this: when you want to improve your OCR, it is very important that you have a clear goal in mind. You should ask yourself at least the following questions: How much better should the OCR be? How much money would we like to spend? How much effort can we spare and from whom? Is improving the OCR the only goal or do we also have others in mind, such as crowdsourcing? The answers to these questions can lead to very different ideas about the OCR improvement project and, consequently, about the best tool for you.
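To make the question "How much better should the OCR be?" concrete, it helps to put a number on OCR quality before and after correction. The sketch below is only an illustration and not a tool from the pilot: it assumes you have a manually corrected transcription (ground truth) for a sample page and computes character accuracy from the edit distance between that and the OCR text. The function names and the sample strings are our own inventions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def character_accuracy(ocr_text: str, ground_truth: str) -> float:
    """Character accuracy between 0 and 1, relative to the ground truth."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ocr_text, ground_truth)
    return max(0.0, 1.0 - errors / len(ground_truth))


# Hypothetical sample: the long s misread as "f", a typical error in old prints.
ground_truth = "de stad Amsterdam"
ocr_text = "de ftad Amfterdam"
print(f"{character_accuracy(ocr_text, ground_truth):.1%}")  # prints 88.2%
```

Measuring the same sample pages before and after a correction round gives a rough idea of how far a tool brings you towards your target.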
When to use which tool?
To make it easier, we've divided our tested tools into three categories: Basic tools, Advanced tools and Re-OCRing.
Basic tools | Advanced tools | Re-OCRing
Alto Edit | LMU Profiler and Post correction tool | ABBYY FRE 10 with a historical Dutch dictionary
PlaIR platform | (not possible to test in this pilot, because of its setup) | Adaptive OCR
Basic tools
The Basic tools are the easiest to use. They require (almost) no training and are web-based, which makes them well suited to involving the crowd or other volunteers. You would need support from within the library, and your infrastructure should be able to handle a continuous flow of improved data. With many people working on the material, though, it would be possible to reach a very high OCR accuracy.
Advanced tools
The Advanced tools require more training, and it is even conceivable that they would be used by library staff only. However, they do provide more functionality and a higher correction speed than the Basic tools, because of the batch corrections (LMU) and carpet sessions (CONCERT). Both tools can reach a very high accuracy when used to their fullest, but that would take some time.
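To illustrate why batch correction speeds things up, here is a minimal sketch of the general idea: a corrector confirms a fix for a recurring error once, and that fix is then applied to every occurrence across all pages. This is not how the LMU tool is implemented; the function name, the data layout and the sample text are assumptions made purely for illustration.

```python
from collections import Counter


def batch_correct(pages, corrections):
    """Apply each confirmed correction to every matching token on every page.

    pages: list of page texts (plain strings).
    corrections: dict mapping a wrong token to its confirmed replacement.
    Returns the corrected pages and a count of replacements per wrong token.
    Note: tokens are split on whitespace, so runs of spaces are normalised.
    """
    corrected_pages = []
    applied = Counter()
    for page in pages:
        new_tokens = []
        for token in page.split():
            if token in corrections:
                applied[token] += 1
                token = corrections[token]
            new_tokens.append(token)
        corrected_pages.append(" ".join(new_tokens))
    return corrected_pages, applied


# Hypothetical sample: one confirmed decision fixes the error on both pages.
pages = ["de ftad is groot", "in de ftad"]
fixed, counts = batch_correct(pages, {"ftad": "stad"})
print(fixed)
print(dict(counts))
```

The gain comes from the ratio between decisions and fixes: one human judgement can repair hundreds of occurrences, whereas a Basic tool needs one action per occurrence.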
Re-OCRing
Re-OCRing would be a very good option when you want to spend very little (manual) effort and have some money to spend on licenses. The result would also be static, which would be an advantage for some library infrastructures, and it would improve the OCR quite a bit, especially when you also plug in a historical dictionary. IMPACT has produced such dictionaries for nine languages.
Finally
This pilot was done with KB people and KB material, with the KB infrastructure in mind, so when you (or your library) think about OCR correction, please do a pilot of your own. We've learned a great deal about what we think is important for us, what is possible and what material and tools would be our best fit, but that might be very different for each library. We would of course be happy to help and share our experiences via the Centre of Competence.