Next on stage is Niall Anderson from the British Library, talking about public participation in mass digitisation, and “why we think that’s a good idea”.
A 2008 Conference of European Libraries survey estimated that there were now some 8 million digitised text-based items in existence, proof that we live in an era of effective mass digitisation. However, British Library research has suggested that some 20% of all digital text produced is effectively unreadable due to poor capture, OCR deficiencies, and difficulties native to the source material. If the point of mass digitisation is to enable wider public use of previously inaccessible material, then this shortfall in readable text must be addressed. Niall outlines the history and future of collaborative correction in mass digitisation by demonstrating IMPACT’s CONCERT tool, which will make more digital documents available through crowdsourcing and structured correction of text.
CONCERT (Cooperative Engine for the Correction of Extracted Text) works in three steps: a character session, a word session and a page-level session. The character session presents the user with a list of characters that the OCR has recognised as the same letter. The user can then mark characters as “suspicious”. In the next step, these characters are presented in word context, where the user can again decide whether the characters were recognised correctly. In the final step, characters and words that are still marked as suspicious are shown at page level.
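The three-step flow described above can be sketched roughly as follows. This is only an illustration of the idea, not CONCERT's actual implementation: the `Glyph` class, the function names and the sample data are all hypothetical, and the user's decisions are modelled simply as sets of glyph ids.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical record for one character image found by the OCR."""
    glyph_id: int
    ocr_char: str   # letter the OCR assigned to this glyph
    word: str       # OCR'd word that contains the glyph
    page: int
    suspicious: bool = False


def character_session(glyphs):
    """Step 1: group glyphs by the letter the OCR assigned to them."""
    groups = defaultdict(list)
    for g in glyphs:
        groups[g.ocr_char].append(g.glyph_id)
    return dict(groups)


def mark_suspicious(glyphs, flagged_ids):
    """Record the user's choice of glyphs that look wrong within a group."""
    for g in glyphs:
        if g.glyph_id in flagged_ids:
            g.suspicious = True


def word_session(glyphs, confirmed_ids):
    """Step 2: show flagged glyphs in word context.

    Glyphs the user confirms as correctly recognised lose their flag.
    Returns the (glyph id, word) pairs that were put up for review.
    """
    reviewed = [(g.glyph_id, g.word) for g in glyphs if g.suspicious]
    for g in glyphs:
        if g.suspicious and g.glyph_id in confirmed_ids:
            g.suspicious = False
    return reviewed


def page_session(glyphs):
    """Step 3: collect glyphs that are still flagged, grouped by page."""
    pages = defaultdict(list)
    for g in glyphs:
        if g.suspicious:
            pages[g.page].append(g.glyph_id)
    return dict(pages)


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        Glyph(1, "e", "the", page=1),
        Glyph(2, "e", "ear", page=1),
        Glyph(3, "c", "cat", page=2),
        Glyph(4, "e", "eow", page=2),
    ]
    print(character_session(sample))
    mark_suspicious(sample, {2, 4})
    print(word_session(sample, confirmed_ids={2}))
    print(page_session(sample))
```

In this sketch, each step narrows the set of flagged glyphs, so only the hardest cases reach the page-level view, where the most context is available.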
Mark-Oliver Fischer, BSB