Geneviève Cron of the Bibliothèque nationale de France (BnF) begins by discussing the BnF’s digital library, Gallica. A million documents have been digitised since 1992, with OCR as standard since 2005. OCR accuracy for newspapers is put at 98% at word level, but results vary widely, from 60% upwards. For books, the average accuracy is 90%.
[slideshare id=4138323&doc=bratislavaws-cron-bnf-usecases-100518090456-phpapp02]
http://vimeo.com/11833984
She describes the users of the digital services: mostly French or Francophone, some with special access needs (such as vision impairment). Queries about the digital store go up every year; most relate to content rather than bibliographic information. Content queries split into thematic, geographic, history, genealogy and newspapers.
Geneviève goes on to describe the Gallica workflow: a volume is OCR’d; some books are sent straight to the store, but newspapers are manually corrected by a service provider, and some other books are manually corrected to reach almost 100% accuracy. ABBYY FineReader is used as a validation tool for the manually corrected text. OCR is useful when the words in user queries are not in the bibliographic data, hence the subject spread of content queries. She outlines the Wikimedia/BnF Collaborative Correction plan, which aims for 100% accuracy through user collaboration. Text-to-speech and EPUB projects are in progress. Ground-truthed datasets are being created within IMPACT to aid further research into improving OCR accuracy.
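For readers unfamiliar with the term, word-level accuracy is usually measured by comparing OCR output against a ground-truth transcription. The sketch below is only an illustration of that idea and is not the BnF's or IMPACT's actual evaluation method: the function name `word_accuracy`, the whitespace tokenisation and the use of Python's `difflib` for alignment are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def word_accuracy(ground_truth: str, ocr_text: str) -> float:
    """Share of ground-truth words that the OCR output reproduces exactly.

    Hypothetical helper for illustration: words are split on whitespace
    and aligned in order with difflib, which is an assumption rather
    than a description of any tool mentioned in the talk.
    """
    truth_words = ground_truth.split()
    ocr_words = ocr_text.split()
    if not truth_words:
        raise ValueError("ground truth contains no words")

    # Align the two word sequences; autojunk is disabled so that
    # frequent words are not discarded on long texts.
    matcher = SequenceMatcher(None, truth_words, ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(truth_words)


if __name__ == "__main__":
    truth = "the quick brown fox jumps"
    ocr = "the qu1ck brown fox jumps"
    print(f"word accuracy: {word_accuracy(truth, ocr):.0%}")
```

In this toy case one word in five is misread, so the score is 80%. Real evaluations typically also normalise punctuation, case and hyphenation before comparing, which this sketch leaves out.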
Niall Anderson, BL + Mark-Oliver Fischer, BSB