Katrien Depuydt of INL began her talk by considering the difference between a lexicon and an electronic dictionary, and then covered how a computer lexicon is defined.
The OCR lexicon is:
- a checked list of words in a language
- based on a corpus of dated texts
- with frequency information
- preferably from the same time period
A computer lexicon is held in a structured digital format (for example, a relational or XML database) and is intended primarily for computer use, with explicitly coded information on parts of speech, syntax, etc. It is used for advanced searching (to include spelling variations across history, for example) or keyword extraction.
Lexica in IMPACT
Two types of lexicon have been developed within IMPACT. The first is the OCR lexicon, a checked list of words in a language based on a collection of dated texts (you don’t want to process a 19th century text against a 16th century lexicon). It also contains frequency information for particular words in texts between particular dates, so it can make judgements about how likely a word is to appear in a given era, or within a given work.
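To make the idea of dated frequency information concrete, here is a minimal Python sketch. It is not the IMPACT implementation: the class name `OcrLexicon`, its methods, and the sample counts are all invented for illustration. It only shows how per-year counts could be turned into an estimate of how likely a word form is within a given period.

```python
from collections import defaultdict


class OcrLexicon:
    """Toy dated word list with frequencies (hypothetical, not the IMPACT data model)."""

    def __init__(self):
        # word -> year -> number of occurrences in the dated corpus
        self._counts = defaultdict(lambda: defaultdict(int))
        # year -> total number of tokens recorded for that year
        self._totals = defaultdict(int)

    def add(self, word, year, count=1):
        """Record that `word` occurs `count` times in texts dated `year`."""
        self._counts[word][year] += count
        self._totals[year] += count

    def relative_frequency(self, word, start, end):
        """Share of all recorded tokens between `start` and `end` (inclusive) that are `word`."""
        total = sum(n for y, n in self._totals.items() if start <= y <= end)
        if total == 0:
            return 0.0  # no corpus material for this period
        years = self._counts.get(word, {})
        hits = sum(n for y, n in years.items() if start <= y <= end)
        return hits / total


# Made-up counts for an older and a modern Dutch spelling.
lexicon = OcrLexicon()
lexicon.add("mensch", 1850, 8)
lexicon.add("mens", 1850, 2)
lexicon.add("mens", 1990, 10)

print(lexicon.relative_frequency("mensch", 1800, 1899))
print(lexicon.relative_frequency("mensch", 1950, 2000))
```

With these toy counts, the older spelling scores high for the 19th century and zero for the late 20th, which is the kind of signal an OCR engine could use when choosing between candidate readings.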
The IR lexicon, by contrast, links the modern variation of a word with all its past variations, according to date. This allows for very advanced searching: a search for a modern word will produce all recognised variants through print history.
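The search behaviour described above can be sketched as a simple query expansion. This is an assumption-laden illustration rather than the IMPACT tool itself: the `Variant` record, the `IR_LEXICON` table, the `expand_query` function, and the attestation dates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """A historical spelling and the (invented) period in which it is attested."""
    form: str
    first_attested: int
    last_attested: int


# Hypothetical IR lexicon: modern form -> known variants with made-up date ranges.
IR_LEXICON = {
    "mens": [
        Variant("mensch", 1600, 1947),
        Variant("mens", 1900, 2000),
    ],
}


def expand_query(modern_form, start=None, end=None):
    """Return the spellings to search for, optionally limited to a date range."""
    forms = set()
    for variant in IR_LEXICON.get(modern_form, []):
        if start is not None and variant.last_attested < start:
            continue  # variant fell out of use before the period of interest
        if end is not None and variant.first_attested > end:
            continue  # variant only appears after the period of interest
        forms.add(variant.form)
    # Fall back to the query term itself when nothing matches.
    return sorted(forms) or [modern_form]


print(expand_query("mens"))
print(expand_query("mens", start=1600, end=1700))
```

A search front end could pass the returned list to the full-text index as alternatives, so that a user typing the modern word also retrieves pages printed with the older spelling.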
Katrien now explains how these lexica have been built, touching on the development of the IMPACT Named Entities database, which uses tagging to allow a reader or researcher to distinguish between ‘butcher’ (the job) and ‘Butcher’ (the name). She presents early results from the use of these tools, which show a marked increase in the OCR accuracy of historical texts after the process of corpus enrichment. She concludes with a presentation of the back end of the tool, demonstrating how historical variants are matched over time through a process of ‘attestation’ (words presented in context to define and confirm meaning).
Katrien’s presentation is here: