When evaluating OCR, some users ask that diacritics not be counted as errors, since they are often ignored in information retrieval (search engines). This request raises some doubts about the expected behavior. What do you think about the following issues?
If the option “Ignore diacritics” is selected, the evaluation code should behave as follows:
- Missing punctuation shall not be considered an error: “give ’em a chance” = “give em a chance”.
- Inserted punctuation should be considered an error: “boys” is not “boy’s”.
- Space shall not be taken into account if it follows an omitted punctuation already preceded by space: “boys & girls” = “boys girls”.
- One should not take into account added space when it replaces a diacritic: “I’m” = “I m”.
- One should not take into account missing space when following an omitted punctuation character: “I’ m” = “Im” (but then “Jess’ father” = “Jessfather”).
- One should not take into account missing space when preceding an omitted punctuation character: “His ‘n’ Her” = “Hisn Her”.
- Allow a single space to be inserted after punctuation: “I’m” = “I’ m”, “don’t” = “don’ t”.
- Allow a single space to be inserted before punctuation: “don’t” = “don ‘t”.
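
To make the discussion concrete, here is a minimal Python sketch of one possible reading of the rules above. It is not the evaluation tool's actual code. The function names `build_pattern` and `matches_ignoring_punctuation` are hypothetical, and the sketch assumes that "punctuation" means any character in a Unicode `P*` category. The reference text is turned into a regular expression in which every punctuation character is optional and may be surrounded by at most one space, so the comparison is deliberately asymmetric: punctuation may be missing from the OCR output, but punctuation that the reference does not contain still counts as an error.

```python
import re
import unicodedata


def is_punct(ch: str) -> bool:
    # Assumption: "punctuation" = any Unicode category starting with "P".
    return unicodedata.category(ch).startswith("P")


def build_pattern(reference: str) -> re.Pattern:
    """Turn the reference text into a regex that tolerates omitted punctuation."""
    ref = " ".join(reference.split())  # collapse runs of whitespace
    parts = []
    i = 0
    while i < len(ref):
        ch = ref[i]
        if ch == " " or is_punct(ch):
            # Collect a whole "gap": a run of spaces and punctuation.
            j = i
            while j < len(ref) and (ref[j] == " " or is_punct(ref[j])):
                j += 1
            puncts = [c for c in ref[i:j] if c != " "]
            if puncts:
                # Each punctuation mark is optional, with an optional
                # single space before and after it.
                parts.append(" ?" + "".join(f"(?:{re.escape(c)})? ?" for c in puncts))
            else:
                # A plain space between words is still required.
                parts.append(" ")
            i = j
        else:
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts))


def matches_ignoring_punctuation(reference: str, ocr_output: str) -> bool:
    ocr = " ".join(ocr_output.split())
    return build_pattern(reference).fullmatch(ocr) is not None


# Usage: the first argument is the reference, the second the OCR output.
print(matches_ignoring_punctuation("give ’em a chance", "give em a chance"))  # True
print(matches_ignoring_punctuation("boys", "boy’s"))                          # False
print(matches_ignoring_punctuation("Jess’ father", "Jessfather"))             # True
```

The last call shows the side effect already noted in the list: once a space next to an omitted punctuation mark is optional, “Jess’ father” and “Jessfather” become indistinguishable. A stricter variant could keep the space mandatory whenever the reference has one in the gap, at the cost of rejecting “Hisn Her”.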