Low-resource language OCR with special characters

Hi all-

Wondering if anyone has had success in getting OCR to work at all for low-resource languages with numerous special characters. During my fieldwork last month, I came into possession of a Kom-language New Testament, which is assuredly the longest written text in the language at the moment. I’d love to have an OCR version of this, both as a useful community product and as a means to get phoneme/grapheme/n-gram/word frequencies as a foundation for later research. However, existing “easy” OCR solutions are not going to cut it - the following special characters are heavily used and aren’t well recognized.

  • Angma (ŋ)
  • Several unusual vowel symbols (æ œ ɨ)
  • Low and falling tone marks (ò ô)
  • Glottal stop is marked as an apostrophe (')

I assume there will be hand-correction (and indeed it’ll need to happen, since there are a lot of verse numbers, footnote markers, etc.), but I’d like to avoid having to manually insert every single <ŋ>, for starters. So something which is not language-specific, or which can be trained to recognize certain unusual symbols, is probably going to be needed.
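For context, this is roughly what I’d want to run over the corrected text once it exists; just a quick sketch, assuming the final output ends up in a single UTF-8 text file (the file name is made up):

```python
import re
import unicodedata
from collections import Counter

# Assumption: the hand-corrected OCR output lives in one UTF-8 text file.
with open("kom_nt.txt", encoding="utf-8") as f:
    text = f.read()

# NFC-normalize so precomposed and combining forms of the tone-marked
# vowels (ò, ô) count as the same grapheme, then lowercase.
text = unicodedata.normalize("NFC", text).casefold()

# Word tokens: runs of letters, keeping the apostrophe that marks the
# glottal stop (which exact apostrophe codepoint the orthography uses is
# a guess, so both the straight and curly forms are allowed here).
# Digit-only tokens (verse numbers) are dropped.
words = [w for w in re.findall(r"[\w'’]+", text) if not w.isdigit()]

grapheme_freq = Counter(ch for w in words for ch in w)
word_freq = Counter(words)
char_bigrams = Counter(w[i:i + 2] for w in words for i in range(len(w) - 1))

print(grapheme_freq.most_common(20))
print(char_bigrams.most_common(20))
print(word_freq.most_common(20))
```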

A typical page is included below in case anyone wants to give it a spin with their preferred OCR solutions and report back.

4 Likes

Not too familiar with this space, but I know Rijhwani et al. did some great work on this a year ago: OCR for Endangered Language Texts

Their code’s publicly available and looks pretty well documented for research code: GitHub - shrutirij/ocr-post-correction

2 Likes

Huh yeah, even looks like they’re eliciting projects:

Which links to this form.

(& Oh hey, at least one local hero is involved with this: @Daisy. Also, I’m sure @nikopartanen would have a lot to say…)

1 Like

I would recommend trying Transkribus. You detect the layout, run the basic printed-text model first, and then manually fix, say, two or three pages. Then you train a model in Transkribus, and it should be close to perfect after a few more pages; even with just a few pages you can get quite far. The characters in this kind of data are extremely easy to recognize; the models that come out of the box just don’t have them.

You can also check out tools like Calamari, and Larex is apparently also very good. Transkribus is not entirely free, but new users get 500 free credits, and you can get more for research purposes, so it’s easy to at least try.

2 Likes

Thanks for reminding me of this (I dimly recall having seen it now). I dropped them a line at the link Pat pointed out.

1 Like

I think the missing step for me is just how to do the training more generally. I gave the Transkribus demo a spin and on a first pass it seemed to perform worse than a couple of the other free tools. But obviously it’s trainability that matters more, and this clarifies the workflow rather nicely: first pass, fix ~5 pages, then run a “corrector” pass with a model trained on the fixed pages?
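For what it’s worth, “performed worse” above is just me eyeballing the output; a fairer comparison would be character error rate against a hand-corrected page. A rough sketch of that check (the file names are made up, and I normalize to NFC so combining vs. precomposed tone marks don’t count as errors):

```python
import unicodedata

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    # Standard dynamic-programming Levenshtein distance.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)

# Hypothetical files: one hand-corrected page vs. a tool's first pass.
with open("page_001.gt.txt", encoding="utf-8") as f:
    gold = f.read()
with open("page_001.firstpass.txt", encoding="utf-8") as f:
    ocr = f.read()
print(f"CER: {cer(gold, ocr):.2%}")
```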

2 Likes

I think the training tools are not activated by default, so you need to write to info@readcoop.eu and ask them to activate them for your user account. At least that is how it worked before. Then these instructions apply:

If you train what they call an HTR+ model, it needs at least one page for validation. That page is only used to measure accuracy, so I recommend putting in the page with the fewest lines; that way the validation set has something in it, but most of the material goes into the training.

You should very easily get an almost perfect result. The uppercase forms of the rare characters will remain problematic for quite a while, however, since you need quite a lot of proofread material before the model has seen all of them in training.
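A quick way to see which of those uppercase forms your proofread pages already cover is simply to count them over the exported plain-text transcriptions. Something like this, for example (the character set and folder layout are only assumptions based on this thread):

```python
from collections import Counter
from pathlib import Path

# Uppercase forms of the special characters mentioned above (assumed set).
RARE_UPPER = ["Ŋ", "Æ", "Œ", "Ɨ", "Ò", "Ô"]

counts = Counter()
# Assumes the proofread pages were exported as plain-text files.
for path in Path("ground_truth").glob("*.txt"):
    counts.update(path.read_text(encoding="utf-8"))

for ch in RARE_UPPER:
    status = counts[ch] or "MISSING - the model has never seen this"
    print(f"U+{ord(ch):04X} {ch}: {status}")
```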

2 Likes