At the beginning of November I’m giving a small workshop about text recognition. It’s a general introduction to the topic, but with a practical orientation. The main focus will be on an approach where we specifically train or finetune a text recognition model to match some material, so that the results we get are as good as they can be – though mainly within that same type of text. I believe this often works very well in the language documentation context, where we have rare transcription systems that are not widely used. We’ll address both printed and handwritten materials.
Participants are more than welcome to come to the workshop with texts that they would like to process. Together we can estimate how much time and effort that kind of data would currently demand. If you think some material is impossible to process, great, let’s look into it – we want to know where the boundaries of the impossible are. If you are wondering what kind of data to bring to the workshop, here are a few ideas to think about:
- Do you have the rights to process, publish and use that data? The world is full of all kinds of materials, but it often makes the most sense to work intensively with materials that we can use, share and republish. Of course, situations and motives vary and are complex, so bring in whatever is important for you and fine in your situation.
- How much similarly structured material is in that dataset? Since we create text recognition models that are very specific to individual writing systems and handwriting styles, we benefit the most from larger consistent collections.
- Does your text contain crazy diacritics and idiosyncratic characters? Great, bring it in!
I think we will focus quite a bit on the Transkribus platform, as it can be used very easily and effectively with both handwritten and printed materials. And the current online editor is so good that it makes everything very accessible and easy to organize. At the same time, the software is not fully open source, and users have only a limited amount of free credits to recognize pages, so it isn’t perfect. But what would be? It’s still an excellent tool with a vibrant user community.
Calamari is another lovely open source tool for creating text recognition models, and we will discuss it briefly. I use it a lot, but all the workflows I’ve come up with for it are pretty technical and complicated, so I don’t know how well it fits into a workshop. But we will use it to explain how these text recognition systems work internally (they are all pretty similar in the end).
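To give a rough idea of what that workflow looks like in practice, here is a minimal sketch of training and applying a Calamari model from Python by calling its command-line tools. The flags are the ones I remember from Calamari 1.x and the paths are placeholders, so check `calamari-train --help` against the version you actually have installed.

```python
# A rough sketch only: train a line-level Calamari model and run it on new
# line images. Flags follow Calamari 1.x as I recall them; file paths are
# made up for the example.
import glob
import subprocess

# Each training image (lines/line_0001.png) is expected to have its
# transcription next to it in lines/line_0001.gt.txt.
train_files = sorted(glob.glob("lines/*.png"))
subprocess.run(["calamari-train", "--files", *train_files], check=True)

# Recognize new line images with the trained model; Calamari writes the
# result next to each image as a .pred.txt file.
new_files = sorted(glob.glob("new_lines/*.png"))
subprocess.run(
    ["calamari-predict",
     "--checkpoint", "checkpoint/best.ckpt.json",  # wherever your run saved it
     "--files", *new_files],
    check=True,
)
```

The important point is not the exact commands but the shape of the task: pairs of line images and transcriptions go in, a model comes out, and that model reads new line images of the same kind of material.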
We will also discuss Tesseract a bit, just so we are familiar with it. Tesseract has a pretty good layout detection engine for printed texts, which is often useful (especially in combination with Calamari, which just works with lines).
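As an illustration of that combination, here is a small sketch that uses Tesseract’s layout analysis through the pytesseract wrapper to cut a page into line images, which could then be fed to a line-based recognizer such as Calamari. It assumes pytesseract and Pillow are installed, and the file names are placeholders.

```python
# A small sketch: use Tesseract's layout analysis to crop line images from a
# page scan. pytesseract and Pillow are assumed; the file names are made up.
from pathlib import Path

from PIL import Image
import pytesseract
from pytesseract import Output

page = Image.open("page_001.png")  # a hypothetical page scan
data = pytesseract.image_to_data(page, output_type=Output.DICT)

out_dir = Path("lines")
out_dir.mkdir(exist_ok=True)

# In Tesseract's output, level 4 entries are text lines
# (1 = page, 2 = block, 3 = paragraph, 4 = line, 5 = word).
line_count = 0
for i, level in enumerate(data["level"]):
    if level != 4:
        continue
    left, top = data["left"][i], data["top"][i]
    width, height = data["width"][i], data["height"][i]
    line_img = page.crop((left, top, left + width, top + height))
    line_img.save(out_dir / f"line_{line_count:04d}.png")
    line_count += 1
```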
One toolset we will not discuss is Larex, which is apparently really good for printed texts, but I just haven’t used it myself yet. If someone knows it well, please join us and share that knowledge!
Questions
Just post in this thread if you have questions or would like us to focus on something specific. You can also post examples of your data here. I’ll add a few examples from my own work below just to get us started.
Examples
We are currently getting about 93% accuracy with M.A. Castrén’s manuscripts. There are about 10,000 pages in this collection. This page contains Tundra Nenets, Russian and Swedish. The image is taken from the Manuscripta Castreniana project.
When it comes to the Komi texts in the Syrjänische Texte series, the OCR results are now almost perfect, and we plan to get the first three books proofread during this year. The series contains five books, but the last two we already have as digital files from the late 90s and early 2000s (with the technical issues readers on this site can probably imagine!). You can see that the diacritics are really nicely where they are supposed to be!