At the beginning of November I’m giving a small workshop about text recognition. It’s a general introduction to the topic, but with a practical orientation. The main focus will be on an approach where we specifically train or finetune a text recognition model to match some material, so that the results we get are as good as they can be – though mainly within that same type of text. I believe this often works very well in the language documentation context, where we have rare transcription systems that are not widely used. We’ll address both printed and handwritten materials.
Participants are more than welcome to come to the workshop with texts that they would like to process. Together we can estimate how much time and effort that kind of data would currently demand. If you think some material is impossible to process, great, let’s look into it – we want to know where the boundaries of the impossible are. If you are wondering what kind of data to bring to the workshop, here are a few ideas to think about:
- Do you have the rights to process, publish and use that data? The world is full of all kinds of materials, but it often makes the most sense to work intensively with materials that we can use, share and republish. Of course, situations and motives vary and are complex, so bring in whatever is important for you and fine in your situation.
- How much similarly structured material is in that dataset? Since we create text recognition models that are very specific to individual writing systems and handwriting styles, we benefit the most from larger consistent collections.
- Does your text contain crazy diacritics and idiosyncratic characters? Great, bring it in!
I think we will focus quite a bit on the Transkribus platform, as it can be used very easily and effectively with both handwritten and printed materials. And the current online editor is so good that it makes everything very accessible and easy to organize. At the same time, the software is not fully open source, and users have only a limited amount of free credits to recognize pages, so it isn’t perfect. But what would be? It’s still an excellent tool with a vibrant user community.
Calamari is another lovely open source tool for creating text recognition models, and we will discuss it briefly. I use it a lot, but all the workflows I’ve come up with for it are pretty technical and complicated, so I don’t know how well it fits into a workshop. But we will use it to explain how these text recognition systems work internally (they are all pretty similar in the end).
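To give a rough idea of what that workflow looks like in practice, here is a minimal sketch of training and applying a Calamari model from Python by calling its command-line tools. The flags are the ones I remember from Calamari 1.x and the paths are placeholders, so check `calamari-train --help` against the version you actually have installed.

```python
# A rough sketch only: train a line-level Calamari model and run it on new
# line images. Flags follow Calamari 1.x as I recall them; file paths are
# made up for the example.
import glob
import subprocess

# Each training image (lines/line_0001.png) is expected to have its
# transcription next to it in lines/line_0001.gt.txt.
train_files = sorted(glob.glob("lines/*.png"))
subprocess.run(["calamari-train", "--files", *train_files], check=True)

# Recognize new line images with the trained model; Calamari writes the
# result next to each image as a .pred.txt file.
new_files = sorted(glob.glob("new_lines/*.png"))
subprocess.run(
    ["calamari-predict",
     "--checkpoint", "checkpoint/best.ckpt.json",  # wherever your run saved it
     "--files", *new_files],
    check=True,
)
```

The important point is not the exact commands but the shape of the task: pairs of line images and transcriptions go in, a model comes out, and that model reads new line images of the same kind of material.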
We will also discuss Tesseract a bit, just so we are familiar with it. Tesseract has a pretty good layout detection engine for printed texts, which is often useful (especially in combination with Calamari, which just works with lines).
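As an illustration of that combination, here is a small sketch that uses Tesseract’s layout analysis through the pytesseract wrapper to cut a page into line images, which could then be fed to a line-based recognizer such as Calamari. It assumes pytesseract and Pillow are installed, and the file names are placeholders.

```python
# A small sketch: use Tesseract's layout analysis to crop line images from a
# page scan. pytesseract and Pillow are assumed; the file names are made up.
from pathlib import Path

from PIL import Image
import pytesseract
from pytesseract import Output

page = Image.open("page_001.png")  # a hypothetical page scan
data = pytesseract.image_to_data(page, output_type=Output.DICT)

out_dir = Path("lines")
out_dir.mkdir(exist_ok=True)

# In Tesseract's output, level 4 entries are text lines
# (1 = page, 2 = block, 3 = paragraph, 4 = line, 5 = word).
line_count = 0
for i, level in enumerate(data["level"]):
    if level != 4:
        continue
    left, top = data["left"][i], data["top"][i]
    width, height = data["width"][i], data["height"][i]
    line_img = page.crop((left, top, left + width, top + height))
    line_img.save(out_dir / f"line_{line_count:04d}.png")
    line_count += 1
```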
One toolset we will not discuss is Larex, which is apparently really good for printed texts, but I just haven’t used it myself yet. If someone knows it well, please join us and share that knowledge!
Questions
Just post in this thread if you have questions or would like us to focus on something specific. You can also post examples of your data here. I’ll add a few examples from my own work below just to get us started.
Examples
We are currently getting about 93% accuracy with M.A. Castrén’s manuscripts. There are about 10,000 pages in this collection. This page contains Tundra Nenets, Russian and Swedish. The image is taken from the Manuscripta Castreniana project.
When it comes to the Komi texts in the Syrjänische Texte series, the OCR results are now almost perfect, and we plan to get the first three books proofread during this year. The series contains five books, but the last two we already have as digital files from the late 90s and early 2000s (with the technical issues readers on this site can probably imagine!). You can see that the diacritics are really nicely where they are supposed to be!