New OCR Model: olmOCR

It’s been a really exciting time lately for low-resource OCR! AI2 released a cutting-edge OCR model called olmOCR a couple days ago.

I tried it on a page from a Garo-English dictionary:

Despite numerous challenges (mixed Bengali/Latin script, word-internal dots in Latin script), I think it does a fantastic job transcribing it. Here’s its output with no edits:

There are still problems, of course. The Bengali-script output doesn’t look very reliable: in the last line, for example, the output seems to be a very poor match for what’s in the image.

Thanks! I’m looking forward to trying this. I tried some material that I thought was straightforward with ChatGPT, and it was laughably disastrous.

@lgessler, any interest in demoing this for the March tinker Zoom?

sure, would love to!

It attempts to OCR tables, which is pretty impressive.

IXF 1. Agemmay

Isekkilen n tmaziyt d wi :

asekkil isem-is tifinay amek inţeq amedya
a a (ney : ayra) Φ a aman
b ba b bib
c ca če ch bru
d yeč d tch amcic
d da dh d ečč
e đar dadda dadda
f ilem adar adar
g fa els eler
gw gaw argaz argaz
ġ yeğ gma gma
y γar agwad agwad
h ha egğ egğ
h him iyi iyi
i i (ney : iγri) hud hud
j ja imi imi
k ka ajenjar ajenjar
rki rki

Tamasheq has 33 consonants, featuring six manners of articulation and eight places of articulation. There are no non-pulmonic consonants. The consonants are detailed in the table below.[4]:23

Labial Alveolar Palato-alveolar Velar Uvular Pharyngeal Laryngeal
plain pharyngealized
Plosive
voiceless (p) t tf k (q) (?)
voiced b d dj gi g
Fricative
voiceless f s χ (n) h
voiced z ž̛ (…)
Nasal m n
Liquid l
rhotic r
Approximant w j

Another OCR drop! Mistral OCR | Mistral AI

Just came across this newer OCR model, which claims to support 90+ languages: GitHub - datalab-to/surya: OCR, layout analysis, reading order, table recognition in 90+ languages. Haven’t tried it myself.

A few more OCR models have come out over the last few weeks. I haven’t tried any of them yet, but it might be interesting to run a larger-scale benchmark in a langdoc context at some point, comparing a selection of the models in this thread (that could also make a cool ComputEL paper).
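For what it's worth, here is a minimal sketch of how such a comparison could be scored, assuming each page has a hand-corrected reference transcription and one output string per model. It computes character error rate (CER) as Levenshtein edit distance divided by reference length. The function names and the toy strings are made up for illustration and don't come from any of the models' tooling.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions all cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from the reference
                            curr[j - 1] + 1,      # insert h into the reference
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, counted over Unicode code points after NFC."""
    # Normalizing first keeps precomposed vs. combining diacritics from
    # being counted as OCR errors. Note this counts code points, not
    # grapheme clusters, which matters for scripts like Bengali.
    reference = unicodedata.normalize("NFC", reference)
    hypothesis = unicodedata.normalize("NFC", hypothesis)
    if not reference:
        raise ValueError("reference transcription must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def rank_models(reference: str, outputs: dict[str, str]) -> list[tuple[str, float]]:
    """Return (model name, CER) pairs, best (lowest CER) first."""
    scores = [(name, cer(reference, text)) for name, text in outputs.items()]
    return sorted(scores, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Toy data only: hypothetical model names and invented outputs.
    gold = "argaz"
    outputs = {"model_a": "arqaz", "model_b": "argaz"}
    for name, score in rank_models(gold, outputs):
        print(f"{name}: CER = {score:.2f}")
```

A real benchmark would probably want to report CER per script as well (e.g. Bengali vs. Latin portions separately), since a single page-level number can hide exactly the kind of script-specific failures mentioned earlier in the thread.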

And some layout parsing models:

HF has a blog post and some inference cost comparisons for some of these models: Supercharge your OCR Pipelines with Open Models

And I found one more that looks interesting: Logics-MLLM/Logics-Parsing · Hugging Face

I found some notes on how to get DeepSeek OCR set up locally: Getting DeepSeek-OCR working on an NVIDIA Spark via brute force using Claude Code

Another fairly recent model seems to be IBM’s Docling: @sungkim.bsky.social on Bluesky with a demo here: granite-docling-258M demo - a Hugging Face Space by ibm-granite

medieval-data (Medieval Data) also has a full-page OCR approach, which seems to work reasonably well, for Latin at least.
