New OCR Model: olmOCR

It’s been a really exciting time lately for low-resource OCR! AI2 released a cutting-edge OCR model called olmOCR a couple days ago.

I tried it on a page from a Garo-English dictionary:

Despite numerous challenges (mixed Bengali/Latin script, word-internal dots in Latin script), I think it does a fantastic job transcribing it. Here’s its output with no edits:

There are still problems, of course. It looks like the Bengali-script output is not very reliable: in the last line, for example, the output seems to be a very poor match for what’s in the image.
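One rough way to put a number on "poor match" is character error rate (CER): the edit distance between a hand-checked transcription and the OCR output, divided by the length of the hand-checked one. Below is a minimal sketch, assuming you have such a reference line to compare against. The function names `edit_distance` and `cer` are made up for illustration and are not part of olmOCR.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character from b
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    # NFC-normalize so composed and decomposed forms compare as equal.
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

Calling `cer(reference_line, ocr_line)` on each script separately would show whether the Bengali lines really are worse than the Latin ones. Note that this counts Unicode code points, so a Bengali consonant plus its vowel sign counts as two characters. That is an assumption of this sketch, not a standard everyone follows.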


Thanks! I’m looking forward to trying this. I tried some material that I thought was straightforward with ChatGPT, and it was laughably disastrous.


@lgessler, any interest in demoing this for the March tinker Zoom?


sure, would love to!


It attempts to OCR tables, which is pretty impressive.

IXF 1. Agemmay

Isekkilen n tmaziyt d wi :

asekkil isem-is tifinay amek inţeq amedya
a a (ney : ayra) Φ a aman
b ba b bib
c ca če ch bru
d yeč d tch amcic
d da dh d ečč
e đar dadda dadda
f ilem adar adar
g fa els eler
gw gaw argaz argaz
ġ yeğ gma gma
y γar agwad agwad
h ha egğ egğ
h him iyi iyi
i i (ney : iγri) hud hud
j ja imi imi
k ka ajenjar ajenjar
rki rki

Tamasheq has 33 consonants, featuring six manners of articulation and eight places of articulation. There are no non-pulmonic consonants. The consonants are detailed in the table below.[4]:23

Labial Alveolar Palato-alveolar Velar Uvular Pharyngeal Laryngeal
plain pharyngealized
Plosive
voiceless (p) t tf k (q) (?)
voiced b d dj gi g
Fricative
voiceless f s χ (n) h
voiced z ž̛ (…)
Nasal m n
Liquid l
rhotic r
Approximant w j

Another OCR drop! Mistral OCR | Mistral AI