Workflow thoughts on digitizing historical Tunen texts

ejk · June 6, 2022, 6:33pm

I thought @ejk’s post merited its own topic, so we moved it here from Anybody else working with historical documents? — @pat

I am interested in what workflows/tools people are using here. I am using some texts from 1975 that are not digitised, which look like this:

Ideally my workflow would be OCR → manual corrections → interlinearisation/alignment, but the data isn’t in a good format for OCR for the following reasons:

- Metalanguage (French) word glosses are on alternate lines with the target language (Tunen), meaning non-contiguous text sections
- 7-vowel system with tone marking so some OCR models are less good (though I believe this has improved in the last few years)
- Non-Unicode characters e.g. a small vertical line above a character (to indicate a mid tone)
- Footnotes
- (My copy has pencil annotations and the pages aren’t bright white)

I was therefore advised not to even bother trying OCR/automatic methods. I would be curious whether people have had success semi-automating such processes on similar material in order to mitigate the digitisation bottleneck. @pathall, did you type out your transcriptions manually?
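On the mid-tone mark in the list above: a minimal sketch of one way it might be typed during manual correction. Unicode does have U+030D COMBINING VERTICAL LINE ABOVE; whether that code point is the right match for the mark printed in the 1975 texts is an assumption here, and the helper name `with_mid_tone` is made up for illustration.

```python
import unicodedata

# Assumption: the printed mid-tone mark can be represented by this code point.
MID_TONE = "\u030D"  # COMBINING VERTICAL LINE ABOVE

def with_mid_tone(base: str) -> str:
    """Attach the combining mark to a base vowel and normalise to NFC."""
    return unicodedata.normalize("NFC", base + MID_TONE)

marked = with_mid_tone("a")
# No precomposed character exists for this pair, so it stays two code points.
print(marked, [unicodedata.name(c) for c in marked])
```

Because the result stays a base letter plus a combining mark, fonts and OCR output would both need to handle combining sequences for this to round-trip.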

cbowern · June 7, 2022, 2:44pm

I worked on material a bit like that for Titan (published here: Project MUSE - Sivisa Titan); I tried OCR but it ended up being faster to retype, and I learned the language better through retyping it. That was in 1998 though, so maybe OCR is better now?

pathall · June 8, 2022, 9:12pm

Hi @ejk!

It wasn’t me, it was @nqemlen who did the transcription. We did a lot of work on designing various kinds of tools for transcribing the corpus, but unfortunately those didn’t get into a working state.

What we ended up doing as a first step was for @nqemlen to transcribe into FLEx, and then I wrote a parser that took that output and then generated a presentation format. So I guess that would be one option.

I do think it’s at least worth brainstorming a bit about possible interfaces for transcribing this format, though… especially if the book of texts is a sizeable one, any gains in transcription efficiency could multiply.

One thing to think about: suppose you had perfect OCR. Obviously, this is unlikely, but just as a thought experiment. There will still be the problem that you mention in your first bullet point: the text has an internal structure, and that structure is mostly conveyed “visually”. It might be worth building a custom tool to help you process OCR, even if that OCR is not great, but it depends on what you want to do:

Do you want to keep the translator’s “phrasal” glosses?

Mòndo ɔmoteˀ níaya ninyə̀ á Hɛ́kɔl a nákan o mòkend.
Homme un son nom Hɛkɔl (le rat palmiste) il alla en voyage.

So obviously this isn’t going to line up word-to-word (I mean, you know this):

Tunen	French
Mòndo	Homme
ɔmoteˀ	un
níaya	son
ninyə̀	nom
á	Hɛkɔl
Hɛ́kɔl	(le
a	rat
nákan	palmiste)
o	il
mòkend.	alla

At least the parenthesized bit is… otherwise. (Maybe an explanation of the name Hɛ́kɔl?)

Tunen	French
Mòndo	Homme
ɔmoteˀ	un
níaya	son
ninyə̀	nom
á	what am i doing
Hɛ́kɔl	Hɛkɔl (le rat palmiste)
a	il
nákan	alla
o	en
mòkend.	voyage.

Or like, I dunno, whatever it is! But a key question is, would you want to bother with recovering those French glosses from OCR, or are you planning to re-gloss it yourself (Leipzig-style or something?)

The reason I ask is, the French OCR isn’t horrible, and for that part post-editing might be worth the effort, but there is also the alignment problem, and that might also be worth thinking about as a custom interface.
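As a rough sketch of the alignment problem (not a tool anyone in this thread built): the snippet below splits the French gloss line on whitespace but keeps a parenthesised span attached to the word before it, then checks whether the counts match the Tunen line. The function names `gloss_tokens` and `align` are hypothetical, and the rule "parentheses belong to the preceding gloss" is an assumption based on this one example.

```python
import re

def gloss_tokens(line: str) -> list[str]:
    """Split a gloss line on whitespace, attaching a parenthesised
    span to the token immediately before it."""
    # One token = a run of non-space, non-parenthesis characters,
    # optionally followed by a single "( ... )" group.
    return re.findall(r"[^\s()]+(?:\s*\([^)]*\))?", line)

def align(words: list[str], glosses: list[str]):
    """Pair words with glosses, or return None if the counts differ."""
    if len(words) != len(glosses):
        return None
    return list(zip(words, glosses))

tunen = "Mòndo ɔmoteˀ níaya ninyə̀ á Hɛ́kɔl a nákan o mòkend."
french = "Homme un son nom Hɛkɔl (le rat palmiste) il alla en voyage."

words = tunen.split()
glosses = gloss_tokens(french)
pairs = align(words, glosses)

if pairs is None:
    # Flag the line for manual alignment instead of guessing.
    print(f"Mismatch: {len(words)} words vs {len(glosses)} glosses")
else:
    for word, gloss in pairs:
        print(f"{word}\t{gloss}")
```

On this sentence the counts still differ even after grouping the parenthesis (á has no gloss of its own), which is the kind of line a custom interface would hand to a human rather than align automatically.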

Also, can we just pause to appreciate how adorable a rat palmiste is?


pathall · June 9, 2022, 12:55am

Found myself thinking a bit more about this format… so each story is part of a spread that spans (at least part of one) verso page and recto page.

V0. Title (of the whole book?)
V1. Story title
V2. Free translation (French)
V3. Moral (this doesn’t seem to occur on the Tunen side?)
V4. Footnotes - morphological analysis of some forms in the interlinear on the other side

R0. Title (same as left)
R1. Story title (same as left)
R2. Metadata: Speaker name? (Conteur)
R3. Interlinear

In particular, it’s interesting to compare V2 and the French lines in R3. So for instance, we have the French free translation:

Un homme nommé Hɛkɔl (le rat palmiste), alla en voyage.

against the gloss line of “meta-French”, which reads:

Homme un son nom Hɛkɔl (le rat palmiste) il alla en voyage.

So there’s almost all the information for a “standard”-ish interlinear gloss here. The base line (or “transcription” line) is the bold content on the right, with sentence punctuation. But that line is doing double duty as the word-level transcription of word forms (although there are no morphological boundaries, and we know that this is not an isolating language: not just because it’s Bantu, but because the footnotes for some forms include morphological analysis).
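To make the spread layout concrete, here is a minimal sketch of how the V0 to V4 and R0 to R3 zones might be modelled as a record. The class and field names are invented for illustration and are not taken from FLEx, Dative, or any existing tool; the titles are left as placeholders because they are not given in this thread.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class InterlinearLine:
    tunen: str  # bold base line on the recto page
    gloss: str  # French word glosses printed underneath

@dataclass
class StorySpread:
    book_title: str                   # V0 / R0
    story_title: str                  # V1 / R1
    free_translation: str             # V2
    moral: Optional[str] = None       # V3 (verso only)
    footnotes: list[str] = field(default_factory=list)  # V4
    speaker: Optional[str] = None     # R2 (conteur)
    interlinear: list[InterlinearLine] = field(default_factory=list)  # R3

story = StorySpread(
    book_title="(placeholder)",
    story_title="(placeholder)",
    free_translation="Un homme nommé Hɛkɔl (le rat palmiste), alla en voyage.",
    interlinear=[
        InterlinearLine(
            tunen="Mòndo ɔmoteˀ níaya ninyə̀ á Hɛ́kɔl a nákan o mòkend.",
            gloss="Homme un son nom Hɛkɔl (le rat palmiste) il alla en voyage.",
        )
    ],
)
print(story.interlinear[0].tunen)
```

Keeping the free translation (V2) and the gloss line (R3) in separate fields leaves room to compare them later, as in the example above.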

ejk · June 21, 2022, 12:39pm

Thank you @cbowern and @pathall! Pat, I think the morphological analysis/glossing would be easier to get from Dative/FLEx or to do manually, especially as the texts don’t do morpheme-level/Leipzig glossing, which is the format I’m going for. So I would want to find an OCR model that can look at the ‘recto’ page, ignore the French, and only scan the text in bold font (i.e. the Tunen lines). I don’t know if such software exists…?

It could also be worth just OCR-ing the French part (‘verso’) to speed things up, and manually typing the Tunen. If I used your workflow of FLEx (/Dative) for the Tunen transcription, I could also gloss it manually to train up a parser at the same time. There is also a dictionary, which, if digitised, could be used for automatised word-level translation. It’s funny you ran into issues with á, because that’s one of the forms with multiple different underlying forms (which I could correct manually if the automatic one is wrong). Thanks for the ideas!

Re: rat palmiste, it’s indeed a story personifying the squirrel

pathall · June 21, 2022, 3:57pm

Out of curiosity, how long is the whole book of texts?

For the possibilities of OCR, I think there’s no one better to ask than local hero @nikopartanen.

hp3 · June 24, 2022, 6:37am

Nick T. and company gave a presentation at PARADISEC@100 on their work, which looked at OCR and interlinear text. I also know of a project in Finland that looked at OCR for minority languages; they had thousands of pages. Their team gave a presentation at ComputEL-2 or ComputEL-3, I can’t remember exactly.

ejk · July 22, 2022, 2:24pm

(Saw there’s useful info about this topic on @faytak’s thread! Low-resource language OCR with special characters)

pathall · July 22, 2022, 3:05pm

I think you might be thinking of @nikopartanen’s work?

pathall · July 22, 2022, 3:06pm

Yes indeed! There is a lot of interesting material on this topic here; I wonder if we should start an OCR category…

hp3 · July 22, 2022, 4:59pm

Yes, exactly! We met at an ICLDC and said that we needed better descriptions of corpora, including character inventories and text encoding formats (ASCII, UTF-8, etc.).
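For what a character inventory could look like in practice, here is a small sketch using only Python’s standard library. It decomposes the text (NFD) so that base letters and tone marks are counted separately; that choice, and the function name `character_inventory`, are assumptions for illustration rather than an agreed convention.

```python
import unicodedata
from collections import Counter

def character_inventory(text: str) -> Counter:
    """Count code points after NFD normalisation, ignoring whitespace."""
    decomposed = unicodedata.normalize("NFD", text)
    return Counter(ch for ch in decomposed if not ch.isspace())

sample = "Mòndo ɔmoteˀ níaya ninyə̀ á Hɛ́kɔl a nákan o mòkend."
inventory = character_inventory(sample)

# One row per code point: code point, frequency, official Unicode name.
for ch, n in sorted(inventory.items()):
    name = unicodedata.name(ch, "UNNAMED")
    print(f"U+{ord(ch):04X}  {n:>2}  {name}")
```

Run over a whole corpus, a table like this would show at a glance which combining marks and special letters an OCR model or font has to support.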

cbowern · December 14, 2022, 12:33am

https://scholarspace.manoa.hawaii.edu/items/84406f5c-9ae6-4b05-9c2d-f0e155024629 as part of this project

pathall · January 24, 2023, 5:41pm

Curious to know if this project is ongoing, @ejk?