📜 Anybody else working with historical documents?

nqemlen · April 19, 2020, 8:12am

Is anyone else working with linguistic data from historical documents? If so, it would be great to learn about your experiences. Here’s a page from the Aymara text, from 1612, that @pathall and I are working on. The left column is in Aymara, and the right column is a Spanish translation of the Aymara:

So there are several steps:

transcribing the text
putting the 17th-century Aymara spelling conventions into a normalized, modern orthography
doing an interlinear analysis of the Aymara
aligning the Aymara sentences with the corresponding Spanish sentences
giving free translations of both
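
As a rough illustration only, the steps above could be tracked per sentence with a record like the one below. This is a sketch and not how Docling stores anything: the class names, field names, and the `missing_steps` helper are all hypothetical, and the example uses a placeholder string rather than real text from the 1612 book.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Word:
    form: str   # word in the normalized orthography
    gloss: str  # interlinear gloss for that word


@dataclass
class Sentence:
    # All field names here are hypothetical, chosen to mirror the steps above.
    transcription: str                               # text as printed in the source
    normalized: str = ""                             # modern, normalized orthography
    words: List[Word] = field(default_factory=list)  # interlinear analysis
    spanish: str = ""                                # aligned Spanish sentence
    free_aymara: str = ""                            # free translation of the Aymara
    free_spanish: str = ""                           # free translation of the Spanish

    def missing_steps(self) -> List[str]:
        """Return the names of the fields that have not been filled in yet."""
        return [name for name, value in asdict(self).items() if not value]


# Placeholder text only; no real data from the book is shown here.
sentence = Sentence(transcription="<sentence as printed>")
print(sentence.missing_steps())
print(json.dumps(asdict(sentence), ensure_ascii=False, indent=2))
```

Keeping every step on one record makes it easy to see which sentences still need work, and the whole thing serializes to JSON without extra effort.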

It’s a lot of work, but it’s a very rich text. If anyone wants to use Docling for similar purposes, let me know! Maybe we can develop some shared tools.
Nick

pathall · April 19, 2020, 9:10am

Hi @nqemlen!

Nick introduced me to the Aymara materials, which he had been working with for some time before I got involved. If only every language had this much bilingual material available.

Nick knows it much better than I do, but the corpus consists of at least these things:

A nearly 1000-page Spanish to Aymara (to Spanish!) translation of a religious book
Two dictionaries, Spanish - Aymara and Aymara - Spanish
A grammar

So that’s a whole Boasian trilogy right there. It’s a really amazing corpus.

nqemlen · April 19, 2020, 9:31am

It sure is! But totally inaccessible without the kinds of tools you’re developing.

pathall · April 19, 2020, 9:40am

And the hard work of linguists and other language workers like you who are doing the actual work of transcription!

My own software, such as it is, is inspired in large part by precedents that already exist: ELAN, FLEx, Toolbox, even Praat. (Mad props to all of those.) My own dream is to bring basic functionality from the varied domains of documentary linguistics (texts, media, lexicography…) into a single platform (the web), and to encourage all of us linguist folks to talk about what we want our software to do going forward. Really hopeful that this forum can be a place where experts in particular languages (like you!) can participate in such a discussion.

pathall · April 19, 2020, 1:56pm

Back to the historical documents topic, I once worked on the so-called “Kostromitinov Vocabulary” from 1833 (never published it, didn’t even finish the paper!):

Here’s the whole thing:

http://ruphus.com/kostromitinov/document.html

Here’s a rather preliminary web version of a transcription I did:

And the rest of that:

http://ruphus.com/kostromitinov/transcription.html

It’s an interesting story. In the early 19th century, there was a Russian outpost in California (it’s still there, a park now) called Fort Ross. This was (and is) in Kashaya Pomo territory. Further south, the brutality of the Spanish (Mexican) missions led to various peoples fleeing north, and some ended up at Fort Ross, particularly the Bodega Miwok, but many others as well. (There were also many native Alaskan peoples at Ross, who had come down with the Russians.)

Unsurprisingly, then, the document includes several languages: German (it was published in Germany), Russian, Kashaya Pomo, and Bodega Miwok. All but the German entries are in a Cyrillic writing system, so half the work consisted of transcribing the original orthographies. The Russian transcription is of its time: ѣ’s and Ѣ’s abound, so I modernized those (since I don’t know much Russian and had to look things up in modern dictionaries). I didn’t do much with the Bodega Miwok, except try to transcribe it.
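
For anyone attempting the same kind of modernization, here is a minimal sketch of what it could look like. It is not the script used for this transcription. It assumes the general rules of the 1917-18 Russian spelling reform (ѣ to е, і to и, ѳ to ф, ѵ to и, and dropping the word-final hard sign); the names `PRE_REFORM` and `modernize` are made up for illustration. It does not handle other pre-reform conventions such as older adjective endings.

```python
import re

# Letters dropped by the 1917-18 reform, mapped to their modern equivalents.
# Code points are written as escapes so look-alike Latin letters cannot sneak in.
PRE_REFORM = str.maketrans({
    "\u0463": "\u0435",  # ѣ -> е
    "\u0462": "\u0415",  # Ѣ -> Е
    "\u0456": "\u0438",  # і -> и
    "\u0406": "\u0418",  # І -> И
    "\u0473": "\u0444",  # ѳ -> ф
    "\u0472": "\u0424",  # Ѳ -> Ф
    "\u0475": "\u0438",  # ѵ -> и
    "\u0474": "\u0418",  # Ѵ -> И
})

# Hard sign (ъ or Ъ) at the very end of a word; word-internal ones are kept.
FINAL_HARD_SIGN = re.compile("[\u044a\u042a]\\b")


def modernize(text: str) -> str:
    """Convert pre-reform Russian spelling to an approximation of modern spelling."""
    text = text.translate(PRE_REFORM)
    return FINAL_HARD_SIGN.sub("", text)


print(modernize("хлѣбъ"))  # хлеб
print(modernize("міръ"))   # мир
```

The output is only a lookup aid for modern dictionaries; it is worth keeping the original spelling alongside it rather than overwriting it.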

The Kashaya was most of the work, and it was mostly a matching game: trying to figure out how the Cyrillic transcriptions mapped onto the late Robert Oswalt’s materials and orthography.

My work on this is 6 or 7 years old now, and if I redid it today I would probably do it differently. Even so, the data isn’t in too bad of a state (there’s a JSON file). The quality of the content in this old document is pretty amazing, and it’s pretty rare in California to have material that old at all.

clriley · April 19, 2020, 4:06pm

Hi Nick and Pat,
I have been working on two sets of historical documents that may be of interest. One is a set of 300 notebook pages from a Sierra Leonean goldsmith who was active in the 1950s; he wrote in the Mende Kikakui script. Tools I’m starting to use for that include Mirador and a customized Unicode input application. Another is the 1913 diary of Boima Kiakpomgbo, kept in the Vai script, which runs to 180 pages.

Charles

aventayolboada · April 19, 2020, 6:25pm

Hi all!
I’ve recently started working with some of Jochelson’s Yukaghir legacy materials from the late 19th century. There’s a collection of 100+ texts (in Cyrillic and some sort of Roman transliteration), a grammar sketch and vocabulary list. I’m trying to develop a corpus with these materials and contemporary texts, and hopefully my own fieldwork.

I have a question for you @pathall, since you’ve mentioned JSON. I’m quite new to this; I’ve been mostly going with XML for each text (following the BNC structure). Would you recommend a different encoding?

Albert

nqemlen · April 19, 2020, 7:19pm

Wow Pat, that’s a fascinating history, and a rich set of data. Looking forward to hearing more about it!

nqemlen · April 19, 2020, 7:21pm

Hi Charles,
Nice to be back in touch! Those sound like really interesting corpora. Are these scripts widely used, or were they in the past?
Nick

clriley · April 19, 2020, 9:22pm

Both are now past their period of heaviest use, but there is still some interest and limited expertise to be found. In the Mende case, we’re running into some gaps and unknowns that the Unicode proposal as approved did not cover. The Vai has largely been translated on a first rough pass, but will be under closer examination now to prepare it for publication.

Charles

nqemlen · April 20, 2020, 6:49pm

Sounds fascinating, Charles! Can’t wait to learn more about it.
Nick

pathall · June 4, 2022, 3:46pm

Don’t know if you’re around, @aventayolboada, but I’m curious to know what became of this project — already almost two years!

pathall · June 21, 2022, 3:51pm

5 posts were split to a new topic: Digitizing Tunen texts

hp3 · June 21, 2022, 8:03pm

There has been a lot of work in Coeur d’Alene and in Oregon on older manuscripts in various orthographies in Indigenous North American languages. There has also been a lot of work on Mixtec-language manuscripts from around the time of the Spanish invasion of Mexico. These might be of interest if you are looking for CMS and sociotechnical systems for collaborative transcription.