DoReCo is now live

Check this out, y’all! DoReCo is now live, with EAFs and audio for 51 languages! Very cool.

The DoReCo database contains corpora on 51 languages from 32 top-level language families (as classified in Glottolog), covering languages from all inhabited continents and all linguistic macro-areas. Most of these data were originally collected in the context of language documentation projects focusing on preserving linguistic practices and traditions. They contain mostly monological, narrative texts, though some texts also represent conversations and stimulus retelling. Most datasets were extracted from larger collections archived in repositories such as TLA or ELAR.

In total, DoReCo contains over 100 hours of recordings with transcriptions that are time-aligned at the word and phone levels. The minimum amount of data per language is 35,000 phones (although some datasets are slightly below that mark), corresponding to more than 10,000 words for isolating languages. The total number of core texts is 893, equivalent to 17 texts on average per language. Numbers of unique speakers per core dataset range from 1 (Kamas, Texistepec Popoluca, Yongning Na) to 30 (Urum). All texts are also translated, mostly into English, but in some cases also Portuguese, German, Russian, Swahili and other languages.

For 38 languages, DoReCo provides time-aligned interlinear morpheme glosses. For most of these 38 languages, additional texts with interlinear glosses that are not time-aligned are contained in the DoReCo extended set. In total, DoReCo provides over 300,000 tokens of time-aligned interlinear glossed text and another 300,000 tokens of glossed texts without time alignment. Each DoReCo dataset is accompanied by extensive corpus documentation on orthographic conventions, abbreviations used in glosses, and other useful information.
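For anyone who wants to poke at the time alignment directly: EAF files are plain XML, so a rough sketch like the one below can pull the start time, end time and text out of one tier using only the Python standard library. The tier name `words` and the tiny embedded sample are made up for illustration. I haven't checked what the DoReCo tiers are actually called, so look at the corpus documentation for the real tier IDs.

```python
import xml.etree.ElementTree as ET


def read_aligned_tier(root, tier_id):
    """Return (start_ms, end_ms, text) tuples for one time-aligned tier."""
    # Map each time slot ID to its millisecond value.
    # Slots without a TIME_VALUE are skipped and will resolve to None below.
    slots = {
        ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
        for ts in root.iter("TIME_SLOT")
        if ts.get("TIME_VALUE") is not None
    }
    rows = []
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        # Only ALIGNABLE_ANNOTATION elements carry their own time slot references.
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            text = ann.findtext("ANNOTATION_VALUE", default="")
            rows.append((start, end, text))
    return rows


# Hypothetical miniature EAF document; real files contain many more tiers and attributes.
SAMPLE = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="420"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="900"/>
  </TIME_ORDER>
  <TIER TIER_ID="words">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>hello</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>world</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""

root = ET.fromstring(SAMPLE)
words = read_aligned_tier(root, "words")
for start, end, text in words:
    print(start, end, text)
```

For a file on disk, `ET.parse(path).getroot()` gives the same kind of `root` object. Tiers that depend on a parent tier use `REF_ANNOTATION` elements instead and are not covered by this sketch.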

Currently cramming my hard drive… I’ll update this post with more thoughts soon. I’d also welcome comments from anyone who contributed to the project or is thinking about working with the collections in some way.

Congrats to local heroes @rgriscom for his work on Asimjeeg Datooga and @Andrew_Harvey for his work on Gorwaa! (Did I miss anyone else here?)

Yes!! Been waiting for this for so long!
