It’s been great to be involved in this project! I think this kind of collaboration between people who want to use language data (in this case DoReCo) and people who collect that data (in this case me) is going to happen more and more in the future. I want not only to continue participating in it, but also to encourage other people doing language documentation to do the same.
Here’s a brief snapshot of what that looked like for me:
-In early March of 2019, @rgriscom sent me an email saying that a colleague of his was joining a project and that they were looking for samples of language that met the following criteria:
  - a minimum of 10,000 transcribed words (typically distributed over various recording sessions/annotation files)
  - translation into a major language
  - primarily monological texts (e.g., personal or traditional narratives)
  - time-alignment of transcription and translation with audio files at the level of sentences, paragraphs, utterance, or intonation units (i.e., “annotation units” in ELAN, time stamps in Toolbox records)
  - audio is of reasonable quality (not too much overlapping speech or background noise)
  - transcription/translation/annotation files (not audio/video files) can be made accessible within three years on the DoReCo platform under a Creative Commons Attribution 4.0 (CC BY 4.0) license, with strict rules for fair scientific use (see below)
…as well as to indicate if the data includes at least 10,000 words that are additionally morphologically annotated (typically using Toolbox/Shoebox) with (i) morpheme segmentation, (ii) morpheme glosses, and (optionally) (iii) part-of-speech tags.
There was an additional understanding that following this initial ‘donation’, I would, over the course of the project, provide:
  - A chart specifying correspondences between the orthographic characters used in the transcription and IPA symbols
  - Answers to the project team’s questions regarding e.g. inconsistencies between the audio and the transcription (e.g. transcribing/glossing elements that are not transcribed)
  - Basic metadata per recording session if not already available (e.g. anonymized speaker codes, speaker sex and approximate age)
-Once I had reached out to the DoReCo contact email and sent them a short example of a parsed and glossed text, I received a response very quickly (around one week later) saying that my material looked like a good fit. I was then asked to specify a subset of recordings from the archived collection of Gorwaa materials which were similar to the sample I had provided.
-Around early October 2019 (and after some short exchanges back and forth about small things like missing files), I was told that my data had been processed with the MAUS software and given automatic word-level alignments. I was then asked if I would like to take over post-processing of this material (things like identifying code-switching, filled pauses, missing transcriptions, etc.). I was also told that there was a small amount of funding available for this, which would cover any time I might have to put into it.
-I originally thought that I would be able to do this, but it turned out that I just didn’t have the capacity. There were several reasons for this, primarily Covid and being separated for several months from the computer I usually use to process my files as a result of pandemic travel restrictions. I ended up making this decision around 12 months later (October 2020). Amazingly, DoReCo was still able to work with my data, and employed a plan B to have the material post-processed by an assistant on their side.
-In September (2020) DoReCo got back in touch, at which point my material had been successfully post-processed by their assistant. I was asked to review the ELAR files to make sure what they’d done was an accurate representation of the recordings, and was given several specific questions to respond to (things I might not have transcribed but were clearly in the recording, identifying typos, etc.). This was really straightforward, and required just a few hours’ work on my end. By mid-November (2020), I had responded to everything.
-Since that time, the Gorwaa materials processed by DoReCo have been used in a recent (2021) paper (DOI here; open access here), with my archive deposit as the citation for the data used therein.
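For anyone wondering whether their own collection clears the 10,000-word criterion listed above, here is a minimal sketch of how the count could be checked across ELAN files. It is only an illustration under stated assumptions: the tier name `tx`, the helper names, and the tiny sample document are all made up for the example, and a "word" is simply taken to be a whitespace-separated token, which may not match how DoReCo counts.

```python
import xml.etree.ElementTree as ET

# A made-up, minimal ELAN-style document (placeholder words, not real data).
# The tier IDs "tx" (transcription) and "ft" (translation) are assumptions;
# real collections use whatever tier names the documenter chose.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="tx" LINGUISTIC_TYPE_REF="default-lt">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>one two three</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>four five</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="ft" LINGUISTIC_TYPE_REF="default-lt">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a3" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>a free translation here</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def count_words_on_tier(root, tier_id):
    """Count whitespace-separated tokens in all annotations on one tier."""
    total = 0
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        for value in tier.iter("ANNOTATION_VALUE"):
            # Empty annotations have text None, so fall back to "".
            total += len((value.text or "").split())
    return total


def meets_word_minimum(roots, tier_id, minimum=10_000):
    """Sum the word count over several documents and compare to the minimum."""
    total = sum(count_words_on_tier(root, tier_id) for root in roots)
    return total, total >= minimum


if __name__ == "__main__":
    sample_root = ET.fromstring(SAMPLE_EAF)
    print(meets_word_minimum([sample_root], "tx"))
```

With real files, each root could be obtained with `ET.parse(path).getroot()` instead of `ET.fromstring`, and the tier name would need to be replaced with the transcription tier actually used in the collection.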
And now for some reflections:
-Having the Gorwaa data used as part of the DoReCo project involved back-and-forth between the project team and me, and went FAR beyond me just having my materials openly accessible online and DoReCo downloading them for use. This will probably be the case for most re-use of archived materials, and should be understood by those of us who want our materials to be reused. It’s not a passive process for the language documenter, and we should be aware of (and prepared for) this.
-DoReCo made things explicit from the very beginning (criteria for what kinds of data they were looking for, expectations of contributions from the documenter, etc.). This was crucial for my participation because I knew from the start what kind of time I would have to put into this.
-It should also be noted that DoReCo explicitly sought lesser-documented (or lesser-supported) languages for its sample. This approach should be recognised and acknowledged as an important step in the right direction.