There is a long-running debate in documentation about whether to "keep everything" or not. Prime candidates for "not" are the recordings you might make when working directly with a speaker in open-ended elicitation, re-hearing, and the like. The recordings can be long, there's no way you're going to transcribe the whole thing, and they're pretty hard to use.
I'm sure other people have tried this, but I thought I'd bring up the topic: what about running those recordings through automatic speech recognition (ASR)? I did that with some old recordings today and it was pretty interesting to see what came out. They were one-on-one recordings, just me and one speaker, and of course the stuff that wasn't English came out as garbage.
But the English side is quite usable, from both of us. For questions like "dagnabbit, I know we talked about 'oysters' at some point, and that's when we came across that good verb...", ASR can be a lifesaver.
In the case of Amazon Transcribe, it outputs a JSON file with timestamps down to the word level, which could, with a little massaging, be plonked into ELAN or something for further work.
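The massaging really is minimal. Here's a rough sketch of the conversion in Python, assuming the JSON layout I got back (word items under results > items, each with start_time and end_time); the filenames and tier name are placeholders, and ELAN's tab-delimited text import can take it from there:

```python
# Sketch: pull word-level timestamps out of an Amazon Transcribe JSON file
# and write a tab-separated file for ELAN's tab-delimited text import.
# Double-check the field names against your own output before trusting it.
import json

def transcribe_json_to_tsv(json_path, tsv_path, tier="ASR"):
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)

    with open(tsv_path, "w", encoding="utf-8") as out:
        out.write("tier\tstart\tend\tword\n")
        for item in data["results"]["items"]:
            # punctuation items carry no timestamps, so skip them
            if item["type"] != "pronunciation":
                continue
            word = item["alternatives"][0]["content"]
            out.write(f"{tier}\t{item['start_time']}\t{item['end_time']}\t{word}\n")

transcribe_json_to_tsv("session01.json", "session01.tsv")
```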
I think ultimately there might be better ways to manage the acquisition of field data in the first place (what if the annotations you wrote down during the recording were automatically timestamped?), but it's probably true that we all have monolithic, more or less opaque recordings of this sort lying about that we'd like to do something with.
This sounds like a great workflow for people who work in English or other ASRable languages! (Alas, my recordings pretty much only contain Syuba and Nepali.)
For longer elicitation sessions I'd try to listen back to them in ELAN and mark up targeted examples that I'd made notes of. This let me confirm my notes (or change things), and it also gave me a structured anchor into the elicitation if I came back to it.
It's funny how I never thought to tell anyone about workflow choices like this before the forum made me realise they were actually choices I made!
Yeah, this bit is a drawback. With time, etc., etc. Actually, in my Hiligaynon case, I was thinking that maybe I should have told the recognizer that the content was in English plus some other language with a similar phonology, maybe Indonesian, to see if that would have improved anything. (Tagalog probably would have, but Amazon doesn't support that either.)
Right? I think this is actually reflective of a big problem in our field. Because we ('re forced to) rely on tools that are only designed for a subset of the kinds of things we actually do, there's a lot of knowledge that we pass around kind of "guild-style". It's hard to externalize (let alone standardize or institutionalize) the workflows we really use to get stuff done.
I would argue for "record everything and keep everything": storage these days is so cheap, and you never know what information is going to be of interest to others, including the community, now and into the future. A late colleague who was working in Australia in the 1970s was told by her supervisor to record only texts (storytelling) and NOT the translations of them or the elicitation sessions. She lived to regret that decision, and it is a decision, as she had no way to go back and check discussion about meanings or contexts of use, let alone her own misunderstandings at the time, which she only realised after doing further analysis.

There is also lots of discussion in the contact language (English, Spanish, Nepali, Tagalog, etc.) that may be of interest to people wanting to study contact varieties, or to the community and others for the content rather than the linguistic form.
I for one agree with that... I've never seen how it makes sense to delete anything, really. It's not too much of a burden to put some basic metadata in place, or better, some time-stamped notes of the sort that @laureng mentions, and then stash it somewhere (well, a few redundant somewheres!).
It seems likely that technologies like ASR will only become more general in the future, after all: more languages, better accuracy. Who knows what kinds of uses will become much easier down the road?
Besides keeping everything, think about how to preserve it. My current sigline in emails is "Digital objects last forever - or five years, whichever comes first" (which I owe to Jeff Rothenberg). Seriously, no digital media lasts forever: not hard drives, not SSDs, not tapes, not CDs or DVDs (especially the kind you write on your computer, as opposed to the ones you buy with pre-recorded content). Most of those have a shelf life of a decade or two.
Your best bet is to find an archive repository. That way your data can (hopefully!) outlast you. Although I understand they can be hard to find if your data doesn't come from the right part of the world, and they may have rules about required metadata. (Usually those rules are reasonable, although I've heard of exceptions.)
I had this idea recently! Specifically, I've been trying to use Google's speech-to-text API, with the idea that it would at least make my elicitation & translation sessions (I opt to record almost everything) searchable. It's been a bit challenging with my limited coding knowledge, especially to get something resembling an SRT file or suchlike, but I'm optimistic it would make life a lot easier.
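In case it helps anyone fumbling through the same thing, here's roughly the shape of what I've been hacking on. This is a sketch rather than a finished script: it assumes the v2-style google-cloud-speech client and a file already sitting in a Cloud Storage bucket, and the bucket and filenames are made up.

```python
# Rough sketch: long-running recognition with word time offsets,
# written out as a crude .srt so the session becomes text-searchable.
from google.cloud import speech

def srt_timestamp(seconds):
    # format seconds as HH:MM:SS,mmm for SRT
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    ms = int((seconds - int(seconds)) * 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

client = speech.SpeechClient()
config = speech.RecognitionConfig(
    language_code="en-US",
    enable_word_time_offsets=True,
)
audio = speech.RecognitionAudio(uri="gs://my-fieldwork-bucket/session01.wav")
response = client.long_running_recognize(config=config, audio=audio).result()

with open("session01.srt", "w", encoding="utf-8") as srt:
    for i, result in enumerate(response.results, start=1):
        alt = result.alternatives[0]
        if not alt.words:
            continue
        start = alt.words[0].start_time.total_seconds()
        end = alt.words[-1].end_time.total_seconds()
        srt.write(f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
                  f"{alt.transcript.strip()}\n\n")
```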
Hi @mayhplumb, welcome! Please feel free to get into the nuts and bolts of your process if you like; there are lots of fellow code critters about.
I used Amazon's ASR, which gives quite usable output in the form of a JSON document.
The system can also do some pretty interesting things, like trying to identify speakers, and it gives timestamps down to the word level.
I've never tried to do this myself, but it's also possible to expand the recognition vocabulary before running the recognition algorithm.
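For anyone curious what kicking off a job looks like, here's roughly what I ran via boto3, with the speaker-label and custom-vocabulary options filled in the way the docs describe. The bucket, job name, and vocabulary name are placeholders, and I haven't actually exercised the vocabulary part myself.

```python
# Sketch: start an Amazon Transcribe job with speaker labels and a
# custom vocabulary. Bucket/job/vocabulary names are placeholders.
import boto3

transcribe = boto3.client("transcribe")

# Optional: register extra terms you expect in the session. The vocabulary
# has to finish processing (state READY) before a job can reference it.
transcribe.create_vocabulary(
    VocabularyName="fieldwork-session-terms",
    LanguageCode="en-US",
    Phrases=["Hiligaynon", "reduplication"],
)

transcribe.start_transcription_job(
    TranscriptionJobName="session01",
    Media={"MediaFileUri": "s3://my-fieldwork-bucket/session01.wav"},
    MediaFormat="wav",
    LanguageCode="en-US",
    Settings={
        "ShowSpeakerLabels": True,   # speaker diarisation
        "MaxSpeakerLabels": 2,       # one speaker plus me
        "VocabularyName": "fieldwork-session-terms",
    },
)
```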
Thanks for this interesting post, Pat. This is perhaps not exactly what you were thinking of, but there have been a few initiatives to use ASR and other AI approaches to address the transcription bottleneck.
I've only used Prosodylab-Aligner for forced alignment myself. You can start with forced alignment to create training data for an ASR system, though.
ELPIS has been in development for a while now and I really hope they finish it soon. They switched from using Kaldi to ESPnet, so perhaps that is why it is taking a bit longer. My understanding is that LD practitioners could be integrating ASR tools into their workflow, but much of the general-purpose ASR software is challenging to use, and that is the gap that ELPIS is designed to fill.
Has anyone used Trint for this (or other tasks)? I ask because my university now has a site license, but it's hard to know what languages they support.
I think many know this from experience already, but I did some timed tasks with some RAs, and the "transcription admin" (identifying speech/non-speech and speakers) took as much if not more time than English transcription (Table 4: https://arxiv.org/pdf/2204.07272.pdf). So if the service gives you good enough speech activity detection/speaker diarisation as part of the ASR output, it might be worth looking into even if the transcriptions are all gibberish and you throw them away...
@cbowern - not Trint, but I was recently talking to Ruth Singer and she says she's been using Descript https://www.descript.com/ for speaker diarization and English transcription. She sent me an output file (.srt), which I was relatively easily able to wrangle into a tab-separated file in R for import into ELAN.
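(I did that step in R, but the same wrangling in Python looks roughly like this. It's a sketch, the filenames are placeholders, and all it does is parse the .srt into begin/end/text columns that ELAN's tab-delimited import understands.)

```python
# Sketch: convert an .srt (e.g. from Descript) into a tab-separated file
# with millisecond begin/end times for ELAN's tab-delimited text import.
import re

def srt_time_to_ms(t):
    # "HH:MM:SS,mmm" -> milliseconds
    h, m, s_ms = t.split(":")
    s, ms = s_ms.split(",")
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)

def srt_to_tsv(srt_path, tsv_path):
    with open(srt_path, encoding="utf-8") as f:
        blocks = re.split(r"\n\s*\n", f.read().strip())
    with open(tsv_path, "w", encoding="utf-8") as out:
        out.write("begin_ms\tend_ms\ttext\n")
        for block in blocks:
            lines = block.splitlines()
            start, end = (srt_time_to_ms(x.strip()) for x in lines[1].split("-->"))
            text = " ".join(lines[2:])
            out.write(f"{start}\t{end}\t{text}\n")

srt_to_tsv("interview01.srt", "interview01.tsv")
```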
@laureng - it's not quite self-serveable like Trint/Descript/Otter/etc. (perhaps it could be with Elpis?), but Nepali is apparently ASRable (as of late last year). There's a set of openly released ASR models for Indic languages, IndicWav2Vec | AI4Bharat IndicNLP, with a 9-11% error rate (Table 3: https://arxiv.org/pdf/2111.03945.pdf).
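It's not a turnkey pipeline, but if the released checkpoints load through HuggingFace transformers (which I believe at least some of them do, though I haven't verified the Nepali one), decoding a clip is only a few lines. The model ID below is a placeholder; substitute the actual checkpoint name from the IndicWav2Vec release:

```python
# Sketch: decode a short 16 kHz mono clip with a wav2vec 2.0 CTC checkpoint.
# The model ID is a placeholder; use the real one from the IndicWav2Vec release.
import torch
import soundfile as sf
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL_ID = "ai4bharat/indicwav2vec-nepali"  # placeholder name

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

speech, sample_rate = sf.read("clip_16khz_mono.wav")
inputs = processor(speech, sampling_rate=sample_rate, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits

predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```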
I've been using Descript for a few months now to help transcribe some interviews that are mainly in English, and as Claire shows it does export to ELAN quite well, even preserving speaker separation. However, in looking at our transcripts in ELAN now, we have discovered a bit of a major problem: the timestamps get slightly out of alignment the further into each transcript you go. This is because sometimes, when you edit the automated transcript in Descript by cutting out an entire word, it cuts some time out of the media file (which you import into Descript). We didn't realise this, though. So each of our 1-hour interviews has about 10 of these little accidental "edits", and each media file in Descript is a little shorter than the original media file. It doesn't seem possible to turn this 'feature' (bug?) off.

I'm thinking now of going with another app. I emailed support at Otter.ai and this doesn't seem like it can happen in Otter.ai: the original media file is never altered. So be wary of apps like Descript that are marketed as a full package for podcasters. A great way of editing media via transcripts, but not so good for us. Sonix looks like it might have the same problem as Descript, from what I can garner from their help material.
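If anyone else wants to check whether Descript has quietly shortened their media, comparing the duration of the export against the original is enough to catch it. A rough sketch, assuming ffprobe (from FFmpeg) is on your PATH; the filenames are placeholders:

```python
# Sketch: compare the duration of the original recording and the Descript
# export to spot accidental "edits" that shorten the media.
import subprocess

def duration_seconds(path):
    # ask ffprobe for the container duration in seconds
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())

orig = duration_seconds("interview01_original.wav")
export = duration_seconds("interview01_descript_export.wav")
print(f"original: {orig:.2f} s, export: {export:.2f} s, missing: {orig - export:.2f} s")
```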
It's pretty bonkers that Descript actually edits the audio from the transcription... it must have been intentional, but dang, it certainly qualifies as a bug in my book.