Diarization and language ID in elicitation sessions

Hello all-

I want to make sure that this is not a solved problem before trying to tinker around in this area (or get some NLP people to tinker for us). I am about to go do fieldwork in which the language of elicitation will be English or Cameroonian Pidgin English, and the responses will be in a mixture of English, Cameroonian Pidgin English, and the target languages. Speakers will generally speak more than one target language. There might be code-switching within utterances, as in:

“ʒʉ́ *na one*, bə̀ʒʉ́ *na many*, sugye *ant*.” (plain = target, italics = Pidgin)

Similarly, I have a student who’s got hours of recordings in mixed northwestern Mandarin and Xibe (a Tungusic language), with some code-switching within utterances.

If you want to auto-transcribe (using ASR) speech from an elicitation session, you need to mark off where speech is and isn’t happening, and you also need to ID particular speakers if you’re going to go on to do forced alignment on the data (as a phonetician, I definitely will). This process (diarization) is already tricky, but I imagine it’s much harder when speakers are switching between languages. And for many downstream purposes you also need to know the language being spoken in each interval (in my case, so you can select the right acoustic model for forced alignment, or use the correct data set to train an acoustic model for a particular language).

So:

  • Is there already a general solution, or general architecture, that solves the diarization issue? (For example, a script that spits out a list or array of diarization time points separated out by speaker, then language, which one could turn into a TextGrid or whatever else; see the sketch after this list for the shape I have in mind.)
  • How to deal with code switching within an utterance, which seems particularly hard?
  • Has any NLP work/linguistics work/joint work gestured at this as a problem (but not solved it)? I might need to do a little lit review…
  • Are there any considerations I’m missing in this discussion? (Would this be useful for other field situations beyond my own? Would there be problems for other field situations?)
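
To make that first bullet concrete, here’s a hypothetical sketch of the output shape I’m imagining: a list of (start, end, speaker, language) spans written out as a Praat short-format TextGrid, one interval tier per speaker, with the language label as the interval text so forced alignment can pick the right acoustic model per interval. All names and data below are made up; any diarizer that emits labeled time spans could feed this.

```python
# Hypothetical sketch: (start, end, speaker, language) spans -> a Praat
# short-format TextGrid with one interval tier per speaker. All names and
# data are invented for illustration.
spans = [
    (0.0, 2.1, "linguist", "english"),
    (2.3, 4.8, "consultant1", "pidgin"),
    (4.8, 6.0, "consultant1", "target"),
]

def write_textgrid(spans, duration, path):
    # Group spans into one tier per speaker.
    tiers = {}
    for start, end, speaker, lang in spans:
        tiers.setdefault(speaker, []).append((start, end, lang))
    with open(path, "w", encoding="utf-8") as f:
        f.write('File type = "ooTextFile"\nObject class = "TextGrid"\n\n')
        f.write(f"0\n{duration}\n<exists>\n{len(tiers)}\n")
        for speaker, turns in sorted(tiers.items()):
            # Praat wants each tier's intervals to tile [0, duration],
            # so pad the gaps between turns with empty intervals.
            filled, cursor = [], 0.0
            for start, end, lang in sorted(turns):
                if start > cursor:
                    filled.append((cursor, start, ""))
                filled.append((start, end, lang))
                cursor = end
            if cursor < duration:
                filled.append((cursor, duration, ""))
            f.write(f'"IntervalTier"\n"{speaker}"\n0\n{duration}\n{len(filled)}\n')
            for start, end, lang in filled:
                f.write(f'{start}\n{end}\n"{lang}"\n')

write_textgrid(spans, duration=6.0, path="session.TextGrid")
```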

Don’t have time for a full response atm, but just want to pop in and note that diarization has received a fair amount of attention in NLP, including recently (see Google Scholar). Will try to come back and say more later!


Well, I have some negative results to share, I guess!

My only experience in this area is messing around with Amazon Transcribe — it does have some diarization features, but the results do not look good. There are plenty of issues around using an Amazon product, of course.

I wrote about their system a bit here:

On the utility of horrible Speech to Text

I just ran a test with this random video I found on YouTube:

[Screenshots of the Transcribe workflow: specify the job, configure the job, set optional configuration, wait a while, and once it’s all done, download the JSON file.]
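
For reference, that console flow maps onto the Transcribe API roughly as follows. This is a minimal boto3 sketch with hypothetical bucket and job names, not the exact job I ran:

```python
# Minimal boto3 sketch of the console workflow above (hypothetical names).
# IdentifyLanguage + LanguageOptions mirrors the language settings in the
# console; ShowSpeakerLabels/MaxSpeakerLabels mirrors the speaker settings.
import boto3

transcribe = boto3.client("transcribe")
transcribe.start_transcription_job(
    TranscriptionJobName="update-on-trilingualism",           # hypothetical
    Media={"MediaFileUri": "s3://my-bucket/trilingual.mp4"},  # hypothetical
    IdentifyLanguage=True,                        # guess the language...
    LanguageOptions=["fr-FR", "en-US", "id-ID"],  # ...from these candidates
    Settings={"ShowSpeakerLabels": True, "MaxSpeakerLabels": 5},
    OutputBucketName="my-bucket",                 # hypothetical
)
```

If I understand the docs, IdentifyLanguage picks a single dominant language for the whole file, and per-segment identification is a separate IdentifyMultipleLanguages flag, which may explain the all-French result below.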

So then you end up with a JSON file; feel free to take a look:

update-on-trilingualism-asr.json (74.9 KB)
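
If you want to pull the speaker turns out of a file like that, the segments live under results.speaker_labels.segments. A sketch, assuming the standard Transcribe output schema and untested against this particular file:

```python
# Sketch: print (speaker, start, end) turns from an Amazon Transcribe result.
# Assumes the standard results.speaker_labels.segments schema, in which the
# timestamps are stored as strings.
import json

with open("update-on-trilingualism-asr.json", encoding="utf-8") as f:
    result = json.load(f)

for seg in result["results"]["speaker_labels"]["segments"]:
    print(seg["speaker_label"],
          float(seg["start_time"]),
          float(seg["end_time"]))
```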

I either made a mistake or the audio quality was too low, but as far as I can tell it’s not doing any language identification: it assumed everything was French, despite my configuration saying that there was French, English, and Indonesian content.

:thinking:

And it only seems to have identified two speakers, despite my configuring it to expect five.

:man_shrugging:

You might also be interested in @nikopartanen’s comment in the topic linked above:

https://www.langdoc.net/t/on-the-utility-of-horrible-speech-to-text/681/3

Also, relevant:

https://twitter.com/cynixy/status/1520016251380244482?s=20&t=HuXmd6u1zQVtlyRVnSM-NA

(Replying in multiple posts because I’m limited as a new user to 2 links per post)

Hi @faytak — how timely! I have some answers (informed by a lot of trial and error):

  • Is there already a general solution, or general architecture, that solves the diarization issue? (For example, a script that spits out a list or array of diarization time points separated out by speaker, then language, which one could turn into a TextGrid or whatever else.)

Yes and no. For speaker diarization, Hervé Bredin’s overlap-aware resegmentation ([2104.04045] End-to-end speaker segmentation for overlap-aware resegmentation), available as pyannote/segmentation · Hugging Face, seems to be the state of the art for annotating when various speakers are speaking (even when they overlap).
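
If you want to kick the tires, here’s a minimal sketch using the full pyannote/speaker-diarization pipeline (which uses that segmentation model under the hood). It assumes pyannote.audio 2.x, that you’ve accepted the model terms on Hugging Face, and a hypothetical filename:

```python
# Minimal sketch: overlap-aware speaker diarization with pyannote.audio 2.x.
# Requires accepting the model's terms on Hugging Face and an access token.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization", use_auth_token="YOUR_HF_TOKEN"
)
diarization = pipeline("elicitation-session.wav")  # hypothetical file

# Each track is (segment, track_id, speaker_label); overlapping speech
# comes out as overlapping segments with different labels.
for segment, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{segment.start:.2f}\t{segment.end:.2f}\t{speaker}")
```

From there, the (start, end, speaker) triples drop straight into a TextGrid like the sketch earlier in the thread; the language labels would still have to come from a separate language ID pass or manual annotation.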

  • How to deal with code switching within an utterance, which seems particularly hard?

Yes, this is really hard. The most recent paper I’ve seen is by Liu et al. (paper, repo), but I haven’t tried it out myself.
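
In the meantime, one crude baseline (emphatically not Liu et al.’s method) is to slide a window over each utterance and run a pretrained spoken language ID model per window, e.g. SpeechBrain’s VoxLingua107 classifier. It only knows 107 majority languages, so for fieldwork targets the best you can hope for is treating high-confidence windows as the contact language and the rest as “maybe target”; the window and hop sizes here are guesses:

```python
# Crude code-switching baseline (NOT Liu et al.'s method): sliding-window
# spoken language ID with SpeechBrain's VoxLingua107 model. The model does
# not know most fieldwork target languages, so low-confidence windows are
# at best a hint of "not a known language". Window/hop sizes are guesses.
import torchaudio
from speechbrain.pretrained import EncoderClassifier

classifier = EncoderClassifier.from_hparams(
    source="speechbrain/lang-id-voxlingua107-ecapa", savedir="tmp_langid"
)

signal, sr = torchaudio.load("utterance.wav")  # hypothetical file, mono
win, hop = int(1.0 * sr), int(0.5 * sr)        # 1 s window, 0.5 s hop

for start in range(0, signal.shape[1] - win + 1, hop):
    window = signal[:, start:start + win]
    _, score, _, label = classifier.classify_batch(window)
    print(f"{start / sr:.2f}s  {label[0]}  (score {score.item():.2f})")
```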

  • Has any NLP work/linguistics work/joint work gestured at this as a problem (but not solved it)? I might need to do a little lit review…

A little self-promotion: my paper [2204.07272] Automated speech tools for helping communities process restricted-access corpora for language revival efforts may be one place to get started. We didn’t look at speaker diarization, but I think adding an additional stage using pyannote/segmentation should be relatively straightforward.

  • Are there any considerations I’m missing in this discussion? (Would this be useful for other field situations beyond my own? Would there be problems for other field situations?)

One thing I note from having run timed experiments on correcting voice activity detection/language identification output: the machine-assisted workflow for VAD+SLI offered no time savings over doing it manually, because code-switched speech is very hard to segment, and annotators found it more frustrating to correct bad output than to start from scratch. All our time savings came from ASR speeding up the transcription of the parts identified as English…

Edit:

Also, I just remembered that Cécile Macaire recently reported on ASR and keyword search for two creole languages: https://hal.archives-ouvertes.fr/hal-03625303/document


pyannote on Hugging Face looks pretty excellent and I will have to look into it, thanks. Any time savings compound pretty quickly during and after fieldwork, so I’d be perfectly happy with good speaker detection/diarization and adding the language switches manually. So much time during annotation is just spent scrolling.
