Is there a name for what we're talking about?

I wonder if we need a new label.

Among the many interesting observations in recent posts here, there have been mentions of whether and how big-data-style approaches could be applied in language documentation: machine learning, NLP, computational linguistics, and so forth.

I’m all for the exploration and application of techniques of that ilk. But I also find myself thinking about how essentially all of those approaches to working with language really can’t get out of the gate without a sizeable corpus. It would be awesome if there were part-of-speech taggers and speech-to-text and text-to-speech and automated glossing and all that stuff for every language on the planet. But the fact of the matter is that for the vast majority of languages — and certainly for the vast majority of languages that documentary linguists and their colleagues work on — there just isn’t enough data to bootstrap such systems, yet.

Maybe I’m misunderstanding what it takes to get going with numerically oriented techniques. I know that certain NLP tasks can reach high accuracy with a fairly small corpus. (Part-of-speech tagging is often put forth as the poster child for tasks where high accuracy is fairly easily attainable.)
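
For concreteness, here is a minimal sketch of the kind of learning curve I have in mind. It assumes Python with NLTK and its bundled Penn Treebank sample (English data, so not the documentary setting, and neither tool is mentioned above); even a naive unigram tagger gets respectable token accuracy from a few hundred annotated sentences.

```python
import nltk
from nltk.corpus import treebank
from nltk.tag import DefaultTagger, UnigramTagger

nltk.download("treebank", quiet=True)  # small bundled sample, roughly 3,900 tagged sentences

tagged = treebank.tagged_sents()
test = tagged[3000:]

def accuracy(tagger, gold_sents):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        words = [w for w, _ in sent]
        for (_, gold_tag), (_, pred_tag) in zip(sent, tagger.tag(words)):
            correct += int(gold_tag == pred_tag)
            total += 1
    return correct / total

for n in (100, 500, 1000, 3000):
    # A unigram tagger with a "noun" fallback for unseen words: about as simple as taggers get.
    tagger = UnigramTagger(tagged[:n], backoff=DefaultTagger("NN"))
    print(f"{n:>5} training sentences -> {accuracy(tagger, test):.2f} token accuracy")
```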

But isn’t it an unavoidable fact that we are not beyond simple data entry? Put bluntly, we’re going to have to type. A lot. Even if we have help with glossing — and even if Zipf’s law is on our side, reasonable documentation of a language involves typing thousands and thousands of unique sequences of characters.
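
To make that concrete, here is a rough sketch (plain Python; the filename is hypothetical) of what Zipf’s distribution means in practice: a handful of frequent forms cover most of a running text, but every form in the long tail still has to be entered by hand at least once.

```python
from collections import Counter

# Hypothetical UTF-8 file of transcribed fieldwork text; any plain-text corpus works.
with open("transcripts.txt", encoding="utf-8") as f:
    tokens = f.read().split()

counts = Counter(tokens)
print(f"{len(tokens)} tokens, {len(counts)} distinct word forms")

# Zipf in action: a few very frequent forms account for much of the running text...
top_share = sum(n for _, n in counts.most_common(20)) / len(tokens)
print(f"top 20 forms cover {top_share:.0%} of tokens")

# ...but every form in the long tail still had to be typed at least once.
hapaxes = sum(1 for n in counts.values() if n == 1)
print(f"{hapaxes} forms occur exactly once")
```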

To my mind, this boils down to the fact that we face, primarily, a user interface problem. How do we make those difficult tasks somewhat less difficult? Well, by thinking in terms of user interfaces. By thinking — a lot — about design. About how linguists actually get their work done. About studying data lifecycles.

But this kind of stuff is traditionally not really in the linguist’s wheelhouse. Ling 101 does not help you learn to decide whether a dropdown or a checkbox is going to work better in your documentary task. And yet, I think, we do need to learn about that. We need to talk, and share what we learn, about how to turn what we do into the design of tools that help us do it.

So what is that process called?

“Documentation design”?

“Digital documentation”?

<your suggestion here…>


It would be awesome if there were part-of-speech taggers and speech-to-text and text-to-speech and automated glossing and all that stuff for every language on the planet. But the fact of the matter is that for the vast majority of languages — and certainly for the vast majority of languages that documentary linguists and their colleagues work on — there just isn’t enough data to bootstrap such systems, yet.

I think this is pretty much right for now, but also that for some tasks there’s been swift progress on getting “okay” results (i.e., good enough that correcting automated output is quicker than annotating from scratch) with minimal training data. The people behind ELPIS, for instance, showed that you can sometimes get OK speech recognition with only a few hours of transcribed speech: in Table 1 of their paper, with only an hour of training data, roughly 6 out of every 10 words were correctly recognized for Warlpiri, and with 2 hours the model got roughly 9 out of every 10 words right for Abui. (See also Neubig et al.’s ComputEL-3 paper.)
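
For anyone who wants to map “6 out of 10 words” onto the usual metric: speech recognition is typically scored by word error rate (WER), the number of word substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length, so roughly 6 in 10 words correct corresponds to a WER somewhere around 40%. A minimal, self-contained sketch (plain Python, made-up example strings):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dp[i - 1][j] + 1
            insertion = dp[i][j - 1] + 1
            dp[i][j] = min(substitution, deletion, insertion)
    return dp[len(ref)][len(hyp)] / len(ref)

# Made-up example: one substitution and one deletion against a five-word reference -> WER 0.4
print(wer("the speaker repeats the story", "the speaker repeat story"))
```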

To my mind, this boils down to the fact that we face, primarily, a user interface problem. How do we make those difficult tasks somewhat less difficult? Well, by thinking in terms of user interfaces. By thinking — a lot — about design. About how linguists actually get their work done. About studying data lifecycles.

Absolutely. Whenever this topic has come up, I’ve always droned on about how all of this is cool but practically useless until it’s integrated into an interface that, to put it plainly, people actually want to use. There has been work going back at least 20 years on the specific problem of how to use methods from NLP to assist in language documentation, and yet documentation people have still not benefited from it at all.

Perhaps unfortunately, linguists generally don’t see the work that goes into creating a platform for language documentation as intellectually meritorious. While I absolutely get what people who take this position are getting at (doing this work doesn’t involve the same kind of thought that goes into theory or description), it seems absurd to me that the end result is that a fundamentally important activity in our discipline, collecting and describing data, has not benefited from advances in computing, because the prevailing attitudes about the worth of this kind of work have made it about as unattractive for advancing your career as you could imagine.

I think the culture of our field could be different, too: I have a friend who is a PhD student in astrophysics, and some of her most successful publications had “only” to do with software packages she’d written for performing certain kinds of statistical tests, and not any sort of “proper” astrophysical question. You could say that her contribution was purely methodological, in the same way that you could say that a language documentation app would be a purely methodological contribution (because it facilitates methods used in pursuit of “genuinely” linguistic research questions).

I’m not sure about a name for this yet, and to be honest, I hadn’t even thought to name it. But giving it a name would be a first step in asserting its legitimacy as a research activity.
