What I believe the field of Language Documentation really needs to do

pathall · April 24, 2020, 3:45pm

Disclaimer: this is just my opinion, nothing more.

I hope you feel free to post your opinions on this forum, too. It’s here for you.

This is my honest diagnosis of our field’s situation with regard to technology.

By way of background, I’m currently writing my dissertation on the use of the web platform for building applications for language documentation. I don’t say that because I want to sound fancy-pants, I just say it because this topic is really pretty much all I think about professionally. I have been at it for a long time. It overrode my own interest in doing fieldwork pretty much entirely, which for me is a bummer. But I feel like these problems need to be addressed if the field is to stay on top of technology.

We need to stop thinking in terms of software workarounds, and start thinking in terms of workflows. What would user interfaces look like if they were designed to help us throughout the actual process of documentation? It’s true that we all owe a debt to ELAN, FLEx, Toolbox, and other existing tools for helping us to document a ton of language. But we have reached a sort of stagnation in the field as far as tech is concerned. We have journal articles about how to force pieces of software to work together that were not designed to work together. Should we be institutionalizing workarounds? No, we should be implementing applications that are designed for specific workflows. We need to start imagining those applications, and talking about how they could be. Only then can we actually create them.
We need to embrace the web. Language documentation is not really “part of” the web. Not really. The web is supposed to be about interlinked, usable information. But is our information — that is, documentary information in the terms that we use to theorize and describe language — really interlinked? Nope. We have made a lot of progress in archiving, but the way that our archiving is built is really in terms of documents, corpora and metadata. What about the data itself? Why can’t we look up all the things in a text or corpus of a particular grammatical category? Why can’t I slurp out all the pronouns (or whatever) from my corpus and line them up next to all the pronouns (or whatever) in your corpus automatically?
What the web can do now, and what “the web” even is, is not obvious. It’s not just web pages. It’s a programming environment. It can support all our stuff. First, we need to get to know the web again. Poke it. Prod it. Then we can get involved in the question of how to use it to solve our (rarely unique) problems.
The computer cavalry is not coming. There is no funding. There are no grants. There is no gang of programmers waiting to help us. They are not there. Okay, maybe that’s a little melodramatic. But honestly, we may as well assume that viewpoint. As far as a new start in using web technology for language documentation is concerned, it has to be an inside job: we must build it. And where skills are missing, we should work together as a field to learn them, together.
It’s not actually that hard (it’s even fun). Building things with web technology is fun. It’s not as hard to get started as it might seem. All you need is a text editor, a browser, and some gumption. If you learn some basics, even if you don’t become “a programmer,” you will be much more empowered to manage the digital incarnation of your research as you see fit. You will be more able to reformat and reuse your work. And most important of all, you will be more able to share with communities.

Which isn’t to say that web technology isn’t complicated. It is. But hey, so is language. And you’re a linguist, right? You can handle complexity. You’ve traced down wandering Wackernagels and uvular ejectives, what are a few tags and objects to you?

I feel kind of preachy and high-horse-y when I start rambling on this topic. What can I say? I have heavily invested the better part of my adult life in the “web linguistics” concept. So it matters a lot to me. More than anything, I hope that we can use this forum to learn together and work together. Especially those of us who have had the good fortune to be educated at fancy universities need to find ways to pay it forward. Let’s kick our research in the seat of its documentary pants, and make some whiz-bang stuff.

pathall · April 25, 2020, 1:51am

Ok but like… don’t thou me, I’m just a dude

I’m slightly embarrassed by the tone of this topic but I honestly sweat angst about stuff like this. I really, really want a ragtag band of upstart intergalactic smuggler documentary linguists to just take the bull by the horns and —

…wait, there I go again.

pathall · April 25, 2020, 1:42pm

Hey I wanted to be Chewwie

rgriscom · December 17, 2020, 9:26am

Just running through old threads…

I thought of your dissertation, Pat, when I saw this guide to digital dissertations posted on a DH mailing list:
https://digitalfellows.commons.gc.cuny.edu/digital-dissertations/

I definitely agree with this - even if as a beginner programmer, I find developing data formatting workarounds to be more accessible projects than developing new software from scratch! As someone who sits right in the middle of the somewhat artificial divide between those who code and those who do not, I can understand how it is hard to grasp the big picture of a software design for something that doesn’t exist yet.

In The Art of Community, Jono Bacon suggests that those wishing to start an open source community should have some sort of foundation for newcomers to build on. Would it be possible for you (and 1 or 2 others?) to start a web-based linguistics software project that includes some of the less controversial or complicated features, and then others could easily join and add to it?

pathall · December 17, 2020, 8:41pm

Huh thanks, this is an interesting document. I can point my committee at it!

This got long…

As for a foundation for newcomers, that is a priority for me. I think a lot about how to provide “multiple onramps” to web-based software. I have tried and failed to come up with an elevator pitch for what I’m up to — it has a lot of moving parts — but here’s the interpretive dance shpiel:

Data can be represented as objects and arrays of objects (in Python, “dictionaries” and “lists”)

The modern web can represent both documents (an interlinear text, a dictionary, a time-aligned text…) and applications (an interlinear text editor, a dictionary formatter, a tool for aligning text and media into an interlinear text…) using HTML.

So, we standardize on data a tiny bit (keeping it extensible), then we create a library of custom HTML tags that can “do stuff” with that data. The functionality (essentially, the programming bit) is built into the tag.

So if you have:

<text-view src="some-text.json"></text-view>

…then you get a decent web-rendering of an interlinear text with wrapping and responsive behavior on phones and such. The key point is that it works in the same way that standard HTML tags work. So for instance, the built-in <video> tag works like this:

<video src="some-video.mp4" controls></video>

Then you get a video player. You don’t have to know anything about how video processing works, or even how the “play” button makes the thing start playing. You just learn the tag. So we create a bunch of “documentation tags”.
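
To make the "functionality is built into the tag" idea concrete, here is a minimal sketch of how a tag like this can be defined with the browser's standard Custom Elements API. It is not the docling code: the `word-list` tag name, the `renderWords` helper, and the `{form, gloss}` data shape are all hypothetical, chosen only for illustration.

```javascript
// Escape text before placing it in HTML.
const escapeHTML = s =>
  String(s).replace(/[&<>"]/g, c => ({ '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;' }[c]))

// Pure function: turn an array of {form, gloss} objects into an HTML string.
// The data shape and class name are assumptions made for this sketch.
function renderWords(words) {
  return words
    .map(w => `<span class="word"><b>${escapeHTML(w.form)}</b> <i>${escapeHTML(w.gloss)}</i></span>`)
    .join('')
}

// Only register the tag where the Custom Elements API exists (that is, in a browser).
if (typeof customElements !== 'undefined') {
  class WordList extends HTMLElement {
    // Called by the browser when the tag is attached to the page.
    async connectedCallback() {
      const response = await fetch(this.getAttribute('src'))
      const words = await response.json()
      this.innerHTML = renderWords(words)
    }
  }
  customElements.define('word-list', WordList)
}

console.log(renderWords([{ form: 'hundo', gloss: 'dog' }]))
```

With something like this loaded, a page author would only write a hypothetical `<word-list src="words.json"></word-list>` and never touch the Javascript. Keeping the rendering in a plain function also means it can be tried outside a browser.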

My hope is that this will be an accessible starting point for a lot of people. With some basic HTML skills you can get to web-distributable documentation. I really want people to feel like creating their own web content is within reach.

This is getting long isn’t it

So that’s the starting point. But as you suggest, it is very important that there not only be a simple starting point, there has to be a simple way to extend the whole thing.

That’s the dream! Because the custom elements (<text-view> etc.) are just defined as Javascript, one could go beyond the default set of custom elements I’m describing to new compositions of those elements. People have all kinds of specific workflows that they need to get done, and if there’s a ton of work involved, it might be worth developing a custom interface to do it.

Just as a quick example, maybe you are starting with .eaf files, as opposed to the “objects and arrays” standard format I mentioned above (JSON, in fact). Well, then you write something called, maybe, <eaf-loader src="my-story.eaf"></eaf-loader>, and it loads the ELAN file, parses it into objects and arrays, and hands that to a run-of-the-mill <text-view>. Totally doable, and if you have 200 .eaf files, totally worth it!
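
As a rough sketch of the "parses it into objects and arrays" step, the function below pulls the time-aligned annotations out of an .eaf string. The function names and the output shape are hypothetical, and the regex approach is a deliberate simplification so the example runs anywhere: a real loader in the browser would more likely use `DOMParser`, and would also need to handle reference annotations, time slots without a `TIME_VALUE`, and XML entities, none of which are covered here.

```javascript
// Read the attributes of one opening tag into an object, e.g. {TIME_SLOT_ID: 'ts1'}.
function readAttributes(tag) {
  const attributes = {}
  for (const [, name, value] of tag.matchAll(/([\w:]+)="([^"]*)"/g)) {
    attributes[name] = value
  }
  return attributes
}

// Convert the time-aligned annotations of an .eaf string into plain objects.
function parseEaf(xml) {
  // Map each time slot id to its time in milliseconds.
  const times = {}
  for (const [tag] of xml.matchAll(/<TIME_SLOT\b[^>]*>/g)) {
    const { TIME_SLOT_ID, TIME_VALUE } = readAttributes(tag)
    times[TIME_SLOT_ID] = Number(TIME_VALUE)
  }

  const tiers = []
  // Self-closing (empty) tiers are skipped by requiring a non-"/" before ">".
  for (const [, tierTag, body] of xml.matchAll(/(<TIER\b[^>]*[^/]>)([\s\S]*?)<\/TIER>/g)) {
    const annotations = []
    const pattern = /(<ALIGNABLE_ANNOTATION\b[^>]*>)\s*<ANNOTATION_VALUE>([\s\S]*?)<\/ANNOTATION_VALUE>/g
    for (const [, tag, value] of body.matchAll(pattern)) {
      const { TIME_SLOT_REF1, TIME_SLOT_REF2 } = readAttributes(tag)
      annotations.push({ start: times[TIME_SLOT_REF1], end: times[TIME_SLOT_REF2], value })
    }
    tiers.push({ id: readAttributes(tierTag).TIER_ID, annotations })
  }
  return tiers
}

// A tiny hand-written sample in the ELAN format, for demonstration only.
const sample = `
<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
  </TIME_ORDER>
  <TIER LINGUISTIC_TYPE_REF="default-lt" TIER_ID="transcription">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>La hundo dormas.</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>`

const tiers = parseEaf(sample)
console.log(JSON.stringify(tiers, null, 2))
```

The result is ordinary data (an array of tiers, each with an array of annotations), which is the kind of thing a viewer component could be handed directly.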

So yeah, this project is what I have been working on since erm… good lord, 2008? It has gone through many, many revisions (the first one was in PHP!). Each revision I have tried to make it simpler and easier to explain.

So this isn’t really meant to go beyond our little treehouse, but the preliminary version of the code base is actually sitting on the same server as this site:

https://docling.net/book/docling

Here’s a text-view, for instance:

https://docling.net/book/docling/corpus/text-view/text-view.html

Here’s a more custom app:

https://docling.net/book/docling/components/prompt-record-translate/prompt-record-translate.html

This one allows you to record a time-aligned word list. The HTML for this is relatively simple:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>prompt-record-translate</title>
<style>
@import '../css/components/prompt-record-translate.css';
</style>
</head>
<body>
<header><h1>prompt-record-translate</h1></header>

<prompt-record-translate id=animals prompts="dog cat bird"></prompt-record-translate>

<script type=module src="./PromptRecordTranslate.js"></script>
</body>
</html>

The whole project is still a bit of a mess because I’m forever trying to make it more consistent.

Speaking of messes, you can see some example data under:

https://docling.net/book/data/languages/

That example text above is loading:

https://docling.net/book/data/languages/esperanto/corpus/proverbs-text.json

So those are examples of most of the kinds of pieces: data and apps.
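
For the data side, here is a small sketch of what "objects and arrays of objects" can look like for an interlinear text. The field names (`metadata`, `sentences`, `transcription`, `words`, and so on) are assumptions made for this example and are not necessarily the ones used in the JSON files linked above.

```javascript
// Hypothetical shape for an interlinear text: plain objects and arrays.
// All field names here are illustrative only.
const text = {
  metadata: { title: 'Example sentences', language: 'Esperanto' },
  sentences: [
    {
      transcription: 'La hundo dormas.',
      translation: 'The dog sleeps.',
      words: [
        { form: 'la', gloss: 'the' },
        { form: 'hundo', gloss: 'dog' },
        { form: 'dormas', gloss: 'sleep.PRS' }
      ]
    }
  ]
}

// Because it is plain data, it survives a round trip through JSON unchanged...
const roundTripped = JSON.parse(JSON.stringify(text))

// ...and ordinary array methods are enough to pull things out of it.
const forms = roundTripped.sentences.flatMap(s => s.words.map(w => w.form))
console.log(forms.join(' '))
```

Nothing in the structure is specific to Javascript: the same file loads as dictionaries and lists in Python.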

If you (or anyone here) would like a guided tour through the shambolic state of the project I would love to do that. Maybe at a coffee hour?

Thanks for reading this far!

rgriscom · December 20, 2020, 6:21pm

Thanks for this intro, Pat! How would this system relate to the archiving of data (e.g. at a language archive or other repository)? If, as we discussed with LingView, a repository API would allow you to access contents of (open access) collections, you could then view them using the same interface?

pathall · December 23, 2020, 6:15pm

Right now, we have situations like this, more or less:

somearchive.edu/
  some-language/
     story-01/
       story-01.wav
       story-01.eaf
    story-02/
    more stuff…

Obviously this is a grotesque simplification, but that’s the general pattern: archives are most often serving as “data repositories”. To actually use that data, you have to 1) download it and 2) open it up in ELAN.

IMHO, this kind of sucks, not because having an archive isn’t good (it’s great! persistence! preservation! findability!), but because the language isn’t really… speaking, to visitors to those sites. It’s like the voices are there, but they’re still bottled up, you know?

Yes, I think so. The approach that I’m working on isn’t profoundly different in terms of what’s stored on the server; there are just a few extra files stored in the library and in the “bundles”. So for instance, something like this:

somearchive.edu/
  js/ 👈 a directory of Javascript that can be shared across the site
    docling/ 👈 §1
      text-view/ 👈 stuff for rendering an interlinear text, for instance
      lexicon-view/ 👈 stuff for rendering a lexicon…
      other-stuff/ …and so on.
  some-language/
    story-01/
       story-01.wav
       story-01.json 👈 a different data storage format
       story-01.html 👈 this pulls in the Javascript library and makes the json interactive
    story-02/
       story-02.wav
       story-02.json 👈 different data
       story-02.html 👈 shared code used in story-01
    lexicon/
       lexicon.json 👈 one way to store a lexicon. or might also be built dynamically from story-*.json
       lexicon.html 👈 a simple interface for a lexicon

§1 My own library fits here, but the same idea would work with other implementations. LingView did something similar using a general Javascript library called React. Perhaps that is a better choice in the long run. (My own opinion on that is that React is a bit of a beast. I try to keep the abstractions as “linguistic” as possible by relying on Web Components. This is as they say a whole ’nother topic.)
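
The note in the tree above that a lexicon "might also be built dynamically from story-*.json" can be sketched in a few lines. This assumes each story has already been loaded as an object of the hypothetical shape `{sentences: [{words: [{form, gloss}]}]}`; the `buildLexicon` name and the `count` field are made up for this example.

```javascript
// Build a simple lexicon (one entry per distinct form/gloss pair) from several texts.
// The text shape is an assumption for this sketch, not a fixed format.
function buildLexicon(texts) {
  const entries = new Map()
  for (const text of texts) {
    for (const sentence of text.sentences) {
      for (const { form, gloss } of sentence.words) {
        const key = `${form}\t${gloss}`
        const entry = entries.get(key) ?? { form, gloss, count: 0 }
        entry.count += 1
        entries.set(key, entry)
      }
    }
  }
  // Sort alphabetically by form so the result reads like a word list.
  return [...entries.values()].sort((a, b) => a.form.localeCompare(b.form))
}

// Two tiny stand-ins for story-01.json and story-02.json.
const story01 = { sentences: [{ words: [{ form: 'hundo', gloss: 'dog' }, { form: 'dormas', gloss: 'sleep.PRS' }] }] }
const story02 = { sentences: [{ words: [{ form: 'kato', gloss: 'cat' }, { form: 'dormas', gloss: 'sleep.PRS' }] }] }

const lexicon = buildLexicon([story01, story02])
console.log(lexicon.map(e => `${e.form} '${e.gloss}' x${e.count}`).join('\n'))
```

A generated lexicon like this is only as good as the glossing in the texts, which is one reason a hand-maintained lexicon.json remains an option.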

In general, the reason I like this approach is that it’s providing a user interface to archive visitors without a complete overhaul of the current setup: there are still bundled directories of stuff, we just add 1) the Javascript library and 2) some .html files that slurp in the library and the data files, and we get a pretty useful default interface for them.

I’m not sure if I’m directly answering your question re accessing the contents of collections, but I hope this tiny HTML page at least seems interesting:

<!doctype html>
<title>A Story in Faux</title>
<script type=module src="/docling/index.js"></script>
<text-view src="story-01.json"></text-view>

And from that you would get an interlinear text with playback, for starters.

And we haven’t even talked about the editor side of this yet. We can also have things like story-editor.html that would provide an interface that creates a story-0N.json file, does things with time alignment, even recording media… There are many slices in the pie.

Thank you for your interest! I hope we can keep talking about this stuff, it’s really helpful to me to try to explain it.

pathall · December 23, 2020, 6:38pm

Oh also, a couple demos:

where the key bit is:

<text-view src='../../../data/languages/esperanto/corpus/proverbs-text.json'></text-view>
<script type=module>
import {TextView} from './TextView.js'
</script>

and

<lexicon-view src=/book/data/languages/esperanto/lexicon/small_esperanto-lexicon.json></lexicon-view>

<script type=module>
import {LexiconView} from './LexiconView.js'
</script>

(Actually the import methods here are a bit different and more specific than described above; there is more than one way to do it.)

rgriscom · December 27, 2020, 9:42am

Thanks, yes based on your description I could see some similarities between what you are aiming to accomplish and what the LingView folks are doing. There are also elements of it that remind me a bit of Language Depot, the online system with version control that allows for collaborative FLEx projects (based on Redmine apparently).

So, if I understand correctly, these are some of the possible options:

1. Language archives develop their own web-based interfaces for creating, editing, viewing data
2. Language archives develop APIs that allow for the use of an external interface to view and possibly also create and edit data
3. Language archives don’t develop APIs or interfaces and data curators must independently host the data elsewhere if they want a good web-based interface

#3 is where we are currently, but are you imagining a #1 or a #2 sort of situation? Or are we forever stuck with #3? And to accomplish either #1 or #2, what sort of buy-in is required from language archive managers, depositors, etc.? That is, if we assume the technological aspects can be sorted out, what are the social barriers to realizing an arrangement like #1 or #2?

One reason I’m asking these questions is because I’m pretty much sold on the idea that these things are possible, that they would bring benefits for both researcher and community member user groups, and that they would provide much needed added value to the language archives. I wonder then who the target audience is that needs to be convinced for this to actually happen, and what concerns they might have (e.g. financial, technological).

Maybe a backcasting approach could be useful here to identify the developmental steps that would need to take place in order to achieve the desired outcome.

rgriscom · December 29, 2020, 8:39am

I’ve been pondering over this a bit and have come up with two arguments in favor of #2 (as opposed to or in addition to #1):

reproducibility/replicability: If access to resources for research use is made through a single common method, then it is easier for other researchers to replicate each other’s research
scalability: If it were possible to submit a query across multiple files, datasets, or archives, it would be possible to easily analyze larger datasets or patterns across datasets. As it is currently, use of archival materials is restricted to small-scale work because of the manual labor required to access the data.
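
To illustrate the scalability point, here is a sketch of what a query across several collections could look like once the data is available in a shared shape. Everything here is an assumption: the corpora are small in-memory stand-ins rather than anything fetched from an archive, the `findByCategory` name is made up, and glosses are assumed to be dot-separated labels, which real corpora do not always follow.

```javascript
// Find every word whose gloss contains a given category label, across many texts.
// Glosses are assumed to be dot-separated (e.g. '1SG.PRO'); real corpora vary.
function findByCategory(corpora, category) {
  const hits = []
  for (const [corpusName, texts] of Object.entries(corpora)) {
    for (const text of texts) {
      for (const sentence of text.sentences) {
        for (const word of sentence.words) {
          if (word.gloss.split('.').includes(category)) {
            hits.push({ corpus: corpusName, form: word.form, gloss: word.gloss })
          }
        }
      }
    }
  }
  return hits
}

// Two hypothetical collections, each an array of texts.
const corpora = {
  mine: [{ sentences: [{ words: [{ form: 'mi', gloss: '1SG.PRO' }, { form: 'dormas', gloss: 'sleep.PRS' }] }] }],
  yours: [{ sentences: [{ words: [{ form: 'vi', gloss: '2.PRO' }, { form: 'kuras', gloss: 'run.PRS' }] }] }]
}

const pronouns = findByCategory(corpora, 'PRO')
console.log(pronouns.map(p => `${p.corpus}: ${p.form}`).join(', '))
```

The query itself is trivial; the hard part is the one discussed in this thread, namely getting the data out of the archives in a common format in the first place.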

It looks like ELAR may have an API via Preservica at some point in the near future:

rgriscom · January 3, 2021, 11:15am

Hugh Paterson shared this with me: the DLx project from Daniel Hieber - maybe of relevance? Looks like he is focused on making a standardized JSON schema for linguistic data.