Digital Infrastructure paper

This paper was recently accepted (to the Language Documentation & Revitalization section of Language). I’ve put it here in the general forum rather than in publications because it’s a topic that we probably all have opinions on and might want to discuss at some point!

Last calendar year, I (Amalia Skilton) revised a manuscript with Sophie Pierson (UT Austin, now at IXL Learning), Sunny Ananthanarayan (Washington), and Claire Bowern (Yale) on consequences of the broken digital infrastructure for language documentation.

This manuscript has been accepted, with major revisions, at Language, in the Language Documentation & Revitalization section.

The abstract reads:

Linguists need to collect, organize, analyze, and share data. Despite the variability of linguistic research in terms of inputs and outcomes, there are some common stages, including morphological parsing, lexicon creation, and textual editing. A common software tool for this work is FieldWorks Language Explorer (FLEx). In this article, we examine some of the ways that FLEx’s data structures conflict with contemporary practices in language work. These assumptions exclude key groups of users and erect barriers to theoretically and practically relevant research. Where workarounds are feasible, they are both fragile and costly. We use this example to start a broader conversation about software infrastructure for digital linguistic data analysis, and how problems with this infrastructure reflect larger issues in the discipline.

Here’s the link to the full (accepted) manuscript: https://blogs.ed.ac.uk/amaliaskilton/wp-content/uploads/sites/10135/2024/10/Digital-Infrastructure_AWR_Submitted.pdf


Thanks for sharing @cbowern!

I admit I skimmed/scrolled through a lot of details until this caught my eye in the Reimagining language tools section:

Perhaps the ultimate goal is not, then, a single tool that does everything that every person working on language needs. That is, the solution is not a single solution, but rather a set of common data formats and a way to render them, along with support and incentives to use them. […] Linguistic software should be both backwards- and forwards-compatible; modular tools must be able to pass data back and forth, through a common standard.

Having worked on custom parsers for both the Kaytetye and Warlpiri dictionaries (Toolbox-esque formats), I wonder what such a common standard, or set of standards, would actually look like.

Perhaps one way of incentivising a common standard would be a LangDoc gallery, similar to https://r-graph-gallery.com/, with a range of displayed examples, where each post is a minimal working example (MWE) that also spells out the expected input format (e.g. Most basic violin plot with ggplot2 – the R Graph Gallery), perhaps wrangled/tidied from the common standard.
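To make that a bit more concrete, here’s a minimal sketch (mine, not from the paper) of the kind of MWE such a gallery post might contain: it reads a single Toolbox/MDF-style entry (backslash markers, one per line) and turns it into a tidy record that other tools could consume. The marker set and the sample entry are placeholders only; real Kaytetye or Warlpiri entries use richer, project-specific marker inventories.

```python
# Sketch of a "gallery MWE": parse one Toolbox/MDF-style entry into a tidy record.
# The markers (\lx, \ps, \ge, \de) and the sample entry are placeholders only.
import json

SAMPLE_ENTRY = """\
\\lx headword
\\ps noun
\\ge gloss
\\de a longer definition of the headword
"""

def parse_entry(text):
    """Turn a backslash-marker entry into a {marker: [values]} dictionary."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("\\"):
            continue  # skip blank lines; a real parser would also handle continuation lines
        marker, _, value = line.partition(" ")
        record.setdefault(marker.lstrip("\\"), []).append(value.strip())
    return record

if __name__ == "__main__":
    # Expected input format: one entry, one "\marker value" pair per line.
    print(json.dumps(parse_entry(SAMPLE_ENTRY), ensure_ascii=False, indent=2))
```

A gallery post would then pair something like this with whatever it renders (an HTML dictionary entry, a plot of parts of speech, etc.), the way each R Graph Gallery page pairs a chart with the data shape it expects.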


Claire, thanks so much for sharing this—I’ve had this bookmarked for weeks and I’ve finally had the chance to sit down and read it with great interest.

Computer science researchers are simply not interested in developing software for curating small data sets like those used in language documentation. Some teams, including ours, have also experimented with hiring people who have undergraduate-level backgrounds in computer science, and deeper backgrounds in linguistics, to develop project-specific documentation software. While this has sometimes been successful, it is not a long-term solution for the field. The software is too project-specific, and because of funding streams, the developers have too little professional stability to maintain the products. A more feasible long-term solution is to integrate language documentation and revitalization content into degree programs in computational linguistics. This has the potential to create cohorts of computational linguists who both see language documentation data as relevant, and have the technical skills to create alternatives to FLEx.

This is a great summary of the issues surrounding who will actually put aside the hundreds of hours over many years to develop and maintain a FLEx alternative. I think the proposed solution (trying to get CL graduate students to do this work) is at once the best proposal I’ve heard so far and a tricky one, because it is difficult (though not impossible) to turn work on apps into publications that are deemed weighty.

This is certainly true for computer scientists, for whom it would invariably be seen as “mere app development” at the venues where they must publish in order to be competitive on academic job markets. For CL people in linguistics departments, who are far fewer in number and typically have less software development experience than their counterparts in CS departments, I think there is more flexibility in what counts as a credible research output, and there are venues that would be interested in platforming this work (certainly ComputEL and AmericasNLP, possibly ACL conferences depending on the exact nature of the work, and probably some language documentation journals such as LD&C). But developing software is still a lot of work, especially for someone who is not highly trained in software engineering, and pursuing a research program that leans heavily on app development is fraught with risk.

There is, of course, a deus ex machina which may do much to help this issue along in the coming couple of years: LLMs have gotten quite good at assisting with code development, and I think it’s credible that in some situations they can make a software developer roughly 10x more productive, or better.

Perhaps the ultimate goal is not, then, a single tool that does everything that every person working on language needs. That is, the solution is not a single solution, but rather a set of common data formats and a way to render them, along with support and incentives to use them. […] Linguistic software should be both backwards- and forwards-compatible; modular tools must be able to pass data back and forth, through a common standard.

Having worked on custom parsers for both the Kaytetye and Warlpiri dictionaries (Toolbox-esque formats), I wonder what such a common standard, or set of standards, would actually look like.

Perhaps one way of incentivising a common standard would be a LangDoc gallery, similar to https://r-graph-gallery.com/, with a range of displayed examples, where each post is a minimal working example (MWE) that also spells out the expected input format (e.g. Most basic violin plot with ggplot2 – the R Graph Gallery), perhaps wrangled/tidied from the common standard.

I think I might have shared my thoughts on this in this forum before, but my default assumption has been that a format usually can’t achieve widespread usage until it has a single app which can popularize it. Think of what Microsoft Word did for document formats, for example. A format might buck this trend if it succeeds so spectacularly on its technical merits that many app developers and users can be persuaded to invest in it, and I do think there are cases like this (such as HDF), but they seem rarer. And with a domain as complex as “every kind of linguistic analysis you might want to perform on every language”, it does seem hard to settle on something that is both structured enough to be helpful and sufficiently universal. I’d point to Salt and FoLiA as perhaps the most serious attempts at this that I’m aware of, but both have usability problems severe enough to rule their serialization formats out as contenders for a widely used interchange format among documentary linguists.


Thanks for all of these comments @lgessler. Lots of audiences for this paper have asked me what our solution is - usually the assumption is “hire one person to develop an alternative from scratch!”. So it’s refreshing to have an answer here that recognizes the complexity of the human, sociological, and technical factors.

When it comes to formats, do you think that - for example - agreeing on a common format for storing lexical data is more feasible than “shared formats for all data types ever”?


For what it’s worth, here’s some sort of “gallery” of what you can do with CLDF: cldf/cldfviz (https://github.com/cldf/cldfviz), a Python library providing tools to visualize data from CLDF datasets.
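And for anyone who hasn’t looked under the hood: a CLDF dataset is essentially a set of CSV tables plus a JSON metadata file describing them, so you can already get at the data with nothing but the standard library. Here’s a rough sketch, assuming a Wordlist-style dataset whose forms.csv uses the common default columns (ID, Language_ID, Parameter_ID, Form); in practice you’d check the dataset’s metadata file, or just use the pycldf package, which reads the metadata for you.

```python
# Sketch: count word forms per language in a CLDF Wordlist, standard library only.
# Column names below are common CLDF defaults; a real script should read them from
# the dataset's JSON metadata, or simply use the pycldf package instead.
import csv
from collections import Counter
from pathlib import Path

def forms_per_language(dataset_dir):
    """Tally rows in forms.csv by Language_ID."""
    counts = Counter()
    with open(Path(dataset_dir) / "forms.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            counts[row["Language_ID"]] += 1
    return counts

if __name__ == "__main__":
    # "path/to/cldf-dataset" is a placeholder for a local CLDF Wordlist directory.
    for language, n in forms_per_language("path/to/cldf-dataset").most_common():
        print(f"{language}\t{n}")
```

That plain-CSV-plus-metadata design seems relevant to the lexical-data question above: any tool that can read CSV can participate, and galleries like cldfviz then become demonstrations of what you can build on top of the shared format.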
