Claire, thanks so much for sharing this—I’ve had this bookmarked for weeks and I’ve finally had the chance to sit down and read it with great interest.
Computer science researchers are simply not interested in developing software for curating small data sets like those used in language documentation. Some teams, including ours, have also experimented with hiring people who have undergraduate-level backgrounds in computer science, and deeper backgrounds in linguistics, to develop project-specific documentation software. While this has sometimes been successful, it is not a long-term solution for the field. The software is too project-specific, and because of funding streams, the developers have too little professional stability to maintain the products. A more feasible long-term solution is to integrate language documentation and revitalization content into degree programs in computational linguistics. This has the potential to create cohorts of computational linguists who both see language documentation data as relevant, and have the technical skills to create alternatives to FLEx.
This is a great summary of the issues surrounding who will actually set aside the hundreds of hours over many years to develop and maintain a FLEx alternative. I think the proposed solution (trying to get CL graduate students to do this work) is at once the best proposal I’ve heard so far and a tricky one, because it is difficult (though not impossible) to turn work on apps into publications that are deemed weighty.
This is certainly true for computer scientists, for whom it would invariably be seen as “mere app development” at the venues where they must publish in order to be competitive on academic job markets. For CL people in linguistics departments, who are far fewer in number and typically have less software development experience than people in CS departments, I think there is more flexibility in what counts as credible research output, and there are certain venues that would be interested in platforming this work (certainly ComputEL and AmericasNLP, possibly ACL conferences depending on the exact nature of the work, and probably some language documentation journals such as LD&C). But developing software is still a lot of work, especially for someone who is not highly trained in software engineering, and pursuing a research program that leans heavily on app development is fraught with risk.
There is, of course, a deus ex machina that may do much to help with this over the next couple of years: LLMs have gotten quite good at assisting with code development, and I think it’s credible that in some situations they can make a software developer roughly 10x more productive, if not more.
Perhaps the ultimate goal is not, then, a single tool that does everything that every person working on language needs. That is, the solution is not a single solution, but rather a set of common data formats and a way to render them, along with support and incentives to use them. […] Linguistic software should be both backwards- and forwards-compatible; modular tools must be able to pass data back and forth, through a common standard.
Having worked on custom parsers for both the Kaytetye and Warlpiri dictionaries (both Toolbox-esque), I wonder what this common standard, or set of standards, would actually look like.
Perhaps one way to incentivise a common standard would be a LangDoc gallery along the lines of https://r-graph-gallery.com/, with a range of displayed examples where each post is a minimal working example (MWE) that also spells out the expected input format (e.g. “Most basic violin plot with ggplot2” on the R Graph Gallery), perhaps wrangled/tidied from the common standard.
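To make that slightly more concrete, here is the kind of thing I can picture, purely as a sketch: a backslash-coded (Toolbox/MDF-style) entry gets parsed into a plain, app-neutral record, and a gallery post then shows one small rendering step from that record. The markers, field names, and sample entry below are all made up for illustration, not a proposal for what the actual standard should contain.

```python
import json
import re
from html import escape

# Purely illustrative: a made-up backslash-coded entry in the style of
# Toolbox/MDF. The real marker inventory (and any real common format)
# would of course need to be agreed on by the field.
SAMPLE_ENTRY = """\\lx example-headword
\\ps n
\\ge example gloss
\\xv example sentence in the language
\\xe its free translation
"""

# Hypothetical mapping from markers to neutral field names in the
# imagined common format.
MARKER_MAP = {
    "lx": "headword",
    "ps": "part_of_speech",
    "ge": "gloss",
    "xv": "example",
    "xe": "example_translation",
}

def parse_entry(text: str) -> dict:
    """Parse one backslash-coded entry into a flat dict (repeated markers become lists)."""
    record: dict = {}
    for marker, value in re.findall(r"\\(\w+)\s+(.*)", text):
        field = MARKER_MAP.get(marker, marker)
        if field in record:
            prev = record[field]
            record[field] = (prev if isinstance(prev, list) else [prev]) + [value]
        else:
            record[field] = value
    return record

def render_entry(rec: dict) -> str:
    """One 'gallery post' step: render a common-format record as a minimal HTML entry."""
    return (
        "<div class='entry'>"
        f"<b>{escape(rec['headword'])}</b> "
        f"<i>{escape(rec['part_of_speech'])}</i> "
        f"{escape(rec['gloss'])}"
        f"<div class='ex'>{escape(rec['example'])} '{escape(rec['example_translation'])}'</div>"
        "</div>"
    )

record = parse_entry(SAMPLE_ENTRY)
print(json.dumps(record, ensure_ascii=False, indent=2))  # the interchange record itself
print(render_entry(record))                              # one possible rendering of it
```

Obviously the real thing would need to handle far more (multiple senses, media links, interlinear texts, provenance), but the appeal of something this plain is that each tool only has to read and write records in the shared shape, rather than understand any other tool’s internals.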
I think I might have shared my thoughts on this in this forum before, but my default assumption has been that a format usually can’t achieve widespread usage until it has a single app that popularizes it. Think of what Microsoft Word did for document formats, for example. A format might buck this trend if it succeeds so spectacularly on its technical merits that many app developers and users can be persuaded to invest in it, and I do think there are cases like this (such as HDF), but they seem rarer. And with a domain as complex as “every kind of linguistic analysis you might want to perform on every language”, it does seem hard to settle on something that is both structured enough to be helpful and sufficiently universal. I’d point to Salt and FoLiA as perhaps the most serious attempts at this that I’m aware of, but I think both have serious usability problems that rule their serialization formats out as contenders for widespread use as an interchange format among documentary linguists.