A single piece of software for documentation?

tihomir · May 18, 2022, 12:36pm

Hi all, and thanks for accepting me to this forum.
I have been thinking a lot lately about whether the time is ripe for our community to start talking about having a single piece of software that has all, or most, of the functions we need in order to process the data we collect.
Currently, there are many possible workflows but all of them involve a number of applications. My understanding is that people mostly use an ELAN-FLEx workflow which comes with its drawbacks. Then there is for example lameta, among other tools, for metadata. As far as I am aware, there is no standard for the documentation of consent.
A single piece of software that does all these things (and perhaps others that I am forgetting) would make things a lot easier. It would also make training fieldworkers a lot easier and will make language documentation more accessible to a wider range of people interested in it.
At first glance, ELAN is perhaps the most sensible starting point since it already offers a lot of functionality when it comes to segmentation, translation, and transcription. Since recently, it also offers interlinearization functions. I haven’t had a chance to try the latter myself and would be happy to hear what others think of it.
How challenging would it be to have integrated functions where metadata and consent can be stored in each ELAN file in a pracitcal and systematic/standardized way?
Do you feel like the ELAN menu could also be made a bit more intuitive?
Would it perhaps be better to think about starting a new tool from scratch?
Needles to say, any such tool would have to be open source and would run on all most common platforms.
Thanks,
Tihomir

pathall · May 18, 2022, 2:22pm

Hi @tihomir, you’re in the right place Welcome!

This is a topic that is near and dear to the hearts of, I would say, most of the people here! The ELAN/Flex tango is problematic indeed, and it breaks up the way we work in counter-intuitive ways — after all, analysis at all linguistic levels takes place throughout the documentation (and description!) process, so having to completely switch software contexts to annotate at one level or another is pretty bonkers.

My personal opinion is that we need to take it from the top: we need to have answers to all these questions:

What do we mean by documentary data?
What are the workflows that documentary linguists and friends use throughout the documentary process?
What kinds of user interfaces do we need to help linguists carry out those workflows?

These are very broad questions, and designing solutions is a very big project. There are a lot of ongoing efforts in that direction, see:

Glam: a new linguistic annotation app (@lgessler’s own project!)
The Moro Database: A client-side documentation interface (A production of @HannahLS & colleagues)
Dative (Not sure who created this but @ejk is our local expert)
docling.js this is my own project, based on my dissertation research and not quite online yet.

So people are thinking about this stuff, and your thoughts are welcome.

I think the specific issue you raise about an interface for informed consent is a very good one. What kind of consent materials are you dealing with? Can you share how you visualize such an interface? One thing I try to do around here is to really encourage people to articulate the user interfaces they need as in as specific a way as possible. We have an ongoing discussion about the design of an interface for collecting botanical terminology (for instance) that you might find interesting here.

Personally, I thinking working up to a single piece of software that does “everything” is quite a long-term prospect. But that doesn’t mean we shouldn’t start thinking about designing new interfaces that handle some of the functionalities that we need.

Thanks for your thoughts!

tihomir · May 18, 2022, 2:41pm

Thanks, @pathall. These are all very useful thoughts.
Regarding consent, I usually record oral consent as an audio/video file, but I know other people use signed forms for example. I imagine that a designated space for consent documentation in whatever software is used for annotation, which can take a range of different file types, would do the job. This way it will be easier for archives to track whether informed consent was collected for each bundle of a collection.
Regarding metadata, when it is integrated into the annotation file itself (say the ELAN file), this will ensure that the metadata is part of the file, rather than a separate file, which can easily be lost in the process of transfers, renaming etc.

pathall · May 18, 2022, 9:45pm

I % agree with this. I don’t know whether there is an extensibility model built into the .eaf In my own project I am using JSON for all data, and I am using a convention to include metadata inside of data files that looks like this (this is an extract from an interlinear text):

{
  "metadata": {
    "title": "Education in Jaro",
    "language": "Hiligaynon",
    "source": "https://www.youtube.com/watch?v=cUqMWG4QJMk",
    "media": "education_in_jaro.webm",
    "fileName": "education_in_jaro-text.json",
    "lastModified": "2019-07-17T16:01:06.046Z",
    "notes": [
      "transcribed with Joshua De Leon as part of the 2014 Fieldmethods class at UCSB, instructor Marianne Mithun.",
      "original YouTube title ‘School Memories’"
    ],
    "speakers": [
      "Juan Lee"
    ],
    "linguists": [
      "Patrick Hall",
      "Joshua De Leon"
    ],
    "links": [
      {
        "type": "audio",
        "file": "education_in_jaro.wav"
      },
      {
        "type": "notes",
        "url": "http://localhost/Languages/hiligaynon/ucsb-fieldmethods/Notes/hil111_2013-02-25_JDL_PH_IlonggoBoyEducationInJaro.notes.txt"
      }
    ]
  },
  "sentences": [
    {
      "transcription": "Hello, akó si Juan Lee.",
      "translation": "Hello, I’m Juan Lee.",
      "words": [
        {
          "form": "hello",
          "gloss": "hello",
          "lang": "en"
        },
        {
          "form": "akó",
          "gloss": "1S.ABS"
        },
        {
          "form": "si",
          "gloss": "PERS"
        },
        {
          "form": "Juan",
          "gloss": "Juan",
          "tags": [
            "name"
          ]
        },
        {
          "form": "Lee",
          "gloss": "Lee",
          "tags": [
            "name"
          ]
        }
      ],
      "tags": [],
      "metadata": {
        "links": [
          {
            "type": "timestamp",
            "start": 8.58,
            "end": 9.65
          }
        ]
      },
      "note": ""
    },
    {
      "transcription": "matopic na akó subóng",
      "translation": "I’ll start the topic now",
      "words": [
        {
          "form": "ma-topic",
          "gloss": "IRR-topic",
          "tags": [
            "english"
          ],
          "metadata": {
            "wordClass": "verb"
          }
        },
        {
          "form": "na",
          "gloss": "already"
        },
        {
          "form": "akó",
          "gloss": "1S.ABS"
        },
        {
          "form": "subóng",
          "gloss": "now"
        }
      ],
      "tags": [],
      "metadata": {
        "links": [
          {
            "type": "timestamp",
            "start": 9.704,
            "end": 10.714
          }
        ]
      },
      "note": "As if he's going to start a new topic; Of ma-, J says: “Most of the time you use it before an action verb: the process of changing the topic is itself a verb: the process of changing the"
    },
    {
      "transcription": "parte sa mga eskwélahan.",
      "translation": "about the schools",
      "words": [
        {
          "form": "parte",
          "gloss": "part",
          "tags": [
            "spanish"
          ]
        },
        {
          "form": "sa",
          "gloss": "to"
        },
        {
          "form": "mga",
          "gloss": "PL"
        },
        {
          "form": "eskwélahan",
          "gloss": "school",
          "tags": [
            "spanish",
            "spanish:escuela"
          ]
        }
      ],
      "tags": [],
      "metadata": {
        "links": [
          {
            "type": "timestamp",
            "start": 10.753,
            "end": 11.818
          }
        ]
      },
      "note": ""
    }
  ]
}

The metadata I’m using here is pretty ad-hoc, but honestly I find it better to just not worry about consistency too much at first — after all, any metadata that’s going to be submitted to an archive is going to have to be modified anyway. As you say, the great advantage of metadata-in-data is that they can’t be separated.

I have seen (and participated in) a lot of group projects where everyone tries to keep a shared metadata spreadsheet up to date. It never works, because it’s a pain in the neck just keeping the data file names updated in such a spreadsheet, let alone other info like genre, speaker, etc etc.

It would be a much better path to write a little program to read all the data files, extract the metadata, and then generate a summary spreadsheet (or in some cases, an HTML presentation) from the compiled metadata.

lgessler · May 19, 2022, 10:19pm

Like Pat said, I think you’re right on about the benefits of a consolidated app. The chief obstacles for producing one, as I see it, are the following:

Incredibly varied requirements: no two documentation projects are alike. Consider how each person has their own descriptive framework, data requirements, metadata requirements, pet theoretical research interests, archival requirements, etc., and you’re left with a huge range of demands for an omnibus app, some of which may be mutually incompatible. (Things aren’t so bad with, say, inventory management software: at the very least, there are physical or virtual things, and you have some of them; linguists, on the other hand, can’t even agree on what a word is.)
Nobody wants to make it: building software is labor intensive, and evidently, nobody yet has been motivated and able to produce a truly successful omnibus app. Obviously the private sector has no interest since there is no potential for profit; academics in NLP are not rewarded for mere application software development, and so avoid it; academics in linguistics are more interested but often lack prior software development experience. A few exceptional orgs like SIL and MPI Psycholinguistics exist where full-time staff engineers produce apps, though these apps still (!) can’t expand to fill all needs.

Enough to turn you off of the idea right? Like Pat said, and I agree entirely, it’s quite a long-term prospect. Though I think there are feasible ways forward, or else I wouldn’t be working on Glam, and I’m also very excited by the work Pat and others have been doing in this area. It’ll certainly take more vision than any one person has here in order to arrive at a full solution.

Tangentially, this might be a good time for me to shamelessly plug my upcoming paper at ComputEL-5, where I argue that an omnibus app like the one we’re describing is the key missing ingredient that has prevented NLP assistance from becoming more prevalent in language documentation. I might highlight in particular a point that I don’t hear raised often: that creating and sustaining an app like this might necessitate a cultural shift in NLP/CL/linguistics that would result in greater reward for “infrastructural” projects like developing apps.

xrotwng · May 20, 2022, 6:04am

I’d go further and suggest that a cultural shift in academia at large (in line with what the Research Software Alliance works for) would be required. From my experience, creating successful large-scale software requires longterm perspectives for the people involved - which isn’t something a particular field is likely to achieve in isolation. (And large-scale software development isn’t guaranteed to be any good either so resistance against such a change can be reasonable, too.)

SarahRMoeller · May 26, 2022, 9:32pm

Despite the dreams of my youth I now find the call for “one software to rule them all” a strange thing. We don’t expect it in other areas. Why in linguistics? For example, I use Word to read and edit documents, Powerpoint to make slides, Excel to track data, another software to read and edit PDFs, and GitHub to store, back up, and share items. What’s more, I may also use the Google or Libre office versions of these tools. All my work in these various tools might contribute to the same project. Each tool has its primary purpose, along with its own strengths, and weaknesses.

There are two things I find annoying:

When someone sends something in a format that I can only use in one software tool!
Multiple tools designed for the same purpose with minimal difference in functionality. I DO wish there was one software to rule Messenger, Discord, Signal, WhatsApp, and a few others!

So with my minimal understanding of software design, these are my thoughts:

Expect multiple tools to be the norm.
Choose to only use or design tools whose outputs can be stored (e.g. SayMore/lameta can store ELAN files) read, and preferably edited in more than one tool. (“Interoperability” from the 7 dimensions of portability, anyone?)
If you are involved in designing a new software for the field, don’t (only) aim to solve the perceived problems of a still supported and popular software. Instead, identify a unique primary purpose to your software and do that well. For example, while I can conveniently create tables in Word, I can do a lot more with tables in Excel. More importantly, I can easily copy tables from one to the other.

pathall · May 27, 2022, 1:48am

Yeah, I personally think this is most likely, at least as a first step. Not only should we expect multiple tools, we should encourage the design of multiple tools. We should invite (interested) linguists into the process of application design, and and equip them them with the information they need to articulate in 1) the shape of data they are trying to produce and 2) what kind of workflows — ordered steps — they take in order to collect that data.

ELAN uses an XML format, EAF (ELAN annotation format) whose design is honestly rather bonkers. But it’s still XML, and XML is a fairly future-proof format that will surely be parseable in the long term. EAF files in archives are safe in the sense of preserved — there are a lot of ELAN parsers out there (one was mentioned today, to pick a random example).

But it’s my impression that most people around here wouldn’t look to EAF as the a primary storage format going forward. Personally I am fond of a JSON based approach, but this is a discussion we should have as a discipline. Definitely we need to come up with something standard, as well as clearly documented standard conversion pathways for other formats we use.

Couldn’t agree more. I am hopeful about an approach based on Web Components, where data is stored and persisted as JSON, and (many!) web-component-based user interfaces consume, modify, and output data in that format. But that’s just one possibility. I think that the development of other tools, built with other platforms, should be encouraged too. What we really need is a Cambrian Explosion of user interfaces for documentation.

We need to be strict about a defined (but extensible) core data model, and then we need to go hog-wild about designing and testing user interfaces for using that data.

xrotwng · May 27, 2022, 6:50am

I agree that EAF can only be one step towards a proper data format for linguistic data. It seems clear, that in its current form it is basically a serialization of the application-internal data model. Tellingly, EAF parser are either just shallow wrappers around an XML DOM, or come with disclaimers like (from the tool linked above)

Now, this parser is not universal (the parser knows the specific features of the tiers)

From my experience, it could still be smart to base a “proper linguistic data format” on a specialization of EAF. Because standards that start out with no tool support typically have a hard time being adopted.

So the next step after the “extracted internal data model” stage would be better specified, linguistically meaningful/explicit formats. Just like MDF vs. SFM, or CLDF vs. the clld web app data model.

Since I don’t really see one of these second-stage formats as clear winner, the next best thing might be something like pandoc for linguistic data formats. And just like with pandoc, there can be lossy conversions (and occasionally messy ones). So while user interfaces are important, there seems to be no way around linguists being knowledgeable (on a farily detailed level) about the formats they are working with to make informed decisions, assess quality of conversions, etc.

fauxneticien · May 27, 2022, 1:59pm

I was trying to find out how the ‘Recognizers’ in ELAN work and according to this presentation (not sure from when), apparently the under-the-hood implementation on the server side makes use of CLAM:

Introduction — CLAM 3.1.5 documentation
GitHub - proycon/clam: Quickly turn command-line applications into RESTful webservices with a web-application front-end. You provide a specification of your command line application, its input, output and parameters, and CLAM wraps around your application to form a fully fledged RESTful webservice., which seems to be actively maintained (see commits).

Does anyone know Maarten van Gompel? I can imagine there would be untold stories of data model design and unexpected edge cases they’ve come across…

lgessler · May 27, 2022, 4:30pm

It’s great to see all this excellent discussion! Let me try to explain the potential I see in centralization.

First of all, I me agree that trying to build a monolith is a daunting task and, on many particular ways to approach it, an impossible one. But as I think many would agree, building an app involves a lot of “boring” work that needs to be done and redone for each app: these are things like login and user systems, data synchronization systems, remote communication between frontend and backend, and database configuration. These components are technically challenging, and need to be redone for each app. These are consequential components, too: many of them can make or break the usability of an app, and they are mostly required for any of the “interesting” work.

But the fact that all apps require these common software components presents an irresistible opportunity to implement these once and for all, so that the functionality that might have otherwise spanned multiple separate apps might be brought into one, with additional synergistic benefits beyond just avoiding the re-implementation of difficult components: the fact that a single integrated database is handling all data needs, for instance, eliminates all the pain of having to ferry serializations from app to app.

Of course, this would require the core system to be totally capable of addressing all the demands which any given UI or workflow might place on it. Is this possible without an explosion of complexity and constant ongoing maintenance churn? This is a whole topic on its own, but cautiously, I’d say yes.

The biggest issue for linguistic apps, IMO, is the fact that everyone’s data model is so different. But there are industries where things are just as complicated, and yet there has been practical convergence on software monoliths. In healthcare, where I used to work, the standard architecture is that a core system managing patient records sits at the bottom, and subsystems for different medical specialties (anesthesia, cardiology, dermatology, …) sit on top. Consider also that every, say, cardiology department is going to want to do things slightly differently, and this seems like an impossible amount of complexity to tame in a monolith. But by being careful about what to include in the core system, and making as much functionality user-configurable as possible, it has managed to work.

Back to linguistic apps now, there are no two language documentation projects with the exact same data needs. But instead of baking a data model directly into an app, which in the limit might require a separate app for every language documentation project, could you make each project’s data model configurable given a small inventory of data model “atoms” with which to compose a project’s data schema? Linguists and computer scientists think the answer’s probably yes, and if we suppose it’s true, then the upshot is that you could devise a core system that could handle the most painful and boring parts of developing an app in a way that is also impervious to the vicissitudes of the unpredictable and incredibly diverse requirements of language documentation projects. For example, consider how you might model the data inside a text: instead of hard-coding the fact that texts have morphemes, glosses, and a free translation per-line, and leaving users to deal with the fact that they might want a different kind of data model (e.g. another morpheme-level line for lexical category), you could instead offer them the building blocks for a model, e.g. tokens, single-token spans, and line-level spans, and allow them to compose them in order to reach the exact data model they want.

With that in place, you could then implement apps on top of this core system focusing just on the “interesting” parts, namely the particular UIs which are appropriate for the app, the data modeling requirements and constraints, and any novel export/import formats which need to be supported. Moreover, what’s exciting is that all three of these non-core items are, IMO, ones that can be tackled and done very well without a great deal of software engineering experience, whereas the components that go into the core system I think are much more challenging. This means that the development of apps could be democratized—a linguist who knows JS but not the ins and outs of full-stack web application development could conceivably build a bespoke UI knowing only rudiments of web development.

It remains to be seen, of course, whether we could build such a core system. The biggest issue, I think, is whether any universal data model—i.e. set of data atoms, as I’ve called them—could truly serve everyone. But if you consider just the shape of the data that most language documentation projects have, I’m pretty optimistic it could work.

pathall · May 27, 2022, 8:04pm

I have one bazillion things to say about this but I’m supposed to be in vacation this weekend. Looking forward to talking more next week! I hope others continue to chime in.

SarahRMoeller · June 1, 2022, 8:26pm

That’s already a big ask. And probably will be for the next 10 years or so.

A universal data model with an ecosystem of various apps using it is an ideal. But, again to keep things in perspective, what is ideal and what practical are two different things. Just think how long it was after Toolbox was deprecated that events like CoLang were still found it valuable to regularly offer workshops alongside FLEx and ELAN. Both ideal and practical have to be planned for.

pathall · June 2, 2022, 1:00am

Well… I’m not sure that Python or R are bigger asks… it’s just that they have some institutional support in linguistics. Also, I think the very idea of building user interfaces is still quite out of town for a lot of linguists — it often evokes a kind of “you want to what now?” response.

Obviously I’m invested in JS and therefore biased, but I think building stuff in the web platform has a lot of advantages.

People really respond to making an interface do something. It’s fun.
Like Python and R, it’s a transferable skill.
There are adjacent skill sets (web design, basically — HTML and CSS) that also offer learning opportunities, and for some people learning those might be more interesting than JS.
The universality of the web is hard to beat.

As for ideal vs practical, I couldn’t agree more. We should not throw away ELAN or Flex or Toolbox out of hand, let alone our .eaf or .flextext files. We need lots of conversion pathways. That’s a non-trivial amount of work but it’s necessary.