đź“„ [Paper] LingView: A Web Interface for Viewing FLEx and ELAN Files

https://scholarspace.manoa.hawaii.edu/handle/10125/24916

This one just came out in LD&C:

Pride, Kalinda, Nicholas Tomlin & Scott AnderBois. 2020. LingView: A Web Interface for Viewing FLEx and ELAN Files. Language Documentation & Conservation 14. University of Hawaii Press. 87–107.

This article presents LingView (GitHub: BrownCLPS/LingView), a web interface for viewing FLEx and ELAN files, optionally time-synced with corresponding audio or video files. While FLEx and ELAN are useful tools for many linguists, the resulting annotated files are often inaccessible to the general public. Here, we describe a data pipeline for combining FLEx and ELAN files into a single JSON format which can be displayed on the web. While this software was originally built as part of the A’ingae Language Documentation Project to display a corpus of materials in A’ingae, the software was designed to be a flexible resource for a variety of different communities, researchers, and materials.

I got it running on my laptop, seems like a cool way to view ELAN or Flextext files interactively. Here’s a screenshot:

It has automatic scrolling and you can toggle which fields you want to view (I have it set to the default, which is to show everything). I haven’t finished reading the article yet, myself, but I skimmed it, and I was surprised to see a comment to the effect that the Flextext format (the native format for FLEx) is apparently getting support for time alignment information.


Nice! Is external linking to specific annotations possible?

Hmm good question, let me take a look…

Unless I’m missing something (which is probable if not certain :stuck_out_tongue: ), there doesn’t seem to be anything like a linkable id at the sentence level. Looks like they’re storing the timestamps as data- attributes, which of course can’t be targeted with a URL fragment. (It occurs to me I’m not making linkable sentence elements with ids in my own stuff either, d’oh! Issue to self…)

Thanks for checking! I sent an email to one of the authors, so maybe we can get some extra info about this. It is interesting that flextext will be getting support for time alignment.

I also saw a presentation at ComputEL last year about a recreated and simplified online version of ELAN, which you can fiddle around with here: https://lgessler.com/ewan/ It only plays files, no editing.

As soon as it is possible to connect these ideas to the language archives and other repositories, then we will have a real jump forward in terms of data citation and reuse!


Hey, that’s @lgessler’s project! All the cool kids are here. @rgriscom, do you do any Javascript in addition to Python? I’m all in on Javascript, myself. I talk about it too much. :grimacing:

As far as referencing particular points of an interlinear text, there’s also the interesting question of how to reference a subset of “lines” in a text. I think it’s quite common that people want to refer to a stretch of discourse, for instance. I have been musing about some sort of semi-standard notation to reference things like that, so you could imagine some-online-text-viewer.html#s=1-4 or some-online-text-viewer.html#s=1,3 or something (by analogy to the syntax for media fragments (spec, caniuse), where some-file.wav#t=1,10 can be used to point to seconds 1-10 of a media file).
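
Just to make that concrete, here’s a tiny sketch of what a viewer could do with such a fragment. To be clear, the #s= notation and the id="s1"-style sentence ids are purely hypothetical, the imagined convention rather than anything that exists:

// Sketch of the hypothetical #s=1-4 / #s=1,3 convention above.
// Assumes each sentence is rendered as an element with id "s1", "s2", etc.
function handleSentenceFragment() {
  const match = location.hash.match(/^#s=(\d+)(?:[-,](\d+))?$/);
  if (!match) return;
  const first = Number(match[1]);
  const last = Number(match[2] ?? match[1]);
  for (let n = first; n <= last; n++) {
    document.getElementById(`s${n}`)?.classList.add('cited');
  }
  // Scroll the first cited sentence into view.
  document.getElementById(`s${first}`)?.scrollIntoView();
}

window.addEventListener('hashchange', handleSentenceFragment);
handleSentenceFragment();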

Hmm, just looked up the Github repo for @lgessler’s project, looks like it’s no longer in development. Rendering all those tiers (with SVG!) is quite an accomplishment. The associated paper is worth a look as well:

Gessler, Luke. 2019. Developing without developers: choosing labor-saving tools for language documentation apps. Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages, Volume 1 (Papers), 6–13. Honolulu: Association for Computational Linguistics.


I’m not up on my Javascript as of yet (and honestly still working on my Python :sweat_smile:), but yes you are right that a citation in a publication could refer to multiple annotations or a stretch of a recording that starts in the middle of one annotation and ends in the middle of another. So perhaps we could imagine a standard for linking to a set of one or more annotations and a standard for linking to time segments?

The CORAAL search function is an example of an interface built with PHP and Javascript that allows for the creation of a simple URL with a specified starting time that can be used in a publication, such as this:
http://lingtools.uoregon.edu/coraal/explorer/browse.php?what=ATL_se0_ag1_m_01_1.txt&line=152&settime=190.17

Previously I thought that the search results allowed the user to go directly to the particular line of text by clicking on the filename on the left (you can see the “&line=152” in the URL), but that doesn’t appear to be working for me right now. But something as simple as this would be a good start. I think it will be quite some time before we see such a thing integrated into language archives, but a first step could be to try it out using data hosted at one of the larger general purpose data repositories, to demonstrate that it is possible. Conceivably, couldn’t one use the OSF API to pull data from a repository and display it in such an interface?


Thanks for the kind words about EWAN! Yeah, development is not ongoing since it was intended as a proof of concept of a way to approach software engineering for our domain.

LingView looks nice, it’s good that there’s been a lot of activity in the app space for making archival materials accessible for viewing. Some other ones that come to mind are Kwaras and Kratylos.

As an aside, at the same time, one wonders what could be achieved if all the effort put into these disparate apps could have been brought together under a single app. The obstacles to that are obviously social and organizational in part, but I think there’s also a technical barrier: even if you’re a very experienced programmer, there’s always some friction caused by differences in coding style, and more importantly, coordination would require negotiation of competing feature requests (what if Alice is fine with purely concatenative morphology but Bob works on Semitic?), which could require some hard thought to fully resolve. I don’t have any answers, but I think we should all be thinking about how to find ways to work together when we develop tech, since our labor is already so limited.


Thanks for these thoughts, Luke. How about this idea for a general purpose interface app?

The problem: The biggest issue I see on the data citation/preservation side is that all of these tools are based around the idea of independently hosted corpora/datasets, which isn’t a long-term solution. The goal behind the language archive/data repository is that we make a place to store our data for the long term, which is necessary both to prevent loss of data and to make data citations possible. General purpose repositories will never build their own interface for linguistic data because we are a small minority of the userbase. Language archives can build their own interfaces, but they don’t have the financial resources to host unlimited data, so a lot of research data will need to be hosted at general purpose repositories.

A potential solution: If someone made an interface like Kwaras/EWAN/Kratylos/Namuti that could be used to link to specific segments/annotations with audio files (i.e. .TextGrid/.eaf + .wav) on a general purpose repository like OSF, then anyone authoring a publication or other kind of resource based on OA data could link each text example to the original audio source. That would apply to most practitioners of language documentation, many of whom produce OA data, but also descriptive linguists and eventually theoretical linguists (who often base their claims on descriptive data). Language archives could then either develop their own interfaces based on this model or provide a means of linking their own data to the same interface.


Assuming some kind of software like you describe existed, it’d be great if the OSF or some other large organization with an engineering staff could pick it up, but I guess I’m unsure of whether they’d be willing to take on the effort of integrating it into their software and maintaining it, especially if it was as yet unclear whether the community it’s intended to serve would accept it. If any kind of (informal or formal) standardization on a tool for viewing and citing archival data happens in the language documentation community, I expect it’d be because a large archive like PARADISEC picked it up and it gained popularity.


Yes it’d be great if there was a single standard for everyone! I guess my technical question for those more in the know is: could you create and host this software independently of the service hosting the data (OSF/PARADISEC/etc.), by pulling the data via an API? If so, then the implementation of the software wouldn’t be directly dependent on buy-in from the archives/repositories.


This is such an interesting thread, thanks for your thoughts!

It really is strange how we’ve come to this juncture where we have archives full of excellent documentation, but that documentation is, more often than not, stored as .eaf files. It’s just not a presentation format. The thing is, ELAN isn’t really a presentation application, either. So we go to the archive, if we manage to log in, often we just download eafs and media files, and… then what? Load them in ELAN? To be quite frank, it’s weird.

Web component applications

I think we should focus our innovation on the client side, with “browser applications”. Not app-store style “apps,” but actual applications. The capabilities of browsers have blown up over the past few years. They’re cross-platform, cross-device, well-tested, accessible, extensible, open source, etc etc. Unlike Python or R, the browser has built-in support for advanced layout, graphics, typography, media, and recently even media processing.

What are some criteria for such applications?

  • They need to be simple. We need to bite off things we can chew.
  • They need to be constrained. A simple application that does just a few things, or even one thing, has a better chance of succeeding and persisting.
  • They need to be composable. We should think in terms of composing simple applications into more complex ones, rather than starting with a “kitchen sink”.
  • They produce and consume simple standardized data. The data model should be shared across applications, so that the same file can be used by many different tools.
  • They should have few dependencies (a handful of JS and CSS files, not massive libraries that have to be maintained).
  • They should be built using familiar terminology — really, what the heck IS a “linguistic type” in ELAN? The world may never know.
  • They should be as archivable as the data they produce and consume — in the web platform, it is possible to do everything with straight up text files (except for media, of course). If it’s no big deal when some .pfsx files creep into an archive alongside an .eaf file, then why should it be a big deal to include play-interlinear-text.html, play-interlinear-text.css, play-interlinear-text.js, as well as interlinear-text.xml or (better, IMHO) interlinear-text.json?

The road I’m trying to go down involves using a simple flavor of the Web Components standard, which enables us to create custom HTML elements. That means that there can be an “on-ramp” which consists only of markup. Imagine the aforementioned play-interlinear-text.html consisting of nothing but this:

<!doctype html>
<html>
  <head> 
    <title>Some text</title>
    <link rel="stylesheet" href="interlinear-text.css">
  </head>
  <body>
    <interlinear-text src="some-interlinear-text.json"></interlinear-text>
    <script src="interlinear-text.js"></script>
  </body>
</html>

Then you put these things into a folder:

some-text/
  play-interlinear-text.html
  play-interlinear-text.js
  play-interlinear-text.css
  some-interlinear-text.json
  some-interlinear-text.wav

A setup like this encourages innovation, I think, because it would be possible to design different applications that could do different things with the same some-interlinear-text.json.
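
For concreteness, interlinear-text.js could be little more than the following. This is a bare-bones sketch, and the JSON shape it assumes (a top-level sentences array with transcription/gloss/translation fields) is just for illustration:

// Bare-bones sketch of interlinear-text.js: fetch the JSON named in the src
// attribute and render one block per sentence. Real code would do much more
// (words, morphemes, media sync, error handling).
class InterlinearText extends HTMLElement {
  async connectedCallback() {
    const response = await fetch(this.getAttribute('src'));
    const text = await response.json();
    for (const sentence of text.sentences) {
      const div = document.createElement('div');
      div.className = 'sentence';
      div.textContent = [sentence.transcription, sentence.gloss, sentence.translation].join('\n');
      this.appendChild(div);
    }
  }
}

customElements.define('interlinear-text', InterlinearText);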

What about servers?

Now, that folder meets a lot of the criteria that Bird & Simons (2003) advocated. Putting that on the web involves nothing but a plain web server (an HTTP server). More capable servers are good for indexing large corpora, but running server code is expensive, difficult to replicate, and can result in data silos. Simple HTTP servers are commoditized; you can get a basic, free one running without too much trouble.

This kind of approach could conceivably be one answer to the kind of problems with archiving ongoing research that @inigmendoza was talking about.

Building such things isn’t that complicated. The (awesome) CORAAL archive that you (@rgriscom) brought to our attention has the granular linking problem we discussed, but a simple approach like the one above can address both granular linking and playback.

Try clicking the text of one of the lines in the demo below, it should play:

Of course, now I’m squeezing it into an <iframe> to force it to work in this discussion site! But a direct link works as well, I put up a page here:

https://docling.net/coraal/player.html#l_352

Anyway, that’s not a proper implementation (it’s not actually built with the web components I was talking about, yet), but the effect would be similar.
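
For the curious, the wiring behind it is roughly the following sketch; the .line class and the data-start attribute are made-up names for illustration, not copied from the actual page:

// Sketch of the player wiring: a #l_352-style fragment scrolls to that line,
// and clicking a line seeks the audio to its start time and plays.
// Assumes each line element has id "l_<n>" and a data-start attribute in seconds.
const audio = document.querySelector('audio');

document.querySelectorAll('.line').forEach(lineEl => {
  lineEl.addEventListener('click', () => {
    audio.currentTime = parseFloat(lineEl.dataset.start);
    audio.play();
  });
});

// Honor a fragment like #l_352 on page load.
const target = location.hash && document.getElementById(location.hash.slice(1));
if (target) target.scrollIntoView();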

Bird, S. & G. Simons. 2003. Seven dimensions of portability for language documentation and description. Language 79(3). 557–582.

could you create and host this software independently of the service hosting the data (OSF/PARADISEC/etc.), by pulling the data via an API

As long as the people hosting the data were willing to make it available, yes, there’d be no technical barrier to having the interface be somewhere else. Just like you say, the application could use an API provided by the data’s host.
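
As a sketch of what that could look like from the browser’s side (the URL is a placeholder rather than a real OSF or PARADISEC endpoint, the host would need to send CORS headers, and I’m assuming the same illustrative JSON shape as above):

// Sketch: pull an annotation file from a remote repository and hand it to the
// viewer. The URL is a placeholder; a real app would build it from whatever
// API the data's host exposes.
async function loadRemoteText(url) {
  const response = await fetch(url);
  if (!response.ok) throw new Error(`Could not fetch ${url}: ${response.status}`);
  return response.json();
}

loadRemoteText('https://example.org/api/v1/files/abc123/download')
  .then(text => console.log(`Loaded ${text.sentences.length} sentences`))
  .catch(console.error);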

The road I’m trying to go down involves using a simple flavor of the Web Components standard, which enables us to create custom HTML elements. That means that there can be an “on-ramp” which consists only of markup. Imagine the aforementioned play-interlinear-text.html consisting of nothing but this:

I agree web components seem attractive because they’re very easy for programmers of any skill level to pick up. A problem with web components, though, is that if you want to use them in an app with a state management library, it can be unclear how to make them work together (if, e.g., you’re using a framework like React and your state is stored on components or in a Redux store rather than directly in the HTML). I thought about this a while back and wrote some proof-of-concept React components that try to be accessible for the simple, markup-only use case while still being ready for use in a React app with sophisticated state management, and I was reasonably pleased with the results.

You can clone the repo and run the app if you want all the details, but the TLDR is that the component lets you write this directly in HTML:

<lx-token-list>
  <lx-token form="bonjour" gloss="hello"></lx-token>
  <lx-token form="le" gloss="the"></lx-token>
  <lx-token form="monde" gloss="world"></lx-token>
</lx-token-list>

(screenshot of the rendered token list)

lx-token-list and lx-token are backed under the hood by the React classes TokenList and Token, which can be used directly in React code. The Token component has two additional props which consumers can use to tell it what should happen whenever a token’s form or gloss is modified, e.g. in order to tell a state manager about the new value(s) for the token.

Below is an example that demonstrates this, where tokens’ state is being managed in a DemoDataSource component’s state.tokens property instead of living directly on the Token’s DOM elements. This has exactly the same result as above for the user, and it required no changes to the Token and TokenList components, but state is now being managed in a way that’s more compatible with React.

class DemoDataSource extends React.Component {
  constructor(props) {
    super(props);
    this.state = {
      tokens: [
        {form: 'bonjour', gloss: 'hello'},
        {form: 'le', gloss: 'the'},
        {form: 'monde', gloss: 'world'},
      ]
    }
  }

  render() {
    const modifyToken = (i, key, newVal) => {
      // Copy both the array and the modified token so existing state is never mutated.
      const newTokens = [...this.state.tokens];
      newTokens[i] = {...newTokens[i], [key]: newVal};
      this.setState({tokens: newTokens});
    }
    return (
      <TokenList>
        {this.state.tokens.map((token, i) => (
          <Token
            key={i}
            form={token.form}
            gloss={token.gloss}
            onFormChange={(newForm) => {
              modifyToken(i, 'form', newForm);
            }}
            onGlossChange={(newGloss) => {
              modifyToken(i, 'gloss', newGloss);
            }}
          >
          </Token>
        ))}
      </TokenList>
    )
  }
}

(edit: that was a rather long-winded way of saying I’ve re-discovered the state hoisting React pattern and found that it makes it easy to package a React component in a web component.)

That’s all to say, if anyone’s going to develop React components for common varieties of linguistic data, they should look into how much work it’d take to package their components as web components, because it might not be too much work :slight_smile:
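
To give a flavor of what that packaging amounts to, here’s a sketch of roughly what such a wrapper can look like (not the actual code in my repo): it uses the pre-React-18 ReactDOM.render API, the import path is hypothetical, it assumes the element’s children are already parsed when it connects, and a real wrapper would also respond to attribute changes.

// Sketch: expose the React TokenList/Token components as a custom element so
// they can be used from plain HTML. Child <lx-token> elements are read once
// when the element is attached, then React renders in their place.
import React from 'react';
import ReactDOM from 'react-dom';
import { TokenList, Token } from './components'; // hypothetical path

class LxTokenList extends HTMLElement {
  connectedCallback() {
    const tokens = Array.from(this.querySelectorAll('lx-token')).map(el => ({
      form: el.getAttribute('form'),
      gloss: el.getAttribute('gloss'),
    }));
    ReactDOM.render(
      <TokenList>
        {tokens.map((t, i) => <Token key={i} form={t.form} gloss={t.gloss} />)}
      </TokenList>,
      this
    );
  }
}

customElements.define('lx-token-list', LxTokenList);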


Thanks guys for this great discussion. It sounds like everything we’ve discussed is actually possible, we just need to make it :sweat_smile: What would be the next step in developing a prototype interface? I’d be happy to help in whatever way I can. I have a public repository on OSF with some EAF and TextGrid files if anyone would like to try out using the OSF API, for example: Omaiyo Language Resources

Also, I heard back from Kalinda Pride and she said that they will try to incorporate a (hyper)linking feature for individual annotations into LingView in the future, so we have that to look forward to, as well.

So, I guess what we’re imagining is a web component that can take an .eaf file and render it as view-only HTML, right? That’d be a fun project, and it’s small enough to be easily attainable, I think. My summer time is pretty flexible so maybe you, @pathall, and I could chat sometime about making this happen.
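
In case it helps to see how small the parsing half could be, here’s a rough sketch of reading an .eaf file in the browser. It only handles time-aligned ALIGNABLE_ANNOTATIONs and ignores REF_ANNOTATIONs, controlled vocabularies, and so on, so treat it as a starting point rather than a faithful EAF reader:

// Rough sketch: parse an .eaf file into { tierId: [{ start, end, value }] }.
// Only time-aligned (ALIGNABLE_ANNOTATION) annotations are handled here.
async function parseEaf(url) {
  const source = await (await fetch(url)).text();
  const xml = new DOMParser().parseFromString(source, 'text/xml');

  // Map time slot ids to millisecond values.
  const slots = {};
  for (const slot of xml.querySelectorAll('TIME_SLOT')) {
    slots[slot.getAttribute('TIME_SLOT_ID')] = Number(slot.getAttribute('TIME_VALUE'));
  }

  const tiers = {};
  for (const tier of xml.querySelectorAll('TIER')) {
    const annotations = [];
    for (const ann of tier.querySelectorAll('ALIGNABLE_ANNOTATION')) {
      annotations.push({
        start: slots[ann.getAttribute('TIME_SLOT_REF1')],
        end: slots[ann.getAttribute('TIME_SLOT_REF2')],
        value: ann.querySelector('ANNOTATION_VALUE')?.textContent ?? '',
      });
    }
    tiers[tier.getAttribute('TIER_ID')] = annotations;
  }
  return tiers;
}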

edit: looks like LingView uses React, so packaging its components into web components might be as simple as modifying their code


Great thread, you three!
I admit, I’ve been following in the background, mainly because the skill set required to be active here is pretty far beyond what I currently know.
With that said, I’d be very interested in providing a support role if something like that is needed. I currently curate one pretty large archive deposit with ELAR, and am very interested in making this more accessible.

@lgessler Cool, so if the LingView team is already working on incorporating this idea and they are already using React, then would it be easier to simply support them in developing these features for the time being?

Some additional thinking out loud:
As I understand it, the (very) long-term goal would be the integration of such an interface into the archives/repositories themselves, because it would then be possible to have an in-text PDF hyperlink that has more or less the same level of persistence/long-term preservation as the data itself. It is at that stage that in-text linking of cited data could really take off.

For the purpose of developing the idea, though, the interface can be hosted independently and linked to a repository with an API. Any links to such an independent interface will only function as long as that interface is still up and running, but it can be used to demonstrate to archives and the linguistics community that such a thing is possible and should be integrated into our data citation standards. I wonder if it would be helpful to contact those who worked on the Austin Principles for linguistics data citation, to get some buy-in from those who are already invested in such things?

I imagine that such a feature would be of use for community-oriented resources, too. A dictionary with examples from natural speech comes to mind.

@Andrew_Harvey Yes it would be great if we could get someone at ELAR to work on this once the proof-of-concept is complete. The archives have an incentive to develop such an interface, because it has the potential to significantly increase the number of interactions with the data they are hosting.

I think @laureng was involved in the creation of the Austin Principles, and might have something to say here (or know someone who does).
I agree that you’d probably want a proof-of-concept before approaching the ELAR, but, like @rgriscom says, I think they’d be very receptive to something like this.

Like you, @Andrew_Harvey, I’m enjoying this thread without any technical expertise to really have input. Having archived with both PARADISEC and ELAR, I can say that generating data citations at the granularity of the sentence is still mostly the onus of the researcher; I still create them manually for each example.


Thanks everyone for contributing here. I wrote some somewhat-related general thoughts over here.

I have been thinking so much about all the recent threads here that I have kind of frozen up, too much to say! I’m going to just go ahead and break that seal and try to pop in some small-scale thoughts…

Also, I really hope we can get lots of separate topics! I would hate to see a great idea for an app (for instance) fade away without discussion just because it’s deep in another interesting discussion. I hope everyone feels free to start topics! Site tip: there is a “reply as linked topic” option that you can get to by clicking the arrow in the top left of the editor. Then you can peel something off into a standalone-thread while still (automatically) leaving a pointer at the first reference. I’ll be trying that a lot more myself, let’s see if it’s useful!

There are so many ideas here, all of them are worthy of more discussion, I’m just going to make a little laundry list here:

  • @lgessler’s example of a React application
  • Whether web components are a good path
  • @rgriscom’s mention of the OSF — after looking a bit more, I think this definitely warrants a topic of its own, and could be a reasonable archival target for a situation like @inigmendoza’s.
  • Data citation has come up repeatedly.
  • The Austin Principles — I wasn’t aware of these, myself.
  • The “who’s technical?” question — again, something worth exploring (and expanding!) as a group.

I would like to foreground the question of data, because I think data questions inform application design and implementation questions.

We need to standardize data formats that, at the very least, can handle granular and resolvable references to:

  • Time-aligned interlinear texts — Texts containing something like “sentences” (or “lines” or “utterances” or whatever you want to call them) with morphologically analyzed and glossed words (potentially with their own timestamps), as well as optional additional labels such as language, speaker, etc.
  • Grammatical categories — This one doesn’t get enough attention, I think. You and I “just know” that the abbreviation NOM can mean not only “nominative”, but also nominative case. In other words, a lot of the terms with which we are so familiar come in “category/value” pairs. Actually encoding such facts in one place in the documentary database is really important, and enables all kinds of cool interactions with our documentation.
  • Lexical materials — The words. Anything that we don’t want to repeat for every token should be recoverable from a lexical entry. And those entries can be as baroque as the linguists involved require, as long as there is an agreed-upon way to identify the entry that each word points to. I think this can be done without opaque identifiers (“Universally unique identifiers” or UUIDs, to use the lingo), instead using forms and glosses as a sort of “compound” identifier. I would love to talk more about this kind of thing here.

I would go so far as to suggest that the data format is more important than any one particular application, no matter how powerful that application may be. After all, if we have a fairly standardized data format, then we can imagine lots of applications pipelined together for various and sundry reasons, passing data in the standard format along like a hot potato.

The design of such a format does not need to be an enormous undertaking. The very fact that everyone agrees that FLEx and ELAN are in a dysfunctional relationship is itself evidence that our community already has an idea of what data they want and need to be able to manipulate.

My personal preference for storing such data is JSON, not XML, since it’s so easy for humans to read and understand, and so trivial to parse in pretty much every programming language. It would also be reasonable to write importers and exporters between JSON and existing formats such as .eaf or flextext, where useful.
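
Purely as an illustration (not a proposal!), a single sentence in such a format might look something like this, reusing @lgessler’s bonjour-le-monde tokens and spelling out category/value pairs explicitly:

{
  "sentences": [
    {
      "speaker": "A",
      "startTime": 1.0,
      "endTime": 2.5,
      "words": [
        { "form": "bonjour", "gloss": "hello" },
        { "form": "le", "gloss": "the",
          "categories": { "definiteness": "definite", "gender": "masculine", "number": "singular" } },
        { "form": "monde", "gloss": "world" }
      ],
      "translation": "hello world"
    }
  ]
}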

So… this still feels like a rather rambling post but I’m hoping if we all ramble together we can start to narrow in on actionable collaborations! :tiger:

You’ve raised some really important points here, Pat. I don’t have much to add since I agree with you on pretty much everything (especially on using something in the JSON family and not XML for a next-gen format). I just want to raise one question, which is how a new format would get adopted. EAF and FLEx XML have serious momentum behind them, as they’re used not only by these apps but also archives and other entities. That of course shouldn’t at all stop us from thinking about ways in which they could be better, but it should make us wonder how a new format, once introduced, could possibly gain “market” share. (FWIW, my bet’s that it would take nothing less than a new app displacing ELAN and FLEx in popularity and staying there for several years.)

  • Grammatical categories — This one doesn’t get enough attention, I think. You and I “just know” that the abbreviation NOM can mean not only “nominative”, but also nominative case. In other words, a lot of the terms with which we are so familiar come in “category/value” pairs. Actually encoding such facts in one place in the documentary database is really important, and enables all kinds of cool interactions with our documentation.

I wonder if this is too hard to do. Like you say, glosses pull from a shared vocabulary, and this gives the impression that the analyses they are used in must be homogeneous, when in fact what someone meant by ERG or IPFV could vary quite a bit. An existing convention is for writers to give a manifest at the beginning of their works describing each tag’s meaning; perhaps a data format using glosses could use some kind of global identifier like a DOI to achieve something similar, though this would rely on a catalog of all glosses ever used (!!).