Flextext to Plaintext

Here is a quick script I made for a colleague to create plaintext data from flextext interlinearized texts. A common problem is that FLEx does not have a good way to export text data for use in publications/presentations, so this is a simple workaround with support for batch processing. The code is quite messy…

The script:

Example FLEx data:

Example output:


Neat!

Hey would you mind sharing the sample data, even if just a snippet of it? I think it could be interesting to think about this together.


Oh wait, I found a couple:

https://www.google.com/search?hl=en&q=ext%3Aflextext


This is great! Thank you. Does anyone have any experience manipulating .flextext data in R? If so, are there any similar scripts or packages out there?


Thanks! @Grant it looks like this might be an R package with support for converting flextext files: phonfieldwork: Linguistic Phonetic Fieldwork Tools version 0.0.11 from CRAN. I don’t think I’ve seen a comparable package for Python.

One outstanding challenge with the script is that there still isn’t a way to make the interlinear lines wrap naturally when pasted into a document that doesn’t have enough space on a single line, and as far as I know there is no way to automate that in Word/LibreOffice Writer/Google Docs. This is one of the reasons @pathall has been promoting the use of web technologies: they make it possible to dynamically wrap multiple lines of text based on the size of the window.

So, this script is really just a temporary (and admittedly flimsy) band-aid until we find a more robust solution!

I will probably give it at least one more update, as I want to learn how to use an XML parser instead of processing the file line by line, and there are still some errors with reduplication and compounds.


I think this would make your life soooo much easier with projects like this. My friend (and local Python guru) @BrenBarn suggests taking a look at lxml as well as elementtree, but either one should be a good start.
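For anyone curious what the elementtree route might look like, here is a minimal sketch; the XML fragment below is invented for illustration and only loosely imitates the flextext layout, so adjust the tag names to match your actual files:

```python
import xml.etree.ElementTree as ET

# Invented fragment that loosely imitates the flextext layout (not real FLEx output)
sample = """
<document>
  <interlinear-text>
    <paragraphs>
      <paragraph>
        <phrases>
          <phrase>
            <words>
              <word>
                <item type="txt" lang="qaa">inu-ka</item>
                <item type="gls" lang="en">dog-NOM</item>
              </word>
            </words>
          </phrase>
        </phrases>
      </paragraph>
    </paragraphs>
  </interlinear-text>
</document>
"""

root = ET.fromstring(sample)

# The parser hands back a tree, so there is no line-by-line bookkeeping:
# just walk the <word> elements wherever they are nested
for word in root.iter("word"):
    form = word.find("item[@type='txt']").text
    gloss = word.find("item[@type='gls']").text
    print(form, gloss)
```

lxml exposes the same ElementTree-style API plus full XPath, so a sketch like this ports over with little change.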

Incidentally (you probably already know this, but since we’re talking about it), the output of an XML parser is called a “DOM” (document object model). Learning to manipulate a DOM (golly, that sounds so… modern…) is a really useful skill that also applies to the HTML/browser world. In the browser, a DOM is built from the HTML page, and a primary job of JavaScript is to manipulate it, with each of the nodes (which correspond to HTML elements) available for processing.

Also, your post got me to thinking about how JSON fits into this conversion business, so I spun off a rather lengthy JSON in the Middle post that might be of interest.

I reworked the script with elementtree and added support for roots + compounds. Still some basic features missing (e.g. no support for infixes).

Working with elementtree definitely simplified the code quite a bit, although I also ran into the issue that the FLEx XML format uses tags with varying degrees of specificity. Some tags such as <paragraph>, <word>, or <morph> are quite straightforward, but the glosses, morpheme forms, and free translations all share the somewhat ambiguous tag <item>. This makes it necessary to do some additional work to distinguish elements that have the same tag but different parents and attributes.
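To make the <item> point concrete, here is how that disambiguation might look with elementtree; the fragment is invented for illustration, not real FLEx output:

```python
import xml.etree.ElementTree as ET

# Invented fragment: <item> appears both directly under <phrase> and nested
# inside <word>, distinguished only by context and attributes
phrase = ET.fromstring("""
<phrase>
  <item type="segnum" lang="en">1</item>
  <words>
    <word><item type="txt" lang="qaa">inu</item></word>
  </words>
  <item type="gls" lang="en">the dog</item>
</phrase>
""")

# findall("item") matches only direct children, so the <item> under <word>
# is excluded; this handles the "same tag, different parent" half
phrase_items = phrase.findall("item")

# the type attribute then separates the segment number from the free translation
free_translation = phrase.find("item[@type='gls']").text
print(len(phrase_items), free_translation)
```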


Not sure whether this is of particular interest but I have this old code for importing (a particular flavor of) flextext to JSON:

flextextToJSON.js

class FlexTextParser {
  constructor({doc}){
    // the rest of the class reads this.doc, so store it under that name
    this.doc = doc
  }

  get text(){
    return {
      metadata: {},
      sentences: this.sentences
    }
  }

  get sentences(){
    let phraseNodes = Array.from(this.doc.querySelectorAll('phrase'))
    let sentences = phraseNodes.map(phraseNode => this.phraseNodeToSentence(phraseNode))
    return sentences
  }

  phraseNodeToSentence(phraseNode){
    let children = Array.from(phraseNode.children)

    // guard with optional chaining: find() returns undefined when no child matches
    let orthographic = children.find(el => el.matches('item[type="gls"], item[type="punct"]'))?.textContent ?? ''
    let translation = children.find(el => el.matches('item[type="lit"]'))?.textContent ?? ''

    let words = Array.from(phraseNode.querySelectorAll('word'))
      .map(wordNode => this.wordNodeToWord(wordNode))
      .filter(word => word.form && word.gloss)
      
    let transcription = words.map(word => word.form).join(' ')
    let sentence = {
      orthographic,
      transcription,
      translation,
      words
    }

    return sentence
  }

  isValidMorphNode(morphNode){
    return morphNode.querySelector('item[type="gls"]') 
        && morphNode.querySelector('item[type="txt"]')
        && morphNode.getAttribute('type')
  }
  
  parseMorphemesNode(){
    // returns an array of Words
    return Array.from(this.doc.querySelectorAll('morph'))
      .map(morph => {
        let type = morph.getAttribute('type') // suffix, stem (prefix)  
        let form = morph.querySelector('item[type="txt"]').textContent
        let gloss = morph.querySelector('item[type="gls"]').textContent

        if(type == 'suffix'){ gloss = `-${gloss}`}
        if(type == 'prefix'){ gloss = `${gloss}-`}

        return {
          form,
          gloss
        } 
      })
  }

  morphNodeToMorpheme(morphNode){
    let type = morphNode.getAttribute('type') || null // suffix, stem (prefix)

    let form = morphNode.querySelector('item[type="txt"]').textContent || ""
    let gloss = morphNode.querySelector('item[type="gls"]').textContent || ""

    if(type == 'suffix' && gloss){ gloss = `-${gloss}` }
    if(type == 'prefix' && gloss){ gloss = `${gloss}-` }

    return {
      form,
      gloss
    }
  }

  wordNodeToWord(wordNode){
    let morphNodes = Array.from(wordNode.querySelectorAll('morphemes morph'))
    let morphemes = morphNodes
      .filter(morphNode => this.isValidMorphNode(morphNode))
      .map(morphNode => this.morphNodeToMorpheme(morphNode))

    let word = morphemes.reduce((word, morpheme) => {
      word.form += morpheme.form
      word.gloss += morpheme.gloss
      return word
    }, { form: "", gloss: ""}) 

    return word
  }

  toJSON(){
    return this.text
  }
}

// Here’s how you would use it on a flextext:

new FlexTextParser({doc: 

This is by no means great code… not sure if using a class makes sense, for instance. But whatevs, it solved my problem at the time.

But of course it doesn’t work unless the fields are just so. The problem is, as you mention, that the labels for things don’t seem to be consistent. For example, I just randomly found some flextext online and tried my parser on it… it failed, of course.

https://fiona.uni-hamburg.de/1860492f/silp1981stonyoldwomanflk.flextext

The morph type, for instance, is quite different from what I had been working with:

                  <morph type="suffix" guid="d7f713dd-e8cf-11d3-9764-00c04f186933">
                    <item type="txt" lang="qaa-x-aaa">-ɨ</item>
                    <item type="cf" lang="qaa-x-aaa">-ɨ</item>
                    <item type="gls" lang="ru">EP</item>
                    <item type="gls" lang="en">EP</item>
                    <item type="msa" lang="en">infl:ins</item>
                  </morph>

There are two nodes that match item[type="gls"]… so I guess they are to be distinguished by their lang attribute…

I dunno. It’s complicated. This is kind of what drives me bonkers about XML to be honest. It’s not that data in XML isn’t well structured, it is. It’s just that it isn’t very self-describing as far as programming languages are concerned. I recognize that we need to be able to parse XML moving forward so we can build on existing documentation, but when I have the choice, personally, I’d much rather deal with JSON in a loosey-goosey way. Once there is a data structure available (objects and arrays, basically), it’s easy enough to poke around and see what is actually there.
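To illustrate the poking-around point, here is a Python sketch using made-up data in the {metadata, sentences} shape the class above emits:

```python
import json

# Made-up data in the {metadata, sentences} shape the JS parser produces
text = json.loads("""
{"metadata": {},
 "sentences": [
   {"transcription": "inu-ka wau", "translation": "the dog barked",
    "words": [{"form": "inu-ka", "gloss": "dog-NOM"}]}
 ]}
""")

# Once it's plain objects and arrays, exploring is a one-liner at a time
print(list(text.keys()))
print(text["sentences"][0]["translation"])
```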

I may be mistaken, but it seems quite difficult to write a general FlexTextToJSON function.

Thanks for this Pat. This reminds me that a good next step for my script (and my programming in general) would be to incorporate functions :+1:

Yes, the problem with item[type="gls"] arises when there is more than one gloss, which is often the case for anyone working in a region with a lingua franca other than English. Simply further specifying lang="en" should suffice.
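In elementtree terms that lang narrowing is just a chained predicate; a sketch using an invented fragment modeled on the Hamburg example above:

```python
import xml.etree.ElementTree as ET

# Invented fragment modeled on the Hamburg example: two glosses, ru and en
morph = ET.fromstring("""
<morph type="suffix">
  <item type="txt" lang="qaa-x-aaa">-ɨ</item>
  <item type="gls" lang="ru">ЭП</item>
  <item type="gls" lang="en">EP</item>
</morph>
""")

# chaining the predicates narrows "a gloss" down to "the English gloss"
gloss = morph.find("item[@type='gls'][@lang='en']").text
print(gloss)
```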

It looks like your script parses free translations by matching item[type="lit"]. I don’t think I’ve seen this attribute value in flextext files before; perhaps it comes from an older flextext standard?

You are right that many features of FLEx are probably used only infrequently, and incorporating all of them into a conversion script would take some time. But it would be great if someone did it!

I’ll add another comment on this over in the JSON in the Middle thread.