A JSON-ized version of the Leipzig Glossing Rules

pathall · July 5, 2021, 5:30pm

Apropos of not much, I made this little repo with the examples from the Leipzig Glossing Rules represented as JSON. It’s just a start, but I think it’s worth thinking about as a test case for representing glossed data.

Original:

https://www.eva.mpg.de/lingua/resources/glossing-rules.php

Here’s the repo, basically just one file for now! Maybe we could move it under our GitHub organization.

Any thoughts on this?

rgriscom · July 6, 2021, 4:40pm

Nice! I’m curious, would it be feasible to have morpheme-level form-gloss pairs rather than word-level pairs? What might you do for infixes, non-concatenative morphology, etc.?

It raises this larger question of how useful we want our published example data to be for secondary research purposes. Do we want to go the extra mile so that someone can more easily repurpose our data? Of course, if our publications were simply extensions of our databases, then it might not require any extra steps.

Also, it would be great to see this rendered with HTML/CSS.

mcswell · July 6, 2021, 7:00pm

Rule 2 of the Leipzig rules covers morpheme-level glossing, and rule 9 covers infixing. Various rules talk about different kinds of non-concatenative morphology, e.g. rule 4D is for stem ablaut as a signal of a grammatical property.

pathall · July 6, 2021, 8:52pm

I’m going to move this under our org, should have done that in the first place. I went ahead and changed it:

So the URL has changed. Happily, it seems to redirect automatically.

Anyway, morphemes:

Indeed, that is a desideratum for sure. One of the reasons I wanted to get this into JSON was to encourage conversation on this very topic — how should the data model cover morphemes?

At the simplest level, each word could have a morphemes array. So if we consider a single Turkish word from example 6:

{
  "form": "çık-mak",
  "gloss": "come.out-INF"
}

…then the two morphemes would also be reasonably modeled as something like:

{
  "form": "çık-mak",
  "gloss": "come.out-INF",
  "morphemes": [
    { "form": "çık", "gloss": "come.out" },
    { "form": "mak", "gloss": "INF" }
  ]
}

This is somewhat redundant, since the information is already “there” in the form/gloss values of the original word, but the latter representation is more explicit, and it’s easy to imagine scenarios where you want to annotate an individual morpheme for some reason.
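Since the morphemes array is derivable from the word in the simple case, here is a minimal Python sketch of that derivation. The function name `add_morphemes` is made up for illustration, and it assumes `-` is the only boundary marker and that form and gloss contain the same number of hyphens (as LGR rule 2 requires).

```python
def add_morphemes(word):
    """Return a copy of a word object with a "morphemes" array added.

    Assumption: "-" is the only boundary marker in both strings.
    """
    forms = word["form"].split("-")
    glosses = word["gloss"].split("-")
    if len(forms) != len(glosses):
        # LGR rule 2: form and gloss must have the same number of hyphens.
        raise ValueError(f"form and gloss do not align: {word!r}")
    morphemes = [{"form": f, "gloss": g} for f, g in zip(forms, glosses)]
    # Keep the original keys and add the explicit morpheme list.
    return {**word, "morphemes": morphemes}


word = add_morphemes({"form": "çık-mak", "gloss": "come.out-INF"})
print(word["morphemes"])
```

Anything beyond plain hyphenation (clitics, infixes, reduplication) would need more than a `split`, which is where the modeling questions start.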

Things get more complicated quickly, as you point out. Even just using = to indicate a clitic can be ambiguous — which of the two morphemes is the clitic and which is something else? This is one possibility, I guess?

{
  "form": "palasi=lu",
  "gloss": "priest=and",
  "morphemes": [
    {
      "form": "palasi",
      "gloss": "priest"
    },
    {
      "form": "lu",
      "gloss": "and",
      "type": "clitic"
    }
  ]
}

Another, maybe more general, approach would be to add some sort of classifier

Great questions. Maybe we could make GitHub issues for each of the rules and try to develop a morphemes array for every word in all the examples?

What do you think of these two principles as criteria for a complete data model of Leipzig notation?

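1. Given an array of morpheme objects, it should be possible to generate the form and gloss strings correctly.
2. Vice versa: given the form and gloss strings, it should be possible to generate the array of morpheme objects.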

But this assumes that Leipzig is unambiguous and, as with the clitic example, I’m not sure that’s the case.
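To make the two criteria concrete, here is a rough Python sketch of a round trip. Everything in it is an assumption for illustration: the `sep` key (the separator that precedes a morpheme), the function names, and the restriction to `-` and `=` only.

```python
import re

SEPARATORS = "-="  # affix and clitic boundaries; other LGR symbols are ignored here


def parse_word(form, gloss):
    """Split aligned form and gloss strings into morpheme objects.

    Each object records the separator that precedes it ("" for the first).
    """
    pattern = f"([{re.escape(SEPARATORS)}])"
    # With a capturing group, re.split keeps the separators at odd indices.
    form_parts = re.split(pattern, form)
    gloss_parts = re.split(pattern, gloss)
    if form_parts[1::2] != gloss_parts[1::2]:
        raise ValueError("form and gloss use different separators")
    seps = [""] + form_parts[1::2]
    return [
        {"sep": s, "form": f, "gloss": g}
        for s, f, g in zip(seps, form_parts[0::2], gloss_parts[0::2])
    ]


def join_morphemes(morphemes):
    """Rebuild the (form, gloss) strings from morpheme objects."""
    form = "".join(m["sep"] + m["form"] for m in morphemes)
    gloss = "".join(m["sep"] + m["gloss"] for m in morphemes)
    return form, gloss


morphemes = parse_word("palasi=lu", "priest=and")
# Principle 1: morphemes -> strings; principle 2: strings -> morphemes.
assert join_morphemes(morphemes) == ("palasi=lu", "priest=and")
assert parse_word(*join_morphemes(morphemes)) == morphemes
```

The round trip only works here because the objects store nothing beyond what the strings encode. A `"type": "clitic"` value could not be regenerated from `=` alone, since the symbol marks a clitic boundary without saying which side the clitic is on.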

rgriscom · July 7, 2021, 4:35pm

Sentence level
Word level
Morpheme level

I’m curious, if you can include both the word-level form-meaning pairs and the morpheme-level form-meaning pairs in the same data structure, how important is it that you be able to generate one from the other? There is already redundancy in including both the sentence-level form-meaning pair together with the word-level form-meaning pairs, for example.

My impression is that if you can include all three levels, then you really only need to decide on the linear ordering of the morphemes in the morpheme array and the set of morpheme types. In flextext files, for example, only the word and morpheme levels are used (but with sentence-level free translations). Here you can see an example sentence (“phrase”) with word-level and morpheme-level form-meaning pairs (“txt” is form, “gls” is meaning - usually towards the end of the nested section), and only free translation on the sentence/phrase level (“gls” way at the bottom).

      <paragraph guid="58453edc-7d09-493a-99e5-5ebb78f21dd4">
        <phrases>
          <phrase guid="8f8a405c-6e49-4297-a4d9-2e88e5a11c0b" begin-time-offset="13650" end-time-offset="14423" speaker="A" media-file="deec1ed7-b935-4616-bd67-77ed3bac737f">
            <item type="segnum" lang="en">1</item>
            <words>
              <word guid="a93d7506-95c4-4137-b013-a79a5d0c1fe4">
                <item type="txt" lang="hts">kwami</item>
                <morphemes>
                  <morph type="stem" guid="d7f713e8-e8cf-11d3-9764-00c04f186933">
                    <item type="txt" lang="hts">kw</item>
                    <item type="cf" lang="hts">kwa</item>
                    <item type="hn" lang="hts">1</item>
                    <item type="gls" lang="en">COND.AUX</item>
                    <item type="msa" lang="en">&lt;Not Sure&gt;</item>
                  </morph>
                  <morph type="suffix" guid="d7f713dd-e8cf-11d3-9764-00c04f186933">
                    <item type="txt" lang="hts">-ami</item>
                    <item type="cf" lang="hts">-ami</item>
                    <item type="gls" lang="en">3.M.PL.SBJ.POST</item>
                    <item type="msa" lang="en">Attaches to any category</item>
                  </morph>
                </morphemes>
                <item type="gls" lang="en">COND.AUX</item>
              </word>
              <word guid="f77c5797-6422-48ad-95f9-c9dee3f3efc6">
                <item type="txt" lang="hts">zzutchibisa</item>
                <morphemes>
                  <morph type="root" guid="d7f713e5-e8cf-11d3-9764-00c04f186933">
                    <item type="txt" lang="hts">zzutchi</item>
                    <item type="cf" lang="hts">zzutchi</item>
                    <item type="gls" lang="en">wind</item>
                    <item type="msa" lang="en">n</item>
                  </morph>
                  <morph type="suffix" guid="d7f713dd-e8cf-11d3-9764-00c04f186933">
                    <item type="txt" lang="hts">-bi</item>
                    <item type="cf" lang="hts">-bii</item>
                    <item type="gls" lang="en">M.PL</item>
                    <item type="msa" lang="en">Noun</item>
                  </morph>
                  <morph type="suffix" guid="d7f713dd-e8cf-11d3-9764-00c04f186933">
                    <item type="txt" lang="hts">-sa</item>
                    <item type="cf" lang="hts">-sa</item>
                    <item type="hn" lang="hts">1</item>
                    <item type="gls" lang="en">3.F.SG.POSS</item>
                    <item type="msa" lang="en">Attaches to any category</item>
                  </morph>
                </morphemes>
                <item type="gls" lang="en">wind</item>
                <item type="pos" lang="en">v</item>
              </word>
              <word guid="e42e728a-8e85-4700-a783-52fa9f91f99b">
                <item type="txt" lang="hts">sanzako</item>
                <morphemes>
                  <morph type="root" guid="d7f713e5-e8cf-11d3-9764-00c04f186933">
                    <item type="txt" lang="hts">sanza</item>
                    <item type="cf" lang="hts">sanza</item>
                    <item type="gls" lang="en">north</item>
                    <item type="msa" lang="en">n</item>
                  </morph>
                  <morph type="suffix" guid="d7f713dd-e8cf-11d3-9764-00c04f186933">
                    <item type="txt" lang="hts">-ko</item>
                    <item type="cf" lang="hts">-ko</item>
                    <item type="gls" lang="en">F.SG</item>
                    <item type="msa" lang="en">Noun</item>
                  </morph>
                </morphemes>
                <item type="gls" lang="en">north</item>
                <item type="pos" lang="en">n</item>
              </word>
              <word guid="c70f04d6-c71a-4eb3-9b2b-057da455b96d">
                <item type="txt" lang="hts">a</item>
                <morphemes>
                  <morph type="root" guid="d7f713e5-e8cf-11d3-9764-00c04f186933">
                    <item type="txt" lang="hts">a:</item>
                    <item type="cf" lang="hts">a:</item>
                    <item type="gls" lang="en">CONJ</item>
                    <item type="msa" lang="en">&lt;Not Sure&gt;</item>
                  </morph>
                </morphemes>
                <item type="gls" lang="en">CONJ</item>
              </word>
              <word guid="d4071459-ab34-4c38-9fe0-a1e96ff438d4">
                <item type="txt" lang="hts">ishoko</item>
                <morphemes>
                  <morph type="root" guid="d7f713e5-e8cf-11d3-9764-00c04f186933">
                    <item type="txt" lang="hts">isho</item>
                    <item type="cf" lang="hts">isho</item>
                    <item type="gls" lang="en">sun</item>
                    <item type="msa" lang="en">n</item>
                  </morph>
                  <morph type="suffix" guid="d7f713dd-e8cf-11d3-9764-00c04f186933">
                    <item type="txt" lang="hts">-ko</item>
                    <item type="cf" lang="hts">-ko</item>
                    <item type="gls" lang="en">F.SG</item>
                    <item type="msa" lang="en">Noun</item>
                  </morph>
                </morphemes>
              </word>
              <word guid="eb6716fb-cf27-41c3-927a-c6eac7e7df7f">
                <item type="txt" lang="hts">guguruwakee</item>
              </word>
            </words>
            <item type="gls" lang="en">when the north wind and the sun were disputing</item>
            <item type="gls" lang="sw"></item>
          </phrase>
        </phrases>
      </paragraph>
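As a rough illustration of how the flextext structure above could be mapped onto word and morpheme objects, here is a Python sketch using only the standard library. The output keys (`form`, `gloss`, `citation_form`, `type`, `translation`) are my own choices rather than part of any format, the reading of `cf` as a citation form is an assumption, and the XML is a trimmed copy of the first word of the example with the guid attributes dropped.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the first word from the flextext example above.
FLEXTEXT = """
<phrase>
  <words>
    <word>
      <item type="txt" lang="hts">kwami</item>
      <morphemes>
        <morph type="stem">
          <item type="txt" lang="hts">kw</item>
          <item type="cf" lang="hts">kwa</item>
          <item type="gls" lang="en">COND.AUX</item>
        </morph>
        <morph type="suffix">
          <item type="txt" lang="hts">-ami</item>
          <item type="cf" lang="hts">-ami</item>
          <item type="gls" lang="en">3.M.PL.SBJ.POST</item>
        </morph>
      </morphemes>
      <item type="gls" lang="en">COND.AUX</item>
    </word>
  </words>
  <item type="gls" lang="en">when the north wind and the sun were disputing</item>
</phrase>
"""


def item_text(element, item_type):
    """Text of the first direct child <item> of the given type, or None."""
    item = element.find(f"item[@type='{item_type}']")
    return item.text if item is not None else None


def phrase_to_dict(phrase):
    """Convert one <phrase> element into nested word/morpheme objects."""
    words = []
    for word in phrase.iter("word"):
        morphemes = [
            {
                "form": item_text(morph, "txt"),  # surface allomorph
                "citation_form": item_text(morph, "cf"),  # assumed meaning of "cf"
                "gloss": item_text(morph, "gls"),
                "type": morph.get("type"),
            }
            for morph in word.iter("morph")
        ]
        words.append(
            {
                "form": item_text(word, "txt"),
                "gloss": item_text(word, "gls"),
                "morphemes": morphemes,
            }
        )
    # Only a free translation exists at the phrase level.
    return {"translation": item_text(phrase, "gls"), "words": words}


phrase = phrase_to_dict(ET.fromstring(FLEXTEXT.strip()))
print(phrase["words"][0]["morphemes"])
```

Note that the stem comes out with `form` "kw" but `citation_form` "kwa", so the morpheme forms already differ depending on which item you read.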

There is also the question of whether you want the “underlying form” to be the value at the morpheme level, rather than the context-dependent allomorph that appears in the example. If so, then you wouldn’t expect to be able to generate the morpheme level from the word level. Perhaps this depends on whether you are creating a system for representing text examples from publications or one designed to represent text examples from a database. Ideally, though, these would be the same system, right?

xrotwng · March 29, 2022, 1:26pm

I guess I’m a bit late to this, but I made a little repo with the examples from the LGR, too: cldf-datasets/lgr/, as a CLDF dataset. This has the advantage of standard relations to sources and languages.

I did this to experiment with two projects of mine:

cldfviz.text: So README.md is rendered from README_tmpl.md by inserting data from the CLDF dataset.
pyigt: A Python library to parse LGR-conformant IGT.

xrotwng · March 29, 2022, 1:26pm

The dataset is at GitHub - cldf-datasets/lgr: The Leipzig Glossing Rules - but the forum limits me to 2 links per post for now …

xrotwng · March 29, 2022, 1:33pm

Regarding the data model for IGTs, pyigt doesn’t try to infer anything, but keeps all the info needed to do so later on, so:

>>> from pyigt import IGT
>>> igt = IGT(phrase="a-b=c", gloss="A-B=C")
>>> igt.glossed_words[0].glossed_morphemes[0].sep
>>> igt.glossed_words[0].glossed_morphemes[1].sep
'-'
>>> igt.glossed_words[0].glossed_morphemes[2].sep
'='

pyigt does try to remove a bit of ambiguity from LGR IGTs, though, by distinguishing prosodic_words and morphosyntactic_words - thereby making sense of some of the more exotic features of LGR.