📄 Notes on some earlier attempts to digitize interlinear texts (via Maeda & Bird 2000)

Maeda, Kazuaki & Steven Bird. 2000. A Formal Framework For Interlinear Text. https://arxiv.org/pdf/cs/0010033.pdf (23 June, 2021). (Updated link.)

This paper is quite old now (2000), but I think it’s useful for several reasons: for one, it has a convenient inventory of then-existing formats for capturing interlinear text.

“we survey a range of forms of interlinear text used in various software packages and data models”

These are:

  • Printed Texts and Text Formatting Tools: This refers to libraries for formatting interlinears, essentially TeX and LaTeX macros. (SIL had a candidate in this space, which seems to linger on only in the Wayback Machine.)
  • PTEXT – Parsed Text: This is an interesting one. There was once SIL software called LinguaLinks, written by Gary Simons, which seems to have been pretty elaborate. You don’t hear about it much any more.
  • Shoebox format: At least, that’s what I’ve always called it. Apparently the real name was “SIL Standard Format”; of course, not even SIL thinks it’s a standard any more.
  • LACITO XML Format for Interlinear Text: This one is very straightforward:
<HEADER>
  <TITLE>Deux soeurs.</TITLE>
  <SOUNDFILE href="SOEURS.mp2"/>
</HEADER>
<BODY lang="hayu">
  <S id="s1">
    <AUDIO start="2.3656" end="7.9256"/>
    <TRANSCR>
      <W> <FORM>nakpu</FORM> <GLS>deux</GLS> </W>
      <W> <FORM>nonotso</FORM> <GLS>soeurs</GLS> </W>
      <W> <FORM>si?</FORM> <GLS>bois</GLS> </W>
      <W> <FORM>pa</FORM> <GLS>faire</GLS> </W>
      <W> <FORM>la?natshem</FORM> <GLS>allèrent(D)</GLS> </W>
      <W> <FORM>are</FORM> <GLS>dit.on</GLS> </W>
      <PONCT>.</PONCT>
    </TRANSCR>
    <TRADUC lang="Francais">On raconte que deux soeurs allèrent chercher du bois.</TRADUC>
    <TRADUC lang="Anglais">They say that two sisters went to get firewood.</TRADUC>
  </S>