The Moro Database: A client-side documentation interface

pathall · April 25, 2022, 3:59pm

I have been poking around in this site from Berkeley on the Moro language, it’s really cool:

https://linguistics.berkeley.edu/moro/

If anyone here knows the folks involved in this project, it would be great to hear more from them!

I’d like to try to explain some of the technical features of the construction of this site, and why they are worth thinking about for documentary linguists. Even if you don’t consider yourself a “technical person”, I hope you will keep reading anyway. Firstly, because every linguist is a technical person! And secondly, because understanding technology better is always empowering — if you have a good grasp on how the web works, you’ll be in a good position to make informed decisions on how to make maintainable, sustainable web documentation.

From the perspective of a visitor to the site, there doesn’t seem to be anything particularly unusual about the design of this site. The front page has some intro text, a picture of the region, and a list of the participants in the project. The navigation at the top links to a “texts” page, a “concordance” page, and a “search” page. Let’s look at each of these in turn.

Whirlwind tour

Texts

https://linguistics.berkeley.edu/moro/#/text

A nice simple catalog of texts here, titles and speakers.

Here’s what a single text looks like:

You can toggle glosses:

And there’s also an interesting option to read the story in “side-by-side” or “parallel” view:

Concordance

There’s a very nice searchable concordance interface which looks like this:

I’m not sure what the relationship between forms and glosses is for morphemes here; the last one in the screenshot above is glossed as a -ipfv, so those might be allomorphs or distinct morphemes. Clicking through (randomly) on -ia gives us a sidebar that looks like this (50 results, cool):

So the concordance is searchable: we can match glosses (labeled “English”) with a wildcard search like this:

There is a minor search issue…

> There is a lurking problem in this implementation: because glosses are matched as strings, it’s not possible to distinguish a gloss which happens to be a prefix of some other gloss. So for instance, a “Contains” search over the glosses can’t distinguish a search for the grammatical category abbreviation `ap` ‘antipassive’ from `appl` ‘applicative’, or indeed from strings which happen to be in the translated bits of glosses: it will match ‘grapple’, ‘wrap’, etc.
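The prefix problem described above can be illustrated with a minimal sketch. It assumes that words in a gloss line are separated by spaces and morphemes by hyphens; the function names and the three gloss lines are invented for illustration and are not taken from moroScript.jsx or the corpus.

```javascript
// Split a gloss line such as "clg-rtc-eat.rt-pfv" into individual gloss labels.
// Assumption: words are separated by spaces and morphemes by hyphens.
function glossLabels(glossLine) {
  return glossLine
    .toLowerCase()
    .split(/[\s-]+/)
    .filter((label) => label.length > 0);
}

// Substring matching: roughly what a plain "Contains" search does.
function containsMatch(glossLine, query) {
  return glossLine.toLowerCase().includes(query.toLowerCase());
}

// Label matching: the query must equal one whole gloss label.
function labelMatch(glossLine, query) {
  return glossLabels(glossLine).includes(query.toLowerCase());
}

// Hypothetical gloss lines, invented for illustration (not from the corpus).
const examples = [
  "clg-rtc-eat.rt-ap",
  "clg-rtc-eat.rt-appl",
  "clg-rtc-grapple.rt-pfv",
];

console.log(examples.filter((g) => containsMatch(g, "ap"))); // all three lines
console.log(examples.filter((g) => labelMatch(g, "ap"))); // only the first line
```

Matching whole labels trades one problem for another: a label fused with a period (like `eat.rt`) is treated as a single unit here, so a search for `rt` alone would find nothing unless periods were also treated as separators.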

Search

Finally we have a global search interface, which matches anywhere in any of the fields in a sentence. This is useful, for instance, if you want to search the orthographic tier or the gloss tier. Thus, the complex form spelled nǝyaborṯwe can be matched by directly querying nǝyaborṯwe:

But perhaps you are interested in a sequence of morphemes that happens to occur in that form, let’s say, nǝ-y- ‘comp2-cly-’ (which abbreviate complementizer 2 + Noun class agreement/concord: class n (pl)). This search matches across several

(I used the Firefox search interface’s nifty “highlight all” to highlight the matches.)
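A search that “matches anywhere in any of the fields” can be sketched roughly as follows. This is only an illustration under assumptions: the field list, the case-insensitive comparison, and the function name are mine, not the site’s actual code. The two records are abridged from the sentences.json excerpt shown later in this post.

```javascript
// Two sentence records abridged from the sentences.json excerpt in this post.
const sentences = [
  {
    utterance: "Ajǝŋgwara na Abalimi amǝɽa",
    morphemes: "Ajǝŋgwara na Abalimi amǝɽa",
    gloss: "crow comp2 Abalimi deceiver",
    translation: "The Crow and Abalimi the deceiver",
  },
  {
    utterance: "Lomanǝŋ pǝnde ram,",
    morphemes: "Loma-nǝŋ pǝnde ram",
    gloss: "day-indef past early",
    translation: "Once upon a time,",
  },
];

// Assumption: these four tiers are the ones worth searching.
const SEARCH_FIELDS = ["utterance", "morphemes", "gloss", "translation"];

// Return every record in which at least one field contains the query string.
function searchAnywhere(records, query) {
  const needle = query.toLowerCase();
  return records.filter((record) =>
    SEARCH_FIELDS.some((field) =>
      (record[field] || "").toLowerCase().includes(needle)
    )
  );
}

console.log(searchAnywhere(sentences, "comp2").length); // 1
console.log(searchAnywhere(sentences, "once upon").length); // 1
```

Because the morphemes tier keeps its hyphens, a query like a hyphenated prefix sequence can match that tier even when the orthographic tier writes the word without hyphens.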

How the site works

This site is almost entirely generated dynamically by the browser.

What the heck does that mean?

Well, you probably know that web sites are “made out of” HTML files. HTML is just a plaintext format where you “mark up” different bits of your document using <tags> <like> <these>. Except that they are tags that mean things: so a <p> is a paragraph, an <a> is an “anchor” or hyperlink, and so forth.

See more info on HTML here.

But if you “view source” on the Moro web site, this is what you see:

<!-- index.html -->
<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8" />
    <title>Moro Database</title>
    <link  rel="stylesheet" type="text/css" href="https://cdnjs.cloudflare.com/ajax/libs/semantic-ui/2.1.4/semantic.min.css"/>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/lodash.js/4.0.0/lodash.min.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/react/0.13.3/JSXTransformer.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/react/0.13.3/react.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/react-router/0.13.3/ReactRouter.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/semantic-ui/2.1.4/semantic.min.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/URI.js/1.17.1/URI.min.js"></script>
  </head>
  <body>
    <div id="content"></div>
      <script type="text/jsx" src="utils.jsx"></script>
      <script type="text/jsx" src="moroScript.jsx"></script>
  </body>
</html>

What the heck?? Where’s the stuff?

Well, note this paragraph from the about page:

The website, including the concordance, is generated from a json file from a single script developed by Hannah Sande, Marcus Ewert, and Maytas Monsereenusorn, available on Github. The development of this corpus was supported by a grant from the Hellman Fellows Fund.

So what does that mean? It means that this is what happens when you load https://linguistics.berkeley.edu/moro/:

Load the skeleton `HTML`

Your web browser (Firefox, Chrome, Safari, whatever) reads the HTML content excerpted above. So far, there’s nothing to see in the actual browser window.

Load the linked Javascript programs

When the browser gets to all those <script> tags, it loads the Javascript programs specified in the src attributes — so it’s actually going to load all these Javascript files:

Now, if you click through those files you’ll see some scary-looking stuff that has nothing to do with linguistics. That’s because the site is built with what are called Javascript libraries — in this case, things like “React” and “Lodash” and lots of other stuff that won’t mean much to you at all if you’re just starting to learn about web development.

However, I do encourage you to take a look at the last link:

http://linguistics.berkeley.edu/moro/moroScript.jsx

This is the script that the authors are talking about in the quote above: the stuff that’s specific to the Moro language project (and it makes use of all the other libraries like React and so forth). Reading other people’s code can be a challenge (I don’t understand this stuff, because I haven’t learned React!), but one thing you can do is just look at the comments. In Javascript, comments can begin with two forward slashes. Here are all the comments in moroScript.jsx:

      //Bottom of this doc sets up page structure and references components created above
      //Global variable for moro_click database
      //These are imports from ReactRouter o.13.x
      //docs: https://github.com/rackt/react-router/blob/0.13.x/docs/guides/overview.md
      // These are endpoints to load data from.
      // Loaded from static files in the repository rather than from lingsync.
      // Static file with sentences.
      // Static file with stories.
      // Promise that is resolved once the sentence data is loaded
      // Promise that is resolved once stories are loaded
      //===========================================Dictionary Code===========================================
      //get id of all occurrences of the morpheme and definition pair from the global_id_to_morpheme_definition
            //{sentence_id:dirtydata.rows[i].id, utterance_match:sentence.utterance, morphemes_match:sentence.morphemes, gloss_match:sentence.gloss, translation_match:sentence.translation});
        //console.log(results);
      //Segments a word into morphemes with glosses; morphemes from 'word' argument, glosses from 'glossword' argument
        //if there is not the same number of dashes we aren't aligning the correct morphemes and gloss
        //identify verb roots so we can distinguish prefixes from suffixes
          //all verb root morphemes end with .rt or .aux
          //TODO: does this include be.loc, be.1d, be.2d, etc? @HSande for details
        //iterate over morphemes; if there is a verb root, add pre-dashes to suffixes and post-dashes to prefixes: 
        //example: g-a-s-o; clg-rtc-eat.rt-pfv = [g-, a-, s, -o]; [clg-, rtc-, eat.rt, -pfv]
          // Remove punctuation, make lower case, and replace all "Latin Letter
          // Small Schwa" characters with "Latin Letter Smal E" characters, so
          // there is just one schwa character in the corpus. 
    //merge two arrays and de-duplicate items
    //remove duplicate items for click morpheme_definition_pair_list
    //Remove punctuation from string excluding dashes and period in word
    //Process dict with count to sorted dict without count value
            // split on spaces and remove punctuation from morphemes line
                //process all morphemes and words
                //remove duplicate pair 
                //add the morpheme definition pair list for each sentence into the global variable
    //Print out result dict
    //console.log(JSON.stringify(results))
    //console.log(JSON.stringify(global_id_to_morpheme_definition))
    //console.log("DONE")
    //return morphemes/glosses by moro morphemes
      // This is a test for processing code
      //test_processdata();
      // promise that resolves when sentence data is loaded and processed into morpheme dictionary
      //Dictionary viewing code
      //ReactClass for rendering a definition
      // ReactClass for rendering many definitions
      //SEARCH CODE
      //matchSearchFunc for definition to searchTerm (EngPlain)
      //matchSearchFunc for definition to searchTerm (EngRegex)
      //matchSearchFunc for moroword to searchTerm (MoroPlain)
      //matchSearchFunc healper for moroword to searchTerm (without regrex)
      //matchSearchFunc for moroword to searchTerm (MoroRegex)
      //matchSearchFunc healper for moroword to searchTerm (with regrex)
          // if (categories[i] === moroword) {
      // React container for rendering 1 page of dictionary entries, with a
      // header and footer for page navigation.
          // TODO: We might have to compute the alphabet on-demand here, since
          // our skips are going to be wrong.
      // React container that will show a loading dimmer until the dictionary data is available; then renders definitions
            // Find the first index of each letter, grouping numbers.
      // Dictionary view with concordance.
//===================================================Text Page==================================
      // React Class that renders list of stories with links to story content pages (w/loading dimmer)
      // A component to render a single sentence.
          // interlinear gloss alignment
            // render one inline block div containing morpheme and gloss per word
          // render utterance and translation
      //React Class for a single story view
        //React object state
        //
        //sentence: loaded flag and sentence data
        //story: loaded flag and story data
        //show_gloss: flag true if we show interlinear gloss lines
        //queue uploading of story and sentence data when this component is mounted
        //only ready to display story when story and sentence data have loaded
        // Get the story object
        //return name of story by searching story data for this story's id
        //return author of story by searching story data for this story's id
        //toggles interlinear gloss or not
        //toggles story view
        //renders component
          // If we haven't loaded yet, just render the dimmer.
          // process sentence data to render alignment of morphemes/glosses and show one clause per line
          // lodash chaining: https://lodash.com/docs#_
            // render sentences from this story
              // how to render a sentence
          // render story content page with title and checkbox to toggle interlinear gloss display
//=========================HOMEPAGE===============================
//=========================GLOSS PAGE===============================
//=========================Search Page===============================
        //queue uploading of story and sentence data when this component is mounted
        //only ready to display story when story and sentence data have loaded
      //render page template using ReactRouter: https://github.com/rackt/react-router/blob/0.13.x/docs/guides/overview.md
      // set up routes for ReactRouter: https://github.com/rackt/react-router/blob/0.13.x/docs/guides/overview.md
      // enables the single-page web app design

Obviously lots of this will make no sense to you, but some will — at least you can see where the “dictionary bits” and the “sentence bits” and stuff like that are.
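One of the comments above describes how a word is segmented into morphemes and glosses, with the example `g-a-s-o; clg-rtc-eat.rt-pfv = [g-, a-, s, -o]; [clg-, rtc-, eat.rt, -pfv]`. Here is a minimal sketch of that idea, written from the comments alone rather than copied from the script. The function name is hypothetical, and what happens on a mismatch or when there is no verb root is my assumption.

```javascript
// Segment a word and its gloss into aligned pieces, marking prefixes with a
// trailing dash and suffixes with a leading dash, as the comments describe.
function segmentWord(word, glossWord) {
  const morphemes = word.split("-");
  const glosses = glossWord.split("-");

  // If the dash counts differ, morphemes and glosses cannot be aligned.
  // Returning null here is an assumption about how to signal that.
  if (morphemes.length !== glosses.length) return null;

  // Per the comments, verb root glosses end with ".rt" or ".aux".
  const rootIndex = glosses.findIndex(
    (gloss) => gloss.endsWith(".rt") || gloss.endsWith(".aux")
  );

  // Assumption: with no verb root, leave the pieces undecorated.
  if (rootIndex === -1) return { morphemes, glosses };

  // Items before the root are prefixes, items after it are suffixes.
  const decorate = (item, index) => {
    if (index < rootIndex) return item + "-";
    if (index > rootIndex) return "-" + item;
    return item;
  };

  return {
    morphemes: morphemes.map(decorate),
    glosses: glosses.map(decorate),
  };
}

console.log(segmentWord("g-a-s-o", "clg-rtc-eat.rt-pfv"));
// { morphemes: [ 'g-', 'a-', 's', '-o' ], glosses: [ 'clg-', 'rtc-', 'eat.rt', '-pfv' ] }
```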

The scripts load the data

Now, the actual data — the stuff in Moro, the documentation — is kept separately in its own files. There are two of them.

One is a metadata index of stories:

{
  "total_rows": 22,
  "offset": 0,
  "rows": [
    {
      "id": "25b07c7b8a735c578cccf0fc5236c0fd",
      "key": "25b07c7b8a735c578cccf0fc5236c0fd",
      "value": {
        "name": "The Crow and Abalimi the deciever",
        "author": "Wesley Suleiman Basher"
      }
    },
    {
      "id": "32a4e729a4c1d2278bec26f69b0740ae",
      "key": "32a4e729a4c1d2278bec26f69b0740ae",
      "value": {
        "name": "What do Moro people do?",
        "author": "Angelo Ngalloka Nasir"
      }
    },
     // lots more here
  ]
}

and the other is a big list of sentences:

{
  "total_rows": 3457,
  "offset": 0,
  "rows": [
    {
      "id": "25b07c7b8a735c578cccf0fc5236e502",
      "key": [
        "25b07c7b8a735c578cccf0fc5236c0fd",
        1431754783906,
        "25b07c7b8a735c578cccf0fc5236e502"
      ],
      "value": {
        "story": "25b07c7b8a735c578cccf0fc5236c0fd",
        "sentence": {
          "judgement": "",
          "utterance": "Ajǝŋgwara na Abalimi amǝɽa",
          "morphemes": "Ajǝŋgwara na Abalimi amǝɽa",
          "gloss": "crow comp2 Abalimi deceiver",
          "translation": "The Crow and Abalimi the deceiver",
          "tags": "",
          "syntacticCategory": "N Comp N N",
          "syntacticTreeLatex": "",
          "validationStatus": "",
          "enteredByUser": "angalonasir",
          "modifiedByUser": ""
        }
      }
    },
    {
      "id": "25b07c7b8a735c578cccf0fc5236ff86",
      "key": [
        "25b07c7b8a735c578cccf0fc5236c0fd",
        1431754838487,
        "25b07c7b8a735c578cccf0fc5236ff86"
      ],
      "value": {
        "story": "25b07c7b8a735c578cccf0fc5236c0fd",
        "sentence": {
          "judgement": "",
          "utterance": "Lomanǝŋ pǝnde ram,",
          "morphemes": "Loma-nǝŋ pǝnde ram",
          "gloss": "day-indef past early",
          "translation": "Once upon a time,",
          "tags": "",
          "syntacticCategory": "Adv Adv Adv",
          "syntacticTreeLatex": "",
          "validationStatus": "",
          "enteredByUser": "angalonasir",
          "modifiedByUser": ""
        }
      }
    },
    // tons of more sentences here…
  ]
}

The loading is actually done inside moroScript.jsx. The actual syntax has to do with a weird thing called a Promise and AJAX and other blah blah things that we can talk about some other time. The main point of this whole discussion is that the code and the data are kept separate. And I think you’ll agree that the data itself, which is in the JSON format, is pretty easy to understand (aside from the computer-speak-ish ids and keys!).
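For the curious, here is a minimal sketch of what “load two JSON files, then link them up” might look like. It is not the code from moroScript.jsx. The names `loadCorpus`, `getJson`, and `sentencesByStory` are hypothetical, the relative URLs are assumptions, and so is my reading of the number inside each `key` array as a timestamp that orders sentences within a story.

```javascript
// Hypothetical loader. `getJson` is any function that takes a URL and returns a
// Promise of parsed JSON; in a browser it could be
//   (url) => fetch(url).then((response) => response.json())
// The relative URLs are assumptions; the real script defines its own endpoints.
function loadCorpus(getJson) {
  return Promise.all([getJson("stories.json"), getJson("sentences.json")]).then(
    ([stories, sentences]) => ({ stories, sentences })
  );
}

// Group sentence rows by story id. Assumption: the number in the middle of each
// "key" array is a timestamp that gives the order of sentences within a story.
function sentencesByStory(sentences) {
  const byStory = {};
  for (const row of sentences.rows) {
    const storyId = row.value.story;
    if (!byStory[storyId]) byStory[storyId] = [];
    byStory[storyId].push(row);
  }
  for (const storyId of Object.keys(byStory)) {
    byStory[storyId].sort((a, b) => a.key[1] - b.key[1]);
  }
  return byStory;
}

// Two rows abridged from the excerpt above, deliberately out of order.
const sample = {
  rows: [
    {
      id: "25b07c7b8a735c578cccf0fc5236ff86",
      key: ["25b07c7b8a735c578cccf0fc5236c0fd", 1431754838487, "25b07c7b8a735c578cccf0fc5236ff86"],
      value: {
        story: "25b07c7b8a735c578cccf0fc5236c0fd",
        sentence: { translation: "Once upon a time," },
      },
    },
    {
      id: "25b07c7b8a735c578cccf0fc5236e502",
      key: ["25b07c7b8a735c578cccf0fc5236c0fd", 1431754783906, "25b07c7b8a735c578cccf0fc5236e502"],
      value: {
        story: "25b07c7b8a735c578cccf0fc5236c0fd",
        sentence: { translation: "The Crow and Abalimi the deceiver" },
      },
    },
  ],
};

const grouped = sentencesByStory(sample);
console.log(
  grouped["25b07c7b8a735c578cccf0fc5236c0fd"].map(
    (row) => row.value.sentence.translation
  )
);
// [ 'The Crow and Abalimi the deceiver', 'Once upon a time,' ]
```

`Promise.all` is what lets the page wait until both files have arrived before trying to draw anything.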

Your laptop (or phone) is doing all the heavy lifting

So all this business is what makes the Moro Database site a “client side application”. There is a “web server” running at linguistics.berkeley.edu, and that server waits around until it receives a request for a particular URL, as when someone types https://linguistics.berkeley.edu/moro into the URL bar of their browser.

It’s the URL that tells that web server what files to return. In fact, you can think of a web server as doing one of two things when it receives a request:

1. Return a file
2. Run a program, generate some output, and return that as a file

So what’s interesting about the way the Moro Database site is functioning is that, as far as the server is concerned, it is only doing #1. The conversation goes like this (the “client” is you, on your laptop! The “server” is a computer sitting on a shelf somewhere that is running a web server).

Client:

“Hey, can you send me whatever is at this URL: https://linguistics.berkeley.edu/moro ”

Server:

“Oh yes, let’s see, I have an HTML file called index.html which is associated with that URL, here you go…”

The server sends back the HTML we saw above.

Client:

The browser loads the page and finds all those <script> tags we talked about. The browser knows that it needs to go get those scripts and run them. So it makes more requests for each of those Javascript files, and runs them. When one of those files, moroScript.jsx, is run, it instructs the browser to make still more requests, for the two JSON data files, stories.json and sentences.json.

Server:

Here’s jquery.min.js, lodash.min.js, …blah blah… moroScript.jsx, oh, and here’s stories.json and sentences.json.

Later.

And then the server is donesies. When you run those searches we were talking about above, the server has no idea that you are doing that. There are no more requests going on, all the computer code that is running is running inside your browser. It’s your processor that is doing the search, not the processor on the computer called linguistics.berkeley.edu.

And I should care about all of this because…

Good question, dear interlocutor.

Here’s why:

Maintaining a website of this kind, a client-side application where all the code is embedded in HTML files, is as easy as maintaining a website which is just “static” HTML files.

And that is a very feasible thing to do. In the case of the Moro Database we’ve been looking at, literally everything that you would need to run that site is contained inside a single folder. You “deploy” that folder to a web host — just about any webhost — and it will work. You don’t have to run custom server code on that host (which is complicated, expensive, difficult to maintain…). You just need the server function #1: listen for requests for files, and return them.
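To make “listen for requests for files, and return them” concrete, here is a toy model of that one job. It is not a real web server: the folder is just an object in memory, and the paths and file contents are placeholders I made up (I have not checked where the JSON files actually live on the Berkeley server).

```javascript
// A pretend site folder. Paths and contents are placeholders, not the real files.
const siteFolder = {
  "/moro/index.html": "<!DOCTYPE html> ...skeleton page...",
  "/moro/moroScript.jsx": "// ...the project's script...",
  "/moro/stories.json": '{ "rows": [] }',
  "/moro/sentences.json": '{ "rows": [] }',
};

// The whole job of a static server: look the path up, return the file or a 404.
// No program is run to build the response.
function handleRequest(files, path) {
  // Assumption: a path ending in "/" means "the index.html in that folder".
  const resolved = path.endsWith("/") ? path + "index.html" : path;
  if (Object.prototype.hasOwnProperty.call(files, resolved)) {
    return { status: 200, body: files[resolved] };
  }
  return { status: 404, body: "Not found" };
}

console.log(handleRequest(siteFolder, "/moro/").status); // 200
console.log(handleRequest(siteFolder, "/moro/sentences.json").status); // 200
console.log(handleRequest(siteFolder, "/moro/search?q=crow").status); // 404
```

The last request is the interesting one: a static server has nothing to say about searches, because in this design the searching happens in the browser after the files have arrived.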

There is a lot more to all of this, but I hope that the notion of a “client-side application” is a bit clearer. (Maybe they should be called “browser-side applications”!)

docling.js is also based on this premise, he said, in a footnote.

xrotwng · April 26, 2022, 6:50am

I guess my obligatory reply: Wouldn’t it be nice if this “text viewer” would work with any collection of glossed texts? Then it would work with the Tsez Annotated Corpus, too! And if that were the case, there’d be a cleaner boundary between data and software - and then the data might be more re-usable and possibly longterm-archived with a DOI at a service designed for this (as opposed to being “archived” with GitHub, until Microsoft decides to sell it)?

Hilaria · April 26, 2022, 11:52am

This is lovely. I want a site like this for my Chatino texts.

Hilaria · April 26, 2022, 11:53am

Thank you so much Pat for walking us thru this.

pathall · April 26, 2022, 4:21pm

Yeah, I agree that we need to get to that kind of interoperability for sure. Actually I was comparing the Tsez data and the Moro data, and the structures used are already pretty close. These files are fairly comparable:

Moro | Tsez
- | -
sentences.json | examples.csv
stories.json | texts.csv

stories.json versus texts.csv:

There is some wonkiness in the Moro data (id seems to == key? why use value everywhere?), but as far as the fields that matter:

Moro	Tsez
`id`	`ID`
`name`	`Name`
`author`	-
-	`description`

CLDF field	value
ID	1-1
Language_ID	dido1241
Primary_Text	`Esin šebi xecin šebi zownƛax sis ɣˤana-xediw.`
Analyzed_Word	`esi-n\tšebi\txeci-n\tšebi\tzow-n-ƛax\tsis\tɣˤana-xediw`
Gloss	`tell-PFV.CVB\twhat\tleave-PFV.CVB\twhat\tbe.NPRS-PST.UNW-QUOT\tone\tmarried.couple`
Translated_Text	`What is to be said, what is to be left out -- there was a couple.`
Meta_Language_ID	stan1293
Comment
Text_ID	1
Russian_Translation	О чем рассказать и что оставить, жили были муж и жена.
Part_of_Speech	v-vsuf pron v-vsuf pron v-vsuf-suf num n1pl

Again the Moro db structure is a bit odd, but a lot of the same stuff is in there:

Moro database sentence-level JSON object

{
      "id": "25b07c7b8a735c578cccf0fc5236e502",
      "key": [
        "25b07c7b8a735c578cccf0fc5236c0fd",
        1431754783906,
        "25b07c7b8a735c578cccf0fc5236e502"
      ],
      "value": {
        "story": "25b07c7b8a735c578cccf0fc5236c0fd",
        "sentence": {
          "judgement": "",
          "utterance": "Ajǝŋgwara na Abalimi amǝɽa",
          "morphemes": "Ajǝŋgwara na Abalimi amǝɽa",
          "gloss": "crow comp2 Abalimi deceiver",
          "translation": "The Crow and Abalimi the deceiver",
          "tags": "",
          "syntacticCategory": "N Comp N N",
          "syntacticTreeLatex": "",
          "validationStatus": "",
          "enteredByUser": "angalonasir",
          "modifiedByUser": ""
        }
      }
    }

Moro field	Value
judgement
story	`25b07c7b8a735c578cccf0fc5236c0fd`
utterance	`Ajǝŋgwara na Abalimi amǝɽa`
morphemes	`Ajǝŋgwara na Abalimi amǝɽa`
gloss	`crow comp2 Abalimi deceiver`
translation	`The Crow and Abalimi the deceiver`
tags
syntacticCategory	`N Comp N N`
syntacticTreeLatex
validationStatus
enteredByUser	`angalonasir`
modifiedByUser

So we can compare the fields:

CLDF	Moro
`ID`	-
`Language_ID`	-
`Primary_Text`	`utterance`
`Analyzed_Word`	`morphemes`
`Gloss`	`gloss`
`Translated_Text`	`translation`
`Meta_Language_ID`	-
`Comment`	-
`Text_ID`	`story`
`Russian_Translation`	-
`Part_of_Speech`	`syntacticCategory`
-	`tags`
-	`syntacticTreeLatex`
-	`validationStatus`
-	`enteredByUser`
-	`modifiedByUser`

I bolded the ones that seem absolutely indispensable (to me, anyway), and those are present in both systems. So given this similarity, it’s totally possible to load up the Tsez data in the Moro viewer.
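As a rough sketch of what that conversion could look like, the function below reshapes one Tsez example row into the Moro sentence shape, following the field comparison above. The function name is hypothetical, the Moro-only fields are simply left empty, and I have left out the Moro `key` array (the Tsez row has nothing that corresponds to its middle number); whether the viewer would work without it is untested.

```javascript
// Convert one row of the Tsez examples.csv into the sentence shape used in the
// Moro sentences.json, following the field comparison above.
function tsezExampleToMoroRow(row) {
  // Moro keeps words space-separated inside one string; the Tsez columns are
  // tab-separated, so swap the delimiter.
  const tabsToSpaces = (text) => (text || "").split("\t").join(" ");
  return {
    id: row.ID,
    value: {
      story: row.Text_ID,
      sentence: {
        judgement: "",
        utterance: row.Primary_Text,
        morphemes: tabsToSpaces(row.Analyzed_Word),
        gloss: tabsToSpaces(row.Gloss),
        translation: row.Translated_Text,
        tags: "",
        syntacticCategory: row.Part_of_Speech || "",
        syntacticTreeLatex: "",
        validationStatus: "",
        enteredByUser: "",
        modifiedByUser: "",
      },
    },
  };
}

// The Tsez example quoted above.
const tsezRow = {
  ID: "1-1",
  Primary_Text: "Esin šebi xecin šebi zownƛax sis ɣˤana-xediw.",
  Analyzed_Word: "esi-n\tšebi\txeci-n\tšebi\tzow-n-ƛax\tsis\tɣˤana-xediw",
  Gloss: "tell-PFV.CVB\twhat\tleave-PFV.CVB\twhat\tbe.NPRS-PST.UNW-QUOT\tone\tmarried.couple",
  Translated_Text: "What is to be said, what is to be left out -- there was a couple.",
  Text_ID: "1",
  Part_of_Speech: "v-vsuf pron v-vsuf pron v-vsuf-suf num n1pl",
};

console.log(JSON.stringify(tsezExampleToMoroRow(tsezRow), null, 2));
```

One thing a reshaping like this cannot supply: the script’s comments say verb roots are recognized by glosses ending in `.rt` or `.aux`, and the Tsez glosses do not follow that convention, so any display logic that depends on it would presumably behave differently.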

Gonna try, please hold…

pathall · April 26, 2022, 5:14pm

Hi @Hilaria!

Yes, we should get back to that project we were working on with the interlinearized versions of your storybooks… trying to remember where it was. Here, maybe?

Doesn’t seem to be working at the moment

Truly my pleasure! Helping people learn about the web for documentation is my primary fuel.

xrotwng · April 26, 2022, 6:45pm

Of course it’s possible. But it would require “coding the data to the implementation” and not “to the interface” - as I mentioned elsewhere.

pathall · April 26, 2022, 7:16pm

So when you say “interface”, you mean CLDF right?

In other words, it would be nice if the Moro interface were rewritten to accept CLDF data as an input.

But who defines that interface? I think we should expect a process of refinement, maybe something like the web standards process. I for one, for instance, am not convinced that a model based on tabular/CSV data (as opposed to more hierarchical JSON data) is the right starting point for documentary data. That’s because it ends up requiring fields with delimited content. (This kind of thing: esi-n\tšebi\txeci-n\tšebi\tzow-n-ƛax\tsis\tɣˤana-xediw.) A field containing delimiters requires a parser, and now we’re coding data to the implementation again, right?
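To show what I mean by “requires a parser”, here is a minimal sketch that turns the two tab-delimited Tsez fields into nested, JSON-style data. The function name and the output shape are my own assumptions, not part of any specification.

```javascript
// Parse the tab-delimited Analyzed_Word and Gloss cells into one object per word.
function alignWords(analyzedWord, gloss) {
  const words = analyzedWord.split("\t");
  const glosses = gloss.split("\t");
  if (words.length !== glosses.length) {
    throw new Error("word and gloss counts differ");
  }
  return words.map((word, index) => ({
    word: word,
    gloss: glosses[index],
    // A second level of delimiters, hidden inside each word.
    morphemes: word.split("-"),
    morphemeGlosses: glosses[index].split("-"),
  }));
}

const aligned = alignWords(
  "esi-n\tšebi\txeci-n\tšebi\tzow-n-ƛax\tsis\tɣˤana-xediw",
  "tell-PFV.CVB\twhat\tleave-PFV.CVB\twhat\tbe.NPRS-PST.UNW-QUOT\tone\tmarried.couple"
);

console.log(aligned.length); // 7
console.log(aligned[4].morphemes); // [ 'zow', 'n', 'ƛax' ]
console.log(aligned[6].morphemes.length, aligned[6].morphemeGlosses.length); // 2 1
```

The last line shows why the sketch does not try to pair morphemes one to one: in this very example, ɣˤana-xediw has a hyphen but its gloss married.couple does not, so a naive hyphen split gives two pieces on one side and one on the other. That is the kind of decision a parser has to make and a nested structure would have made explicit.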

xrotwng · April 27, 2022, 6:43am

Ok, that’s going to be a long one

First off, yes, I meant it would be nice if the Moro interface was written to accept CLDF data as input. More generally, though, I think any specified, domain-specific data format would be nice. While I get the xkcd point about standards, I’d still prefer the 15th specified standard for interlinear glossed text over the nth “bespoke software bundled with data in its internal format” (having seen databases that died in no-longer readable Filemaker files, or data inaccessible because client-side apps didn’t survive archiving).
So, regarding “who defines this interface”, I’m all for people thinking about data models. But it would be beneficial if these thoughts result in something with a workable specification.

As a developer I got quite frustrated with having to come up with or cater to custom data representations all the time. But then, I’m one of the few software people that have been around for a long time in linguistics. Typically, software careers in academia seem rather short-lived, leading to the proliferation of one-off data models and software tools.
So maybe the gain of standardized data formats is just not that big for most people in linguistics, because learning up on one standard and working around its flaws might well take longer than writing the nth custom SFM or custom ELAN XML parser.

And maybe it’s again this academic environment that’s prohibiting development of good standards, because that takes time as well - ideally time contributed by dedicated people over a longish period. And building a community around a standard takes a long time as well - and community building is hard work as you know

Now to your specific point about CLDF possibly being the wrong choice for “documentary data”. That’s absolutely possible, considering the motivations and design goals behind CLDF. The use cases we wanted to address with CLDF were somewhat well-defined visualization/analysis tasks in Typology and Historical Linguistics such as

plotting typological survey data on a WALS-like map
feeding wordlists into a tool such as EDICTOR for cognate coding
feeding cognate-coded lexical data into phylogenetic algorithms

So as opposed to many other standardization attempts, CLDF is narrower by focusing on data types with somewhat clear-cut automated re-use cases. But CLDF is also fairly encompassing: It’s built on top of CSVW which is basically a serialization format for relational databases (i.e. tabular data plus foreign keys).
Thus, it would be possible to shoe-horn any hierarchical data into a CLDF dataset, albeit mostly in additional CSVW tables without specific CLDF (or linguistic) semantics.

But in the particular case of the Moro data (or the Tsez Annotated Corpus), it turns out that the somewhat narrow CLDF semantics are fully adequate to encode the datasets:

the glossed sentences can be modeled as rows in an ExampleTable,
the stories can be modeled as rows in a ContributionTable
and lines can be linked to stories using a contributionReference foreign key.

(Note that ContributionTable wasn’t part of CLDF 1.0, so for the Tsez dataset, a custom, non-CLDF-specified table is used for texts. So there is some “process of refinement” going on for CLDF - although a lot less formalized than the one for web standards.)

This full adequacy also extends to the word separators in the alignments. CSVW allows columns to be specified as containing lists of values, and CLDF uses this method for word-segmented text and gloss. (Admittedly, this method only allows for one more level of hierarchy within table cells, but in this case that’s sufficient, because morpheme-segmentation is specified by the Leipzig Glossing Rules.)
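A small sketch of what “specifying columns as lists” buys you: the separator is declared once in the metadata, and a generic reader splits cells without knowing anything about glossing. The `separator` property is part of CSVW column descriptions; the column list below is a simplified illustration, not a copy of the Tsez dataset’s metadata file, and `readRow` is a hypothetical helper.

```javascript
// Column descriptions in the style of CSVW metadata (simplified illustration).
const columns = [
  { name: "Primary_Text" },
  { name: "Analyzed_Word", separator: "\t" },
  { name: "Gloss", separator: "\t" },
];

// Generic reader: split a cell into a list only if its column declares a separator.
function readRow(columnDescriptions, rawRow) {
  const parsed = {};
  for (const column of columnDescriptions) {
    const cell = rawRow[column.name];
    if (column.separator === undefined) {
      parsed[column.name] = cell;
    } else {
      // An empty cell is read as an empty list rather than [""].
      parsed[column.name] = cell === "" ? [] : cell.split(column.separator);
    }
  }
  return parsed;
}

const parsedRow = readRow(columns, {
  Primary_Text: "Esin šebi xecin šebi zownƛax sis ɣˤana-xediw.",
  Analyzed_Word: "esi-n\tšebi\txeci-n\tšebi\tzow-n-ƛax\tsis\tɣˤana-xediw",
  Gloss: "tell-PFV.CVB\twhat\tleave-PFV.CVB\twhat\tbe.NPRS-PST.UNW-QUOT\tone\tmarried.couple",
});

console.log(parsedRow.Analyzed_Word.length, parsedRow.Gloss.length); // 7 7
```

The reader contains no linguistics at all, which is the point: the knowledge about delimiters lives in the dataset’s description, not in each tool.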

So, wrapping up, I do think that CLDF might be a suitable data format for the Moro Database. I wouldn’t push it as the format for documentary data, though, but would hope that more people see the benefits of well-specified data formats.

yuni · April 28, 2022, 12:36pm

I know Hannah Sande, so I’ll let her know about this discussion. I’m sure she’ll be happy to see the interest in the Moro page, and maybe she can share some of her thoughts on its development.

pathall · April 28, 2022, 12:45pm

Oh that would be great @yuni!

I’ll send you an invite link that will allow her to log in and bring her directly to this discussion.

Thanks!

faytak · April 28, 2022, 1:41pm

I also know Hannah and I think she is already a member? ‘HannahLS’ or something like that; she is clearly in the intro thread back in 2020.

This is a very interesting approach - would this reduce, in theory, the amount of data sent over the network while browsing these materials? That aspect would be really useful if (as the NSF DLI grants are forcing upon us) infrastructure that we develop must be ‘digital infrastructure’. Right now it’s hard to tell people in, say, Cameroon that we’re building them a dictionary but it’s online and they’ll have to pay out the nose in terms of mobile data every time they access it.

Andrew_Harvey · April 28, 2022, 1:47pm

This is a great point – my documentation outputs are essentially useless from a community perspective if accessing them is costing people huge amounts of data

pathall · April 28, 2022, 1:59pm

Oh good grief, sorry @yuni , @HannahLS is indeed here already. My old man .

This is a very interesting question, worthy of a thread on its own. So the Moro project’s approach is basically “send everything” in two files, sentences.json and stories.json. The Javascript then processes all of that and builds out the interface.

Of course, resources like that will be cached by HTTP, so in principle they shouldn’t be downloaded more than once. Ironically, sometimes when people are querying a dictionary over and over and getting “shorter” paginated results, over time you end up downloading much more data than the “whole” database. I think this is something that requires experimentation and testing, especially in the context of constrained bandwidth. Another thing that comes to mind is compression: the web supports gzip-ing things that go back and forth, via headers I think, but to be honest I don’t know much about how that works myself. Worth investigating.
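The trade-off can at least be framed as back-of-the-envelope arithmetic. The sketch below assumes perfect caching of the full download and ignores headers and compression; every number in it is hypothetical and was not measured on the Moro site or anywhere else.

```javascript
// Compare total bytes transferred under the two strategies.
// Assumption: the full download is fetched once and then always served from cache.
function totalTransfer(fullDownloadBytes, bytesPerQuery, queries) {
  return {
    sendEverything: fullDownloadBytes,
    paginated: bytesPerQuery * queries,
  };
}

// Number of queries at which the paginated approach has used at least as much
// data as downloading everything once.
function breakEvenQueries(fullDownloadBytes, bytesPerQuery) {
  return Math.ceil(fullDownloadBytes / bytesPerQuery);
}

// Hypothetical sizes: a 2,000,000-byte corpus versus 20,000 bytes per result page.
console.log(breakEvenQueries(2000000, 20000)); // 100
console.log(totalTransfer(2000000, 20000, 150));
// { sendEverything: 2000000, paginated: 3000000 }
```

Real measurements would be needed to know which side of the break-even point typical users land on.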

faytak · April 28, 2022, 2:03pm

It also seems quite possible to do this entirely locally - give community members the big files on a thumb drive or something, and then load the “web page” (which could be offline, in theory?) which unpacks and organizes the data. This is a model I’d like to try to elaborate on in an upcoming grant application so I am paying very close attention right now

HannahLS · April 28, 2022, 2:50pm

Hi all! Great to see such an interest in the Moro Story Corpus. I’m still catching up on this thread, and I’ll write again soon with an actual response to some of your points!

pathall · April 28, 2022, 3:05pm

Yes, I agree that this approach has legs. In fact, this is exactly what interested me about the Moro project.

The basic outline of the project is very simple:

There are some more Javascript libraries that are brought in from remote URLs (see here), but it would be just as easy to distribute those in a js/ subdirectory. And then the whole system can function independently of the web, assuming there is a web server running on a localhost. And the Moro DB team foresaw this too:

That line is one way to start up a localhost using PHP. (The -S flag starts PHP’s built-in web server.) But the same thing can be done in many languages; for Python you could do:

python -m http.server

Javascript is now a language outside of the browser too — I personally am a fan of deno (much simpler than npm), which can create a server pretty simply:

Interestingly, deno apps can even be compiled into an executable, cross-platform, so you could offer Windows, Linux, and Mac versions of a server. I believe that could just be included in your thumb drive in that scenario, and double-clicked… it would be fun to try this all out together.

The final benefit of this kind of approach, I think, is that if you do want to deploy to the web, it’s just a matter of learning how to copy a repository directory to a web host, no special server code required. Of course, if we’re talking collaboration and security and auth and stuff, it gets more complicated. But for just “publishing” more documentary data in an interactive way, as the Moro example shows, this approach can work.

HannahLS · May 4, 2022, 4:25am

To follow up, it seems like one question folks have is about why this particular format was used for the data. The data is in the format output by (CouchDB-based) LingSync (which is no longer actively used), which happened to be where the Moro data was stored before its current location. We’ve played with the idea of creating easy imports from other data sources, like from Flex or a csv. If there’s interest in that, it wouldn’t be too much work to throw together.

Also, re:

We’ve thought about using something like Electron (https://www.electronjs.org/) to make a static version of the site accessible offline, for use by communities with little internet access.

HannahLS · May 4, 2022, 4:27am

I’d be interested to hear about it if you come up with a good way to do this!

fauxneticien · May 4, 2022, 4:49am

If thumb drive deployment is an option, then the communities have access to laptops/Android phones/etc.? I think Nick Thieberger has used LibraryBox in the past just to leave a WiFi hotspot people can connect to and view material from (and the audio/video/etc. can be compressed and loaded onto there as well).