💻 What format should I store data in?

Continuing the discussion from Anybody else working with historical documents?:

Albert, I took the liberty of moving your question into its own topic, because I hope we can talk about it in depth.

Choices about data formats depend on several factors. Here’s my list:

  • Existing workflows
  • Software choices
  • What do you want to do with the data?

Quickly, in case any readers are not familiar with JSON and/or XML:

JSON stands for JavaScript Object Notation. The “JavaScript” bit is really kind of misleading, as JSON is not JavaScript-specific. (JavaScript is the default programming language built into web browsers.) The “Object Notation” bit is the key part: “objects” are just a computational way to write down structured data, that is, data that has parts. For instance, we could think of a “word” in an interlinear gloss as having at least two “parts”, which we could call a form and a gloss:

{
  "form": "gato",
  "gloss": "cat"
}

JSON is pretty similar conceptually to XML — both kind of describe “trees” of data, but XML does it with tags (and thus looks a lot like HTML, which is used to create web pages). So the word object above might be represented as XML like this:

<word>
  <form>gato</form>
  <gloss>cat</gloss>
</word>

As you probably know, ELAN and Flex both output “dialects” of XML: .eaf for ELAN, and for Flex I forget the exact suffix, but I think the format is referred to as LIFT. (Toolbox, by the way, uses its own plain-text format, which is neither XML nor JSON.)
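Since .eaf is XML under the hood, you can already poke at it with ordinary XML tooling. Here’s a rough Python sketch that just lists the tiers in a file; the file name is a placeholder, and TIER / TIER_ID are, as far as I know, the element and attribute names ELAN uses:

# Rough sketch: list the tiers in an ELAN .eaf file, which is just XML.
# "session.eaf" is a placeholder file name.
import xml.etree.ElementTree as ET

root = ET.parse("session.eaf").getroot()
for tier in root.iter("TIER"):
    print(tier.get("TIER_ID"))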

Conceptually, though, JSON and XML are “saying” pretty much the same thing. They’re both saying something like “look, I have this thing called a word, and it has these two parts named form and gloss, and the values for those parts are gato and cat, respectively.”

You could picture that visually as a little tree: a word node with two branches, one labelled form (with the value gato) and one labelled gloss (with the value cat).

I like to think of something like this as a “data type”. Obviously a “word” is going to need far more information than this, but I would argue that for documentary linguists, at least, you can’t do without these two fields at the very minimum. A word must have a form and a gloss (or something very similar), but it may have whatever else you want (other translations, definitions, transliterations, etc.).
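Just to make that “data type” idea concrete, here’s a minimal sketch in Python (the class and field names are mine, not any standard): form and gloss are required, and everything else is optional.

# Minimal sketch of the "word" data type described above.
# form and gloss are required; anything else goes in the optional extras.
from dataclasses import dataclass, field

@dataclass
class Word:
    form: str
    gloss: str
    extras: dict = field(default_factory=dict)  # translations, transliterations, ...

print(Word(form="gato", gloss="cat"))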

So documentary data is going to get much more complicated than this.

But this is kind of abstract: the crucial question at the beginning of a fieldwork project like yours is “What am I planning to do with the data?”. You mention that you want to build a corpus from three sources: Jochelson, contemporary fieldwork, and potentially your own fieldwork. It seems to me that one primary goal will be to search everything together: “Does this old form that Jochelson found still occur?”, say. Ideally, that search would be unified, i.e., you wouldn’t have to use ELAN to search your fieldwork, some other tool to search Jochelson, and so on. Unified search assumes either that your canonical stored representation of the data is the same everywhere, or else that you have a “conversion path” where you can (hopefully automatically) get from all of your representations into one unified format.

So I feel like the answer to your question of what format you should use is going to depend on what your actual fieldwork plans are. I myself am trying to design tools that use JSON pretty much everywhere, but I also want to have converters from other formats: it would be silly of me to pretend that people would stop using ELAN all of a sudden. I have had some success importing simple content from ELAN, Flex, and Toolbox files, but those formats are pretty complicated and it’s an ongoing process.
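Toolbox’s format, for what it’s worth, is a plain-text “backslash-coded” one, and even that can be nudged toward the same word objects. A very rough sketch; the markers \lx and \ge here are placeholders (real projects choose their own markers), and this ignores repeated markers and multi-line fields:

# Very rough sketch: turn one backslash-coded record into a dict.
# \lx and \ge are placeholder markers; real Toolbox projects vary,
# and this ignores repeated markers and continuation lines.
sample_record = "\\lx gato\n\\ge cat"

def parse_record(text):
    record = {}
    for line in text.splitlines():
        if line.startswith("\\"):
            marker, _, value = line.partition(" ")
            record[marker.lstrip("\\")] = value.strip()
    return record

print(parse_record(sample_record))
# {'lx': 'gato', 'ge': 'cat'}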

So I guess what I’m asking is, can you give us some more details about the state of your starting data? Are you beginning with PDFs? Scans of fieldnotes? What would be your ideal workflow?

Thanks for telling us about your work! The beginning of a project is very exciting; it sounds like you have lots of great avenues for research.

2 Likes

I haven’t dealt much with corpus data, so feel free to take whatever seems useful from my ‘insights’. I did a bunch of typology projects and I’m now digitizing a lot of Mixtec data from other people and from published sources. Dealing with lots of sources, formats, etc. really is challenging! I’ve found that in the initial stages, when I’m not yet 100% clear about what the output should be and what I’ll use it for, the best option is a format that is both easy to use and easily convertible to other formats. For me that has largely been csv files, though they might not be that good for corpus data. That way you’re not wasting time learning something you might not end up using, and as soon as you know where you’re headed you can convert your data (with hopefully minimal adjustments). I’m sure @pathall can advise more on what that would mean in terms of corpus data.
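To give one example of what “easily convertible” can mean in practice: a csv of forms and glosses can be turned into the JSON objects from earlier in the thread with a few lines of Python. The column names here are just made up for illustration.

# Sketch: csv rows to JSON objects. Column names are made up.
import csv
import io
import json

csv_text = "form,gloss\ngato,cat\nperro,dog\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(json.dumps(rows, ensure_ascii=False, indent=2))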

3 Likes

This is a super useful conversation. Thanks for setting it up as a separate topic, @pathall.

Re data sources: The legacy materials are all in PDFs; some of them have been run through OCR and you can copy-paste text from them. It doesn’t always work for all the characters, and I haven’t dived into the materials in (pre-reform) Russian Cyrillic. The transliteration into the Roman script for some texts is super obscure to me. Most of the contemporary texts are appendices to grammars, and some material is online on shady websites, again with different transliteration conventions. Anything I do in my own fieldwork will most likely be in ELAN.

What I’m planning to do is run R for string analysis and stats. There is a list of areal features that people have identified for Siberia as a whole (from cases to clause-linking). The idea is to zero in on some of these features and see if there are differences in their distribution between the two time periods in each Yukaghir variety. For the features that show significant changes, one hypothesis would be contact. Since Tungusic, Turkic and Chukotko-Kamchatkan languages have relatives outside the area, the idea is to identify the direction of change. The legacy materials contain a lot of sociolinguistic information that is super useful for that. And I can also use my own sociolinguistic interviews (I have some interviews with elders, and I know what languages they were exposed to and in what contexts).

3 Likes

The thing that I like about XML (as opposed to csv, I guess) is that I can have coarse-grained annotation for now and build it up over time. Right now I have sentence tags without dealing with each word separately. I will eventually have a word-by-word gloss, and ideally at some point all the morphological structure too (since the languages are agglutinating), but I can do “some” of it with R string analysis for now.
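Something like this, roughly, with invented tag names and a toy example rather than real data: a sentence-level record now,

<sentence id="s1" translation="the cat sleeps">el gato duerme</sentence>

which can later be filled out with word-level (and eventually morpheme-level) structure:

<sentence id="s1" translation="the cat sleeps">
  <word><form>el</form><gloss>the</gloss></word>
  <word><form>gato</form><gloss>cat</gloss></word>
  <word><form>duerme</form><gloss>sleeps</gloss></word>
</sentence>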

2 Likes

Re OCR: There are Python packages that can do OCR and extract text into a table. I didn’t end up using them, because the OCR just didn’t work well with all the diacritics, and an informal study revealed that it was faster for my student assistant to just type things up than to do OCR, extract the text, and then hand-correct it. But if you have high-quality PDFs with few diacritics it might be worth it. Also note that not all OCR is created equal: the free options are not as good as the paid ones, unfortunately.
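In case it’s useful to anyone reading along, the kind of thing I mean looks roughly like this. It assumes the (free) Tesseract engine plus the pytesseract and pdf2image packages are installed, and the file name and language code are placeholders; in my case the diacritics still needed so much hand-correction that typing won out.

# Rough OCR sketch using free tools: requires the Tesseract engine,
# plus the pytesseract and pdf2image Python packages (pdf2image also
# needs poppler). File name and language code are placeholders.
from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path("legacy_text.pdf", dpi=300)
for number, page in enumerate(pages, start=1):
    text = pytesseract.image_to_string(page, lang="rus")
    print(f"--- page {number} ---")
    print(text)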

2 Likes