Why do we store data and metadata in different, separate formats?

pathall · July 19, 2020, 2:04am

This is one of those questions that I feel like is kind of… well, I dunno, sorta rude.

People work hard on best practices for metadata management, and they work hard to develop standards and tools for creating, storing, and using metadata.

But I’ll just go ahead and ask, in the hope that things will be clarified one way or the other.

Why not put metadata in data files? At least, some of it — the crucial bits.

There is a semi-standard in data processing that goes like this:

Decide on a file-naming scheme for your data files.
Each text gets a unique identifier in that scheme, so dothraki-documentation could be the name of a project, and then bundles could be numbered, so you’d end up with 001-dothraki-documentation as a bundle, and that could include 001-dothraki-documentation.eaf, 001-dothraki-documentation.wav, etc.
Now create a spreadsheet, and put all your metadata in there, with each row keyed to the numbered files.

That spreadsheet will look more or less something like this:

file-id	type	genre
001-dothraki-documentation.eaf	transcript	narrative
001-dothraki-documentation.wav	recording	narrative
002-dothraki-documentation.eaf	transcript	death-chant
002-dothraki-documentation.wav	recording	death-chant
003-dothraki-documentation.eaf	transcript	horse-praise
003-dothraki-documentation.wav	recording	horse-praise

Or whatever, obviously I’m grossly simplifying. I mean, maybe it doesn’t make sense to put document metadata and audio recording metadata into the same metadata file — maybe you need two tables, one for texts and another for recordings, maybe you do it by genre. Maybe you do it by bundle. Whatever the archive tells you, in the end, right?

But any such approach assumes that you’re going to be editing the spreadsheet. You have to keep the spreadsheet and all your files in sync, by hand.

Does that suck?

What if we just stuck the metadata in the file?

Let’s imagine we’re collecting our first Dothraki text, and behold, we survive our fieldwork session with, ya know, this guy:

And we get these four greetings:

{
  "sentences": [
    {
      "translation": "Hello",
      "transcription": "M'athchomaroon"
    },
    {
      "translation": "Hi",
      "transcription": "M'ath"
    },
    {
      "translation": "Hi",
      "transcription": "M'ach"
    },
    {
      "translation": "Greetings to you all",
      "transcription": "Athchomar chomakea"
    }
  ]
}

Now, you could stick that in 001-dothraki-documentation.json, and then go start a spreadsheet as above. But, what about all the other zillions of things you’d like to stick in there that describe those four sentences?

Like, for instance, a title? Yeah, we have dothraki-documentation as the identifier, but that’s not a title. And we want a title. So we add a title… to the spreadsheet? Okay, voilà:

file-id	type	genre	title
001-dothraki-documentation.eaf	transcript	narrative	Greetings
001-dothraki-documentation.wav	recording	narrative	Greetings
002-dothraki-documentation.eaf	transcript	death-chant	Scary things to say in battle
002-dothraki-documentation.wav	recording	death-chant	Scary things to say in battle
003-dothraki-documentation.eaf	transcript	horse-praise	Horses are great!
003-dothraki-documentation.wav	recording	horse-praise	Horses are great!

That’s… okay. But now say we decide we want tags too. And a date of recording. And a list of participants.

Tags and participants… wait, those aren’t just strings, those are going to be lists… Eh, we’ll just use a delimiter. How about a semicolon? Hooray! Now we can add tags and participants, right?

file-id	type	genre	title	tags	participants
001-dothraki-documentation.eaf	transcript	narrative	Greetings	general;phrases	Thirri;Halahhi
001-dothraki-documentation.wav	recording	narrative	Greetings	general;phrases	Thirri;Halahhi
002-dothraki-documentation.eaf	transcript	death-chant	Scary things to say in battle	yikes	Jasso
002-dothraki-documentation.wav	recording	death-chant	Scary things to say in battle	yikes	Jasso
003-dothraki-documentation.eaf	transcript	horse-praise	Horses are great!	neigh;animals	Chakko;Gezro;Vitihho
003-dothraki-documentation.wav	recording	horse-praise	Horses are great!	neigh;animals	Chakko;Gezro;Vitihho

Boy, that’s fancy. Except… why are we tagging the recording metadata row and the transcript metadata row? And is it really systematic to just pick a random delimiter for complex values?

It makes me… uneasy…
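Here is a tiny sketch, in Python, of where that unease comes from. It assumes list cells are written by joining on a semicolon and read back by splitting on it; the helper names and the tag "call;response" are made up for illustration and are not from the project above.

```python
def join_cell(values, delimiter=";"):
    """Pack a list of strings into one spreadsheet cell."""
    return delimiter.join(values)


def split_cell(cell, delimiter=";"):
    """Unpack a spreadsheet cell back into a list of strings."""
    return cell.split(delimiter)


# Hypothetical tag that happens to contain the delimiter itself.
tags = ["call;response", "phrases"]

cell = join_cell(tags)
recovered = split_cell(cell)

print(cell)               # call;response;phrases
print(recovered)          # ['call', 'response', 'phrases']
print(recovered == tags)  # False: two tags went in, three came out
```

Any delimiter has this problem unless you also invent an escaping rule, and then everybody editing the spreadsheet has to remember it.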

What if we used a format like XML or JSON that supports complex objects? Er, wait a minute, we stored our actual data in JSON, couldn’t we just store our spreadsheet in JSON too? Then those pesky text delimiters could be replaced by a proper array.

So we just like… make an array of objects instead of a spreadsheet. Same info, right?

[
  {
    "file-id": "001-dothraki-documentation.eaf",
    "type": "transcript",
    "genre": "narrative",
    "title": "Greetings",
    "tags": [
      "general",
      "phrases"
    ],
    "participants": [
      "Thirri",
      "Halahhi"
    ]
  },
  {
    "file-id": "001-dothraki-documentation.wav",
    "type": "recording",
    "genre": "narrative",
    "title": "Greetings",
    "tags": [
      "general",
      "phrases"
    ],
    "participants": [
      "Thirri",
      "Halahhi"
    ]
  },
  {
    "file-id": "002-dothraki-documentation.eaf",
    "type": "transcript",
    "genre": "death-chant",
    "title": "Scary things to say in battle",
    "tags": [
      "yikes"
    ],
    "participants": [
      "Jasso"
    ]
  },
  {
    "file-id": "002-dothraki-documentation.wav",
    "type": "recording",
    "genre": "death-chant",
    "title": "Scary things to say in battle",
    "tags": [
      "yikes"
    ],
    "participants": [
      "Jasso"
    ]
  },
  {
    "file-id": "003-dothraki-documentation.eaf",
    "type": "transcript",
    "genre": "horse-praise",
    "title": "Horses are great!",
    "tags": [
      "neigh",
      "animals"
    ],
    "participants": [
      "Chakko",
      "Gezro",
      "Vitihho"
    ]
  },
  {
    "file-id": "003-dothraki-documentation.wav",
    "type": "recording",
    "genre": "horse-praise",
    "title": "Horses are great!",
    "tags": [
      "neigh",
      "animals"
    ],
    "participants": [
      "Chakko",
      "Gezro",
      "Vitihho"
    ]
  }
]

This isn’t too bad. Now our tags and participants are arrays, that seems better.
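You wouldn’t have to retype anything to get there, by the way. A minimal sketch in Python, assuming the spreadsheet is exported as tab-separated text and that tags and participants are the only list-valued columns (the names rows_to_records and LIST_COLUMNS are just made up for this example):

```python
import csv
import io
import json

# Assumption: these are the only columns holding semicolon-delimited lists.
LIST_COLUMNS = {"tags", "participants"}


def rows_to_records(tsv_text, delimiter=";"):
    """Turn tab-separated metadata rows into a list of dicts, splitting list cells."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    records = []
    for row in reader:
        record = {}
        for column, cell in row.items():
            if column in LIST_COLUMNS:
                # An empty cell becomes an empty list rather than [""].
                record[column] = [value for value in cell.split(delimiter) if value]
            else:
                record[column] = cell
        records.append(record)
    return records


# Two rows copied from the spreadsheet above.
SPREADSHEET = (
    "file-id\ttype\tgenre\ttitle\ttags\tparticipants\n"
    "001-dothraki-documentation.eaf\ttranscript\tnarrative\tGreetings"
    "\tgeneral;phrases\tThirri;Halahhi\n"
    "002-dothraki-documentation.eaf\ttranscript\tdeath-chant"
    "\tScary things to say in battle\tyikes\tJasso\n"
)

records = rows_to_records(SPREADSHEET)
print(json.dumps(records, indent=2))
```

This only works as long as no tag or name contains a semicolon, which is exactly the weakness of the delimiter approach.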

But what happens if we later realize that in text 1 we left out Drogo?? What if he finds out?? This might end very very badly. So we go into our spreadsheet, and lickety split, we add Drogo, thusly:

[
  {
    "file-id": "001-dothraki-documentation.eaf",
    "type": "transcript",
    "genre": "narrative",
    "title": "Greetings",
    "tags": [
      "general",
      "phrases"
    ],
    "participants": [
      "Thirri",
      "Halahhi",
      "Drogo"
    ]
  }
  …
]

Phew, crisis averted!!

Good decision…

But wait a sec. Did we make that mistake in any other texts? D’oh. Better check. So we go through every text, and we compare the contents of each text to the list of participants over in the spreadsheet. Does the string Drogo appear anywhere? Update the spreadsheet…
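That check could be scripted, sort of. Below is a rough sketch in Python that does the crude version: a plain string search through each text, compared against the participants list in the separate metadata. It assumes the texts are loaded into a dict keyed by the same file-id the metadata uses; the function names are invented for this example, and the sample texts are placeholders, not real transcriptions.

```python
import json


def texts_mentioning(texts, name):
    """Return the ids of texts whose serialized content contains `name` anywhere."""
    return sorted(
        text_id
        for text_id, text in texts.items()
        if name in json.dumps(text, ensure_ascii=False)
    )


def unlisted_mentions(records, texts, name):
    """Return file-ids whose text contains `name` but whose participants omit it."""
    mentioned = set(texts_mentioning(texts, name))
    return [
        record["file-id"]
        for record in records
        if record["file-id"] in mentioned and name not in record["participants"]
    ]


records = [
    {"file-id": "001-dothraki-documentation.eaf", "participants": ["Thirri", "Halahhi"]},
    {"file-id": "002-dothraki-documentation.eaf", "participants": ["Jasso"]},
]

# Placeholder content only, to show the shape of the check.
texts = {
    "001-dothraki-documentation.eaf": {
        "sentences": [{"translation": "(placeholder) Drogo arrives", "transcription": "(placeholder)"}]
    },
    "002-dothraki-documentation.eaf": {
        "sentences": [{"translation": "(placeholder)", "transcription": "(placeholder)"}]
    },
}

print(unlisted_mentions(records, texts, "Drogo"))
# ['001-dothraki-documentation.eaf']
```

Even with a script, the metadata and the texts still live in two places, so somebody has to remember to run it.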

But hang on again.

If we’re using JSON in the metadata and we’re using JSON in the data of the transcript, wouldn’t it be easier to put the metadata which is relevant to a given text in that text? As a matter of fact, we could go ahead and put the time alignment stuff and the audio metadata in there too. Like this:

{
  "metadata": {
    "file-id": "001-dothraki-documentation",
    "type": "transcript",
    "genre": "narrative",
    "title": "Greetings",
    "tags": [
      "general",
      "phrases"
    ],
    "participants": [
      "Thirri",
      "Halahhi",
      "Drogo"
    ],
    "media": [
       { "audio": "001-dothraki-documentation.wav" }
    ]
  },
  "sentences":[
  {
    "translation": "Hello",
    "transcription": "M'athchomaroon",
    "end": 2,
    "start": 0
  },
  {
    "translation": "Hi",
    "transcription": "M'ath",
    "end": 4,
    "start": 2
  },
  {
    "translation": "Hi",
    "transcription": "M'ach",
    "end": 6,
    "start": 4
  },
  {
    "translation": "Greetings to you all",
    "transcription": "Athchomar chomakea",
    "end": 8,
    "start": 6
  }
]
}

Now we’ve got timestamps and some useful metadata in a single place. We could write a script to slurp that metadata out of all our texts and generate a spreadsheet; that would be pretty easy. And we wouldn’t have to do that synchronizing-by-hand business.

So there’s my hopefully-not-too rude question.

Sandra · July 20, 2020, 7:25am

My answer (forgive me if it’s too obvious):
I work in a team, not alone. The team is mostly 2 core members plus a collaborator in the field site and student assistants. Not all people need to work on both metadata and data files, so jamming them together could be overwhelming for some collaborators, and would actually make it harder for me to coordinate and keep track of everybody’s work.
Ultimately, the metadata format is decided by the archive/funding body, not by the researcher. Quite often they change formats every few years or even every year. Correct me if I’m wrong, but this seems easier to handle if you store metadata apart from data.

pathall · August 26, 2022, 6:37pm

Golly, I just noticed your response 2 years later!! (Have we been here that long??)

Belated thanks for both of these interesting responses. Certainly the summary metadata documents are useful, but it seems to me that it would be much easier to have a system where you’ve got (say) a directory hierarchy like:

data/
  session-001/ # or whatever
    001-text.json # a transcription, which links to…
    001.wav # audio…
  session-002/ # or whatever
    002-text.json # a transcription, which links to…
    002.mp4 # video…
  session-nnn/ # more stuff…

Then alongside that would be a bit of code that:

goes into all the subdirectories of data/
finds all the JSON files
grabs the object labeled metadata
creates an array out of all those objects
generates a table as .csv or .html or whatever that’s easy to read
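
Those steps could look something like the following in Python. This is only a sketch under a few assumptions: every text is a JSON object with a top-level "metadata" key, as in the Greetings example earlier in the thread; lists of strings get squashed into one cell with a semicolon; and anything more nested (like the media list) is just dumped into the cell as JSON. The function names are my own, not part of any existing tool.

```python
import csv
import json
from pathlib import Path


def collect_metadata(data_dir):
    """Gather the "metadata" object from every JSON file under data_dir."""
    records = []
    for path in sorted(Path(data_dir).rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            document = json.load(handle)
        # Skip JSON files that are not objects or carry no metadata object.
        if isinstance(document, dict) and isinstance(document.get("metadata"), dict):
            records.append(document["metadata"])
    return records


def flatten(value, delimiter=";"):
    """Make one metadata value fit into a single table cell."""
    if isinstance(value, list) and all(isinstance(item, str) for item in value):
        return delimiter.join(value)
    if isinstance(value, (list, dict)):
        return json.dumps(value, ensure_ascii=False)
    return str(value)


def write_table(records, csv_path):
    """Write the collected metadata objects out as a .csv table."""
    # Use the union of all keys, in first-seen order, as the column headers.
    columns = list(dict.fromkeys(key for record in records for key in record))
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns)
        writer.writeheader()
        for record in records:
            writer.writerow({key: flatten(value) for key, value in record.items()})
```

Run from the directory that contains data/, a call like write_table(collect_metadata("data"), "metadata.csv") would regenerate the table. The table is then a throwaway view: nobody edits it by hand, and the JSON files stay the source of truth.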

In this way everyone on the project is just concentrating on the metadata right next to their transcription content, and people aren’t competing to edit an .xls on Google Sheets or something.

Obviously, there would need to be some plumbing to keep such a system going, but I have been treating metadata this way for a long time (in my own, gulp, mostly unpublished stuff) and I find that it makes old work, even my own old work, much easier to understand and use later.

cbowern · September 1, 2022, 3:46pm

It might be in part because some of the tools we are used to dealing with (historically) don’t make it easy to store metadata in the file. Sometimes I think we still too often think of language documentation as working with analog tapes and written transcripts, except now it’s digital, but we still treat the objects in the same way as we treated analog field materials.

007v · September 1, 2022, 10:24pm

i think it makes sense to include metadata in the data (e.g. as a header). apart from the ease of generating the spreadsheet of metadata, having metadata in the data is also helpful for understanding the data when one revisits the project (like after two years?). transcripts and audio files that state the language, participants, genre, etc. at the beginning are more accessible than the ones that don’t. with the ones that don’t, we’ll have to wonder (or do a bit of detective work to figure out) what the language/genre/etc. is, or pray that we find the relevant metadata spreadsheet.