This is one of those questions that I feel like is kind of… well, I dunno, sorta rude.
People work hard on best practices for metadata management, and they work hard to develop standards and tools for creating, storing, and using metadata.
But I’ll just go ahead and ask, in the hope that things will be clarified one way or the other.
Why not put metadata in data files? At least, some of it — the crucial bits.
There is a semi-standard in data processing that goes like this:
- Decide on a file-naming scheme for your data files.
- Each text gets a unique identifier in that scheme, so `dothraki-documentation` could be the name of a project, and then bundles could be numbered, so you’d end up with `001-dothraki-documentation` as a bundle, and that could include `001-dothraki-documentation.eaf`, `001-dothraki-documentation.wav`, etc.
- Now create a spreadsheet, and put all your metadata in there, with each row keyed to the numbered files.
That spreadsheet will look more or less something like this:
file id | type | genre |
---|---|---|
001-dothraki-documentation.eaf | transcript | narrative |
001-dothraki-documentation.wav | recording | narrative |
002-dothraki-documentation.eaf | transcript | death-chant |
002-dothraki-documentation.wav | recording | death-chant |
003-dothraki-documentation.eaf | transcript | horse-praise |
003-dothraki-documentation.wav | recording | horse-praise |
Or whatever, obviously I’m grossly simplifying. I mean, maybe it doesn’t make sense to put document metadata and audio recording metadata into the same metadata file — maybe you need two tables, one for texts and another for recordings, maybe you do it by genre. Maybe you do it by bundle. Whatever the archive tells you, in the end, right?
But any such approach assumes that you’re going to be editing the spreadsheet. You have to keep the spreadsheet and all your files in sync, by hand.
Does that suck?
What if we just stuck the metadata in the file?
Let’s imagine we’re collecting our first Dothraki text, and behold, we survive our fieldwork session with, ya know, this guy:
And we get these four greetings:
{
"sentences": [
{
"translation": "Hello",
"transcription": "M'athchomaroon"
},
{
"translation": "Hi",
"transcription": "M'ath"
},
{
"translation": "Hi",
"transcription": "M'ach"
},
{
"translation": "Greetings to you all",
"transcription": "Athchomar chomakea"
}
]
}
Now, you could stick that in `001-dothraki-documentation.json`, and then go start a spreadsheet as above. But what about all the other zillions of things you’d like to stick in there that describe those four sentences?
Like, for instance, a title? Yeah, we have `dothraki-documentation` as the identifier, but that’s not a title. And we want a title. So we add a title… to the spreadsheet? Okay, voilà:
file id | type | genre | title |
---|---|---|---|
001-dothraki-documentation.eaf | transcript | narrative | Greetings |
001-dothraki-documentation.wav | recording | narrative | Greetings |
002-dothraki-documentation.eaf | transcript | death-chant | Scary things to say in battle |
002-dothraki-documentation.wav | recording | death-chant | Scary things to say in battle |
003-dothraki-documentation.eaf | transcript | horse-praise | Horses are great! |
003-dothraki-documentation.wav | recording | horse-praise | Horses are great! |
That’s… okay. But now say we decide we want tags too. And a date of recording. And a list of participants.
Tags and participants… wait, those aren’t just strings, those are going to be lists… Eh, we’ll just use a delimiter. How about a semicolon? Hooray! Now we can add tags and participants, right?
file-id | type | genre | title | tags | participants |
---|---|---|---|---|---|
001-dothraki-documentation.eaf | transcript | narrative | Greetings | general;phrases | Thirri;Halahhi |
001-dothraki-documentation.wav | recording | narrative | Greetings | general;phrases | Thirri;Halahhi |
002-dothraki-documentation.eaf | transcript | death-chant | Scary things to say in battle | yikes | Jasso |
002-dothraki-documentation.wav | recording | death-chant | Scary things to say in battle | yikes | Jasso |
003-dothraki-documentation.eaf | transcript | horse-praise | Horses are great! | neigh;animals | Chakko;Gezro;Vitihho |
003-dothraki-documentation.wav | recording | horse-praise | Horses are great! | neigh;animals | Chakko;Gezro;Vitihho |
Boy, that’s fancy. Except… why are we tagging the recording metadata row and the transcript metadata row? And is it really systematic to just pick a random delimiter for complex values?
It makes me… uneasy…
What if we used a format like XML or JSON that supports complex objects? Er, wait a minute, we stored our actual data in JSON, couldn’t we just store our spreadsheet in JSON too? Then those pesky text delimiters could be replaced by a proper array.
So we just like… make an array of objects instead of a spreadsheet. Same info, right?
[
{
"file-id": "001-dothraki-documentation.eaf",
"type": "transcript",
"genre": "narrative",
"title": "Greetings",
"tags": [
"general",
"phrases"
],
"participants": [
"Thirri",
"Halahhi"
]
},
{
"file-id": "001-dothraki-documentation.wav",
"type": "recording",
"genre": "narrative",
"title": "Greetings",
"tags": [
"general",
"phrases"
],
"participants": [
"Thirri",
"Halahhi"
]
},
{
"file-id": "002-dothraki-documentation.eaf",
"type": "transcript",
"genre": "death-chant",
"title": "Scary things to say in battle",
"tags": [
"yikes"
],
"participants": [
"Jasso"
]
},
{
"file-id": "002-dothraki-documentation.wav",
"type": "recording",
"genre": "death-chant",
"title": "Scary things to say in battle",
"tags": [
"yikes"
],
"participants": [
"Jasso"
]
},
{
"file-id": "003-dothraki-documentation.eaf",
"type": "transcript",
"genre": "horse-praise",
"title": "Horses are great!",
"tags": [
"neigh",
"animals"
],
"participants": [
"Chakko",
"Gezro",
"Vitihho"
]
},
{
"file-id": "003-dothraki-documentation.wav",
"type": "recording",
"genre": "horse-praise",
"title": "Horses are great!",
"tags": [
"neigh",
"animals"
],
"participants": [
"Chakko",
"Gezro",
"Vitihho"
]
}
]
This isn’t too bad. Now our `tags` and `participants` are arrays, which seems better.
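If the spreadsheet already exists, getting from delimited cells to real arrays is mechanical. Here is a minimal Python sketch, assuming the sheet was exported as CSV and that `tags` and `participants` are the only list-valued columns. The names `LIST_COLUMNS` and `rows_to_records` are made up for illustration.

```python
import csv
import io
import json

# Assumption: these are the only columns that hold semicolon-delimited lists.
LIST_COLUMNS = {"tags", "participants"}


def rows_to_records(csv_text, delimiter=";"):
    """Turn spreadsheet rows into dicts, splitting list columns into arrays."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = {}
        for column, value in row.items():
            if column in LIST_COLUMNS:
                # Drop empty items so a blank cell becomes an empty list.
                record[column] = [item for item in value.split(delimiter) if item]
            else:
                record[column] = value
        records.append(record)
    return records


# Two rows from the table above, written out as CSV.
sheet = """file-id,type,genre,title,tags,participants
001-dothraki-documentation.eaf,transcript,narrative,Greetings,general;phrases,Thirri;Halahhi
002-dothraki-documentation.eaf,transcript,death-chant,Scary things to say in battle,yikes,Jasso
"""

records = rows_to_records(sheet)
print(json.dumps(records, indent=2))
```

Note that this still breaks if a tag or a name ever contains a semicolon, which is exactly the kind of thing the array version stops you worrying about.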
But what happens if we later realize that in text 1 we left out Drogo?? What if he finds out?? This might end very, very badly. So we go into our spreadsheet, and lickety-split, we add Drogo, thusly:
[
{
"file-id": "001-dothraki-documentation.eaf",
"type": "transcript",
"genre": "narrative",
"title": "Greetings",
"tags": [
"general",
"phrases"
],
"participants": [
"Thirri",
"Halahhi",
"Drogo"
]
}
…
]
Phew, crisis averted!!
Good decision…
But wait a sec. Did we make that mistake in any other texts? D’oh. Better check. So we go through every text, and we compare the contents of each text to the list of participants over in the spreadsheet. Does the string `Drogo` appear anywhere? Update the spreadsheet…
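You could at least script that check. Below is a rough Python sketch, assuming the metadata array and the texts are both loaded into memory, and that the texts sit in a dict keyed by `file-id`. That keying, the function name `missing_participant`, and the sample sentence mentioning Drogo are all invented for illustration. A plain substring match like this will also miss spelling variants, so treat it as a first pass only.

```python
def missing_participant(name, metadata, texts):
    """Return file-ids whose text mentions `name` but whose metadata omits it."""
    flagged = []
    for record in metadata:
        text = texts.get(record["file-id"], {})
        # Look for the name in either field of every sentence.
        mentioned = any(
            name in sentence.get("transcription", "")
            or name in sentence.get("translation", "")
            for sentence in text.get("sentences", [])
        )
        if mentioned and name not in record["participants"]:
            flagged.append(record["file-id"])
    return flagged


metadata = [
    {"file-id": "001-dothraki-documentation.eaf", "participants": ["Thirri", "Halahhi"]},
    {"file-id": "002-dothraki-documentation.eaf", "participants": ["Jasso"]},
]

# Hypothetical contents: the first sentence is an invented placeholder, not real data.
texts = {
    "001-dothraki-documentation.eaf": {
        "sentences": [{"translation": "Drogo says hello"}]
    },
    "002-dothraki-documentation.eaf": {
        "sentences": [{"translation": "Hello", "transcription": "M'athchomaroon"}]
    },
}

flagged = missing_participant("Drogo", metadata, texts)
print(flagged)
```

Even scripted, this is still two places to keep in agreement: the text itself and the record that describes it.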
But hang on again.
If we’re using JSON in the metadata and we’re using JSON in the data of the transcript, wouldn’t it be easier to put the metadata which is relevant to a given text in that text? As a matter of fact, we could go ahead and put the time alignment stuff and the audio metadata in there too. Like this:
{
"metadata": {
"file-id": "001-dothraki-documentation",
"type": "transcript",
"genre": "narrative",
"title": "Greetings",
"tags": [
"general",
"phrases"
],
"participants": [
"Thirri",
"Halahhi",
"Drogo"
],
"media": [
{ "audio": "001-dothraki-documentation.wav" }
]
},
"sentences":[
{
"translation": "Hello",
"transcription": "M'athchomaroon",
"end": 2,
"start": 0
},
{
"translation": "Hi",
"transcription": "M'ath",
"end": 4,
"start": 2
},
{
"translation": "Hi",
"transcription": "M'ach",
"end": 6,
"start": 4
},
{
"translation": "Greetings to you all",
"transcription": "Athchomar chomakea",
"end": 8,
"start": 6
}
]
}
Now we’ve got timestamps and some useful metadata in a single place. We could write a script to slurp that metadata out of all our texts and generate a spreadsheet, that would be pretty easy. But we don’t have to do that synchronizing-by-hand business.
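To give a sense of how small that slurping script could be, here is a Python sketch. It assumes every text has a top-level `metadata` object like the one above, and it joins arrays back into semicolon-delimited cells for the spreadsheet. `COLUMNS`, `metadata_row`, and `build_sheet` are hypothetical names, and the text is built in memory here rather than read from disk.

```python
import csv
import io

# Assumption: these are the metadata keys we want as spreadsheet columns.
COLUMNS = ["file-id", "type", "genre", "title", "tags", "participants"]


def metadata_row(text):
    """Flatten one text's embedded metadata into a spreadsheet row."""
    metadata = text["metadata"]
    row = {}
    for column in COLUMNS:
        value = metadata.get(column, "")
        # Arrays become semicolon-delimited strings; plain values pass through.
        row[column] = ";".join(value) if isinstance(value, list) else value
    return row


def build_sheet(texts):
    """Write one CSV row per text and return the whole sheet as a string."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for text in texts:
        writer.writerow(metadata_row(text))
    return out.getvalue()


# The text from above, trimmed to one sentence to keep the example short.
greetings = {
    "metadata": {
        "file-id": "001-dothraki-documentation",
        "type": "transcript",
        "genre": "narrative",
        "title": "Greetings",
        "tags": ["general", "phrases"],
        "participants": ["Thirri", "Halahhi", "Drogo"],
        "media": [{"audio": "001-dothraki-documentation.wav"}],
    },
    "sentences": [
        {"translation": "Hello", "transcription": "M'athchomaroon", "end": 2, "start": 0}
    ],
}

sheet = build_sheet([greetings])
print(sheet)
```

In practice you would load each text with `json.load` from wherever the files live and pass the list to `build_sheet`. The spreadsheet then becomes a generated view of the texts, not a second copy to maintain.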
So there’s my hopefully-not-too-rude question.