The organization of fieldwork files

Greetings,

I am doing some reflecting on my fieldwork in Mexico and Nigeria. I am also reading Holmes 1964, where he strongly advocates five levels of organization within a hierarchical system. His context was corporate papers in large archives and preservation institutions. I’m trying to see whether there is any wisdom here if it were recontextualized, or whether this is just dogma. I’m also curious about the habits of others. When you create files as part of your fieldwork, how do you organize them? (This is intentionally broad, since you might have different systems of organization for different contexts and tasks. By the way, what are those contexts and tasks?)

I am wondering what works for those of you who do digital fieldwork… How many levels of folders do you use? What motivates you to create a new folder rather than lump things together? How can you describe your system of organization?

Holmes, Oliver W. 1964. “Archival Arrangement—Five Different Operations at Five Different Levels.” The American Archivist 27 (1). Society of American Archivists: 21–42. Available on JSTOR.

I use a fairly flat structure: transcriptions, audio/video files, secondary analysis (e.g. toolbox data files), admin (receipts, grant materials), and publications. Within the publications, each publication will get its own folder, and occasionally there’ll be another level in other folders (e.g. by fieldworker). Other products might get their own primary folder, particularly if they’re complicated (e.g. a dictionary).

I’ve worked with a structure like this (for instance, on a working archive of Loma; this is simplified):

  • language/
    • loma-language.json // contains metadata about names, language codes, phonetic inventory, and orthography
  • grammar/
    • loma-grammar.json // this contains an index of grammatical categories: metadata again, then an array of objects containing abbreviations, values, and category names (like {"abbreviation": "ACC", "value": "accusative", "category": "case" } )
  • lexicon/
    • loma-lexicon.json // an object containing { "metadata": {…}, "words": [ {form: "", gloss: ""} ] } These are the unique words across the corpus. (And there is a lot more than form and gloss in each word, depending on typology, existing data, etc.)
  • corpus/ // this is the big one. It contains a directory per text, with a “slug” name that is used in ids and such and also for the text’s directory:
    • story_1/ (for instance)
      • story_1.json // a text. This is an object: { metadata: { title, etc. }, sentences: [ { transcription: "", translation: "", words: [ { form: "", gloss: "" }, … ] } ] }
    • …more text directories here.

The text directories work like what are sometimes called bundles in archives: any associated assets that belong with a text are in that directory, and the text JSON object points to those assets via relative (or if necessary, absolute) URLs.
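To make the bundle idea concrete, here is a minimal sketch of how a script might resolve a text's asset links. It assumes a `metadata.links` array whose entries carry either a `file` (relative to the text's directory) or a `url` key, as in the sample text posted later in this thread. The function name `resolve_links` and the example.org URL are made up for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def resolve_links(text: dict, text_dir: PurePosixPath) -> list[str]:
    """Return each linked asset as an absolute URL or a path inside text_dir."""
    resolved = []
    for link in text.get("metadata", {}).get("links", []):
        target = link.get("url") or link.get("file")
        if target is None:
            continue  # e.g. timestamp links carry no file or URL
        if urlparse(target).scheme in ("http", "https"):
            resolved.append(target)  # absolute URL: leave untouched
        else:
            resolved.append(str(text_dir / target))  # relative: stays in the bundle
    return resolved


# Hypothetical text object; only the fields needed here are filled in.
text = {
    "metadata": {
        "title": "Story 1",
        "links": [
            {"type": "audio", "file": "story_1.wav"},
            {"type": "notes", "url": "https://example.org/notes/story_1.txt"},
        ],
    }
}
print(resolve_links(text, PurePosixPath("corpus/story_1")))
```

Because relative links are resolved against the text's own directory, the whole bundle can be moved or copied without editing the JSON.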

The layout maps onto the Boasian Trilogy by design. I prioritize transparency: a linguist with an understanding of basic linguistics should be able to have an understanding of the kind of data stored in this way, even if they aren’t programmers.

I have a weird obsession with JSON files, because I feel like they are very self-explanatory, and thus have a fighting chance of being understood by someone in the future if they are somehow separated from their proper context. So my approach is to store data in this format, and then to use web components to display that content and make it interactive.
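One payoff of plain JSON is that derived files take very little code. As a rough sketch (not necessarily how the Loma lexicon is actually built), this collects the unique form/gloss pairs across a list of text objects shaped like the ones described above. The name `build_lexicon` and the toy data are assumptions for illustration.

```python
def build_lexicon(texts: list[dict]) -> list[dict]:
    """Collect unique (form, gloss) pairs across texts, in order of first appearance."""
    seen = set()
    words = []
    for text in texts:
        for sentence in text.get("sentences", []):
            for word in sentence.get("words", []):
                key = (word["form"], word["gloss"])
                if key not in seen:
                    seen.add(key)
                    words.append({"form": word["form"], "gloss": word["gloss"]})
    return words


# Toy corpus: two sentences that share one word.
toy_texts = [
    {
        "metadata": {"title": "toy text"},
        "sentences": [
            {"words": [{"form": "akó", "gloss": "1S.ABS"}, {"form": "si", "gloss": "PERS"}]},
            {"words": [{"form": "na", "gloss": "already"}, {"form": "akó", "gloss": "1S.ABS"}]},
        ],
    }
]
lexicon = {"metadata": {"source": "derived from corpus"}, "words": build_lexicon(toy_texts)}
print(len(lexicon["words"]))
```

Keying on the form/gloss pair keeps homophones with different glosses apart; a real lexicon would likely need a richer notion of identity.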

Any chance of a $ tree output for an example, @pathall or @cbowern?

Tree usage on linux or mac

what I wrote is equivalent to the output of tree
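For anyone without `tree` installed, here is a small Python sketch that prints something similar. It recreates the simplified Loma layout from the post above as empty files in a temporary directory; the helper name `render_tree` is made up, and the real `tree` has many options this ignores.

```python
import tempfile
from pathlib import Path


def render_tree(root: Path, prefix: str = "") -> list[str]:
    """Return lines resembling `tree` output for everything under root."""
    lines = []
    entries = sorted(root.iterdir(), key=lambda p: p.name)
    for index, entry in enumerate(entries):
        last = index == len(entries) - 1
        lines.append(prefix + ("└── " if last else "├── ") + entry.name)
        if entry.is_dir():
            # Children of the last entry get blank padding instead of a rail.
            lines.extend(render_tree(entry, prefix + ("    " if last else "│   ")))
    return lines


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for relative in [
        "language/loma-language.json",
        "grammar/loma-grammar.json",
        "lexicon/loma-lexicon.json",
        "corpus/story_1/story_1.json",
    ]:
        path = root / relative
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()  # empty placeholder files, just to show the shape
    tree_lines = render_tree(root)

print("\n".join(tree_lines))
```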

@pathall I see, but what about .wav or .eaf files… maybe your work doesn’t use these?

.wav files would either go alongside the corresponding text (corpus/story_1/story_1.wav) or would be linked to on the web via a URL.

As for .eaf files, I treat them as formats that need to be normalized to json (same for flex files). I usually would put them in, e.g., corpus/story_1/originals/, and then generate JSON text format files from those. The json files become canonical in my approach.
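As a hedged illustration of that normalization step, the sketch below reads time-aligned annotations from a tiny hand-written .eaf fragment and emits sentence objects in the JSON shape used in this thread. It assumes a tier whose `TIER_ID` is `transcription` (tier names vary by project), handles only time-aligned annotations, and ignores everything else in a real ELAN file. `eaf_to_sentences` is a made-up name, not part of any existing tool.

```python
import json
import xml.etree.ElementTree as ET

# Minimal hand-written fragment; real .eaf files have a header and more tiers.
EAF_SAMPLE = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="8580"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="9650"/>
  </TIME_ORDER>
  <TIER TIER_ID="transcription">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>Hello, akó si Juan Lee.</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def eaf_to_sentences(eaf_xml: str, tier_id: str = "transcription") -> list[dict]:
    """Convert one tier of time-aligned annotations into sentence objects."""
    root = ET.fromstring(eaf_xml)
    # Time slots hold millisecond offsets, keyed by slot id.
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
        if slot.get("TIME_VALUE") is not None
    }
    sentences = []
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots[ann.get("TIME_SLOT_REF1")] / 1000  # seconds
            end = slots[ann.get("TIME_SLOT_REF2")] / 1000
            sentences.append({
                "transcription": ann.findtext("ANNOTATION_VALUE", default=""),
                "translation": "",
                "words": [],
                "metadata": {"links": [{"type": "timestamp", "start": start, "end": end}]},
            })
    return sentences


sentences = eaf_to_sentences(EAF_SAMPLE)
print(json.dumps(sentences, ensure_ascii=False, indent=2))
```

The original .eaf stays untouched in originals/; only the generated JSON is edited afterwards.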

For mine, I tend to keep the audio and transcripts in separate directories, a legacy from the times I had to back stuff up on dialup (the transcripts-only directory was far smaller than the audio files, which got backed up to external media or burned to CD).

I organize my files first by language and then in individual folders for each session, where each folder has an ID like (SE_PN001), which means Spoken Ende Prompted Narrative #1. Each folder contains the raw .mp4, .wav, .txt, or .pdf files, then the secondary .eaf, .flextext, .txt, or .textgrid files, along with a metadata file. All the metadata is organized in a global spreadsheet, which keeps track of which ones are transcribed, translated, and interlinearized.

Interesting. Question about your metadata workflow: you mention a metadata file and a global spreadsheet — are those tracking different metadata?

@pathall When you say “FLEx files” are you referring to FLExtext files or are you referring to LIFT files which are lexicon files or are you talking about the .fwdata file which is the raw XML database that FLEx generates and is more descriptive than LIFT or FLExtext?

The folks that I work with use .flextext files as the default export format, so that’s what we’ve been parsing.

Like @fmatter I am not clued in on the whole fwdata business.

LIFT is an XML interchange format designed to be used to communicate between “simple” apps (like WeSay) and FLEx.

Lift Tools is a suite of tools to clean up data input inconsistencies.

In contrast, FLEx Bridge is a full database sync tool that syncs the full content of FLEx between machines and users. But unlike git, FLEx’s internal data structure does not have a user model; that is, it is not possible to sort a database by the edits or contributions of a single user in a multi-user project.

People working with FLEx databases may want to be aware of FLEx tools:

And this list of publishing and third party tools.

There is a bit of an old commentary here: Software Needs for a Language Documentation Project | The Journeyler

The metadata in each folder is a copy/subset of what is available in the full spreadsheet, but is there as a failsafe in case the items get separated from the spreadsheet.
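A failsafe copy like this could be generated rather than maintained by hand. The sketch below is one possible way, not a description of the actual workflow: it assumes the global spreadsheet is exported as CSV, and the column names, the `ende` directory, and the `metadata.json` file name are all invented for illustration.

```python
import csv
import io
import json
import tempfile
from pathlib import Path

# Hypothetical export of the global spreadsheet; column names are illustrative.
SPREADSHEET_CSV = """session_id,genre,transcribed,translated,interlinearized
SE_PN001,prompted narrative,yes,yes,no
SE_PN002,prompted narrative,yes,no,no
"""


def write_session_metadata(csv_text: str, language_dir: Path) -> list[Path]:
    """Write each spreadsheet row into its own session folder as metadata.json."""
    written = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        session_dir = language_dir / row["session_id"]
        session_dir.mkdir(parents=True, exist_ok=True)
        target = session_dir / "metadata.json"
        target.write_text(json.dumps(row, indent=2, ensure_ascii=False), encoding="utf-8")
        written.append(target)
    return written


with tempfile.TemporaryDirectory() as tmp:
    paths = write_session_metadata(SPREADSHEET_CSV, Path(tmp) / "ende")
    first = json.loads(paths[0].read_text(encoding="utf-8"))
    print(paths[0].parent.name, first["transcribed"])
```

Treating the spreadsheet as the source and regenerating the per-folder files keeps the two from drifting apart.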

Makes sense to me. I have been experimenting with a somewhat more extreme approach: put the metadata in the data. So for a text I’d do:

{
  "metadata": {
    "title": "Education in Jaro",
    "language": "Hiligaynon",
    "source": "https://www.youtube.com/watch?v=cUqMWG4QJMk",
    "media": "education_in_jaro.webm",
    "fileName": "education_in_jaro-text.json",
    "lastModified": "2019-07-17T16:01:06.046Z",
    "notes": [
      "transcribed with Joshua De Leon as part of the 2014 Fieldmethods class at UCSB, instructor Marianne Mithun.",
      "original YouTube title ‘School Memories’"
    ],
    "speakers": [
      "Juan Lee"
    ],
    "linguists": [
      "Patrick Hall",
      "Joshua De Leon"
    ],
    "links": [
      {
        "type": "audio",
        "file": "education_in_jaro.wav"
      },
      {
        "type": "notes",
        "url": "http://localhost/Languages/hiligaynon/ucsb-fieldmethods/Notes/hil111_2013-02-25_JDL_PH_IlonggoBoyEducationInJaro.notes.txt"
      }
    ]
  },
  "sentences": [
    {
      "transcription": "Hello, akó si Juan Lee.",
      "translation": "Hello, I’m Juan Lee.",
      "words": [
        {
          "form": "hello",
          "gloss": "hello",
          "lang": "en"
        },
        {
          "form": "akó",
          "gloss": "1S.ABS"
        },
        {
          "form": "si",
          "gloss": "PERS"
        },
        {
          "form": "Juan",
          "gloss": "Juan",
          "tags": [
            "name"
          ]
        },
        {
          "form": "Lee",
          "gloss": "Lee",
          "tags": [
            "name"
          ]
        }
      ],
      "tags": [],
      "metadata": {
        "links": [
          {
            "type": "timestamp",
            "start": 8.58,
            "end": 9.65
          }
        ]
      },
      "note": ""
    },
    {
      "transcription": "matopic na akó subóng",
      "translation": "I’ll start the topic now",
      "words": [
        {
          "form": "ma-topic",
          "gloss": "IRR-topic",
          "tags": [
            "english"
          ],
          "metadata": {
            "wordClass": "verb"
          }
        },
        {
          "form": "na",
          "gloss": "already"
        },
        {
          "form": "akó",
          "gloss": "1S.ABS"
        },
        {
          "form": "subóng",
          "gloss": "now"
        }
      ],
      "tags": [],
      "metadata": {
        "links": [
          {
            "type": "timestamp",
            "start": 9.704,
            "end": 10.714
          }
        ]
      },
      "note": "As if he's going to start a new topic; Of ma-, J says: “Most of the time you use it before an action verb: the process of changing the topic is itself a verb: the process of changing the"
    },
    {
      "transcription": "parte sa mga eskwélahan.",
      "translation": "about the schools",
      "words": [
        {
          "form": "parte",
          "gloss": "part",
          "tags": [
            "spanish"
          ]
        },
        {
          "form": "sa",
          "gloss": "to"
        },
        {
          "form": "mga",
          "gloss": "PL"
        },
        {
          "form": "eskwélahan",
          "gloss": "school",
          "tags": [
            "spanish",
            "spanish:escuela"
          ]
        }
      ],
      "tags": [],
      "metadata": {
        "links": [
          {
            "type": "timestamp",
            "start": 10.753,
            "end": 11.818
          }
        ]
      },
      "note": ""
    }
  ]
}

I have found that keeping metadata in the data file like that prevents separation pretty effectively. It does have the problem that there is no (default) UI enforcing the same fields across objects — even a spreadsheet interface does that pretty effectively. So doing this in practice at scale would require either some discipline or else a custom interface.
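Short of a custom interface, a small check script can supply some of that discipline. This is only a sketch: the list of required fields is my assumption about what counts as crucial, and `missing_fields` and `untitled-text.json` are made-up names.

```python
# Hypothetical choice of fields every text's metadata should carry.
REQUIRED_FIELDS = ("title", "language", "speakers")


def missing_fields(text: dict, required=REQUIRED_FIELDS) -> list[str]:
    """List required metadata fields that are absent or empty in a text object."""
    metadata = text.get("metadata", {})
    return [field for field in required if not metadata.get(field)]


# Two in-memory text objects standing in for files read from corpus/.
texts = {
    "education_in_jaro-text.json": {
        "metadata": {
            "title": "Education in Jaro",
            "language": "Hiligaynon",
            "speakers": ["Juan Lee"],
        }
    },
    "untitled-text.json": {
        "metadata": {"language": "Hiligaynon", "speakers": []}
    },
}

report = {name: missing_fields(text) for name, text in texts.items()}
for name, missing in report.items():
    print(name, "OK" if not missing else "missing: " + ", ".join(missing))
```

Extra fields are left alone, so files can still carry whatever additional metadata is useful.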

To be honest, I don’t even worry about consistency from one file to the next. The crucial fields (language, speaker, title, etc) are always going to get included, and everything else is useful or at least informative down the road.
