Technical Metadata for text objects

Has anyone used this metadata standard for describing text layouts?

Technical Metadata for Text (textMD) Official Web Site (Standards, Library of Congress). It was designed to be part of the METS standard, but can be used as a standalone metadata standard.

If that standard doesn’t have much software support (and that’s what it looks like), it doesn’t seem to offer much more than Dublin Core.


@xrotwng I’m not sure I’m understanding you…

  1. Are you saying that this standard is useless in the same way that Dublin Core is useless, because neither has software support?

OR

  2. Do you see a way that Dublin Core encodes the following characteristics of textual materials in a clear and straightforward manner?

encoding, character_info, language, alt_language, font_script, markup_basis, markup_language, processingNote, printRequirements, viewingRequirements, textNote, pageOrder, pageSequence.

encoding_platform, encoding_software, encoding_agent

charset, byte_order, byte_size, character_size, linebreak
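
For concreteness, here is a rough sketch of how a handful of those elements could sit together in a textMD record, built with Python’s standard library. The nesting, the namespace URI, and the sample values are my own best guess from skimming the schema, so check them against the official XSD before relying on any of it.

```python
import xml.etree.ElementTree as ET

# Namespace as listed on the LoC textMD pages -- an assumption here; verify
# against the schema version you actually target.
TEXTMD_NS = "info:lc/xmlns/textMD-v3"
ET.register_namespace("textmd", TEXTMD_NS)

def q(tag: str) -> str:
    """Qualify a tag name with the textMD namespace."""
    return f"{{{TEXTMD_NS}}}{tag}"

textmd = ET.Element(q("textMD"))

# character_info groups the byte/character-level facts:
# charset, byte_order, byte_size, character_size, linebreak.
char_info = ET.SubElement(textmd, q("character_info"))
ET.SubElement(char_info, q("charset")).text = "UTF-8"
ET.SubElement(char_info, q("byte_order")).text = "little"
ET.SubElement(char_info, q("byte_size")).text = "8"
ET.SubElement(char_info, q("character_size")).text = "variable"
ET.SubElement(char_info, q("linebreak")).text = "LF"

# Language of the text and how it is marked up.
ET.SubElement(textmd, q("language")).text = "eng"
ET.SubElement(textmd, q("markup_basis")).text = "XML"
ET.SubElement(textmd, q("markup_language")).text = "TEI P5"

print(ET.tostring(textmd, encoding="unicode"))
```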

Oh, I think Dublin Core does have software support - or is at least present in the minds of people dealing with metadata. And a MIME Type specified as dc:format would probably go a long way towards conveying characteristics of textual material. I admit that it wouldn’t be as specific as the properties in textMD.
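
To make that concrete, here is a sketch of how far dc:format gets you, assuming a simple oai_dc-style record. The MIME type’s charset parameter covers the character encoding, but there is no obvious Dublin Core home for byte order, linebreak style, markup language, or page order.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

record = ET.Element("record")

# dc:format holding a MIME type; the charset parameter is about as much
# encoding detail as plain Dublin Core will carry.
ET.SubElement(record, f"{{{DC_NS}}}format").text = "text/plain; charset=UTF-8"

# dc:language covers textMD's language element...
ET.SubElement(record, f"{{{DC_NS}}}language").text = "eng"

# ...but byte_order, linebreak, markup_language, pageOrder, etc. have no
# dedicated Dublin Core term short of stuffing them into dc:description.
print(ET.tostring(record, encoding="unicode"))
```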

I guess I can’t really understand the use cases for the textMD standard.

From the website for the standard:

textMD was originally created by the New York University Digital Library Team (NYU), and had been maintained by NYU through the current version (2.2). In October 2007, LoC assumed maintenance of textMD. This has entailed creating a listserv for fostering discussion, authoring this official web site, as well as working on updates to the textMD schema that will offer new elements.

My understanding is that the standard was first used as an in-house information format by a team using MODS, and was then opened up to broader use as other institutions started to show interest. That interest makes sense: institutional activities and projects can span decades, while the workflow can evolve within a single project. This metadata offers a way to track how workflow changes affect outputs, and it allows maintenance agencies to target specific corpora for updates at the encoding level.

I know that unexpected encoding differences, including custom encodings due to custom font usage, were an issue that Yi and @cbowern encountered in their work. They talk about it like this:

Other issues surrounding downloads included non-functioning download buttons or downloads that resulted in unreadable data. AILLA’s download links are blocked by Chrome and Firefox browsers due to security settings, and could only be accessed by changing web browsers. The Hindu-Kush Areal Typology, while not strictly an archive, had a bulk download option for wordlists. However, users had to ensure that they were properly opening the UTF-8 encoded CSV file in order to read the data without broken text. While workarounds like these exist, they may deter users with less familiarity with technology from using such archives effectively.

This is the first metadata schema I have come across which formalizes these important technical attributes of a digital text. I’m sure you have dealt with line-ending characters across OSes… they are a real pain. But knowing that those issues will need to be addressed before selecting a text corpus helps set workload expectations. Lots of digital documents are not in UTF-8 or UTF-16, especially in languages whose scripts have only recently been added to Unicode, or have not been added at all.
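
For what it’s worth, this is the kind of triage script I end up writing before committing to a corpus when that information isn’t recorded anywhere. It’s only a sketch: the directory name is made up, and a real workflow would use a proper charset detector rather than the try-decode loop below.

```python
from pathlib import Path

def sniff_text_file(path: Path) -> dict:
    """First-pass guess at two properties textMD records: charset and linebreak."""
    raw = path.read_bytes()

    # Try a few encodings in order. latin-1 never fails, so it is only a
    # last-resort label; a real pipeline would use a detection library.
    charset = "unknown"
    for candidate in ("utf-8", "utf-16", "latin-1"):
        try:
            raw.decode(candidate)
            charset = candidate
            break
        except UnicodeDecodeError:
            continue

    # Line endings differ across OSes: CR/LF (Windows), LF (Unix), CR (classic Mac).
    if b"\r\n" in raw:
        linebreak = "CR/LF"
    elif b"\n" in raw:
        linebreak = "LF"
    elif b"\r" in raw:
        linebreak = "CR"
    else:
        linebreak = "unknown"

    return {"charset": charset, "linebreak": linebreak}

# Hypothetical usage over a directory of downloaded wordlists:
for csv_file in sorted(Path("corpus_downloads").glob("*.csv")):
    print(csv_file.name, sniff_text_file(csv_file))
```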