The FoLiA data format

lgessler · June 7, 2022, 6:01am

The FoLiA annotation standard came up today in our ComputEL-5 discussion, and not everyone was familiar with it, so I thought it would be worth describing briefly here, especially since its design goals take strong stances on some of the issues we’ve been discussing. A caveat before we proceed: FoLiA is an XML format, and 2022 more than ever is the winter of XML, but we should look past this arguably superficial aspect and focus on the more interesting parts.

FoLiA is a data format that aims to cover a very wide range of annotation types explicitly. Or as they put it, it uses “as few ad-hoc provisions for annotation types as possible”. See for example this morphological annotation. Well over 50 annotation types have explicit support in the standard, including content (text, speech), inline annotations (PoS, lemma, sense), span annotations (syntactic chunking, entities, modality), structure (paragraph, quote), subtoken annotations (morphological, phonological), and many more.
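
To give a feel for what "explicit support by name" looks like, here is a rough sketch in Python. The XML fragment is hand-written and heavily simplified: it is modeled on FoLiA's `w`, `t`, `pos`, `lemma`, `morphology`, and `morpheme` elements, but it leaves out the metadata and annotation declarations a real document carries, so it probably would not validate against the schema. The `read_tokens` helper and the IDs are made up for illustration, and the parsing uses only the standard library rather than any FoLiA tooling.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hand-written, simplified fragment (assumption: not a schema-valid document).
SAMPLE = """\
<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="example">
  <text xml:id="example.text">
    <s xml:id="example.s.1">
      <w xml:id="example.s.1.w.1">
        <t>cats</t>
        <pos class="N"/>
        <lemma class="cat"/>
        <morphology>
          <morpheme><t>cat</t></morpheme>
          <morpheme><t>s</t></morpheme>
        </morphology>
      </w>
    </s>
  </text>
</FoLiA>
"""


def read_tokens(xml_string):
    """Collect text, PoS, lemma, and morphemes for each <w> element.

    Hypothetical helper for this post; each annotation type is looked up
    by its own element name, which is the point being illustrated.
    """
    ns = {"f": FOLIA_NS}
    root = ET.fromstring(xml_string)
    tokens = []
    for w in root.iterfind(".//f:w", ns):
        pos = w.find("f:pos", ns)
        lemma = w.find("f:lemma", ns)
        tokens.append({
            "id": w.get(f"{{{XML_NS}}}id"),
            # Direct child <t> only, so morpheme text is not picked up here.
            "text": w.findtext("f:t", namespaces=ns),
            "pos": pos.get("class") if pos is not None else None,
            "lemma": lemma.get("class") if lemma is not None else None,
            "morphemes": [
                m.findtext("f:t", namespaces=ns)
                for m in w.iterfind("f:morphology/f:morpheme", ns)
            ],
        })
    return tokens


if __name__ == "__main__":
    for token in read_tokens(SAMPLE):
        print(token)
```

Notice that every layer (PoS, lemma, morphology) has its own dedicated element rather than a generic "annotation" element with a type attribute, and that nobody would want to type this by hand for a whole corpus.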

I won’t say too much more, only that it’s one of the most extreme formats I’m aware of in two specific respects. First, it is machine-oriented rather than human-oriented: it’s not designed to be written or read by humans directly. Rather, it’s meant for use in software, which will hopefully provide humans with more humane facilities for viewing and editing. Second, it tries to account for virtually every kind of linguistic analysis that’s out there, by name, instead of providing general facilities that could be adapted to particular annotation types.