We are all familiar with the problem of “legacy” documentation: important materials which are pre-digital, and thus have not made the leap into the digital era. It takes a ton of work to do such digitization — up to and including re-typing entire works.
But what about materials that were published to the web, only for those sites to go dark? That phenomenon is doubly sad, since not only is the content no longer available, it’s digital content that’s being removed.
I happen to have been poking around in materials about Nakh-Daghestanian lately (as one does, for no particular reason), and somehow ended up on a page about Chechen. Except, it’s not actually there. Here’s the link:
http://socrates.berkeley.edu/~chechen/Ch_sdcht_lat.htm
And we get nothing: everything was essentially deleted from Berkeley’s servers in 2018.
Well, the site is still available in the Internet Archive’s Wayback Machine, here:
https://web.archive.org/web/*/http://socrates.berkeley.edu/~chechen/Ch_sdcht_lat.htm
Servers are volatile
Here’s the thing. We’re talking about Berkeley here. This is not an institution that lacks resources. If that university is going to let its content fade away, then we should assume that any institution’s content is subject to fading away.
The problem can be even worse for languages with very limited resources online. Consider Beja, a Cushitic language (hi @Sosal!) which is of great typological interest. All of the external references from the Wikipedia article about the language are links to archived versions available only in the Internet Archive:
- Sakanab - Bidhawyet
  https://web.archive.org/web/20090414041240/http://www.sakanab.co.uk/bidhawyet.htm
- Klaus Wedekind - Linguistics : INDEX-Page
- “Beja cultural research association”
  https://web.archive.org/web/20110725053758/http://bejaculture.org/CONSTITUTION.html
This is not the internet we want. The Internet Archive is a wonderful thing and all (I donate every year), but there should not be a single point of failure for language documentation.
What’s the solution?
What do you think? How do we combat this problem? And how do we protect new documentation on the web? After all, it is not impossible that one of the important archives that we rely on in linguistics could lose funding, heaven forfend. What happens then?
I can think of a few strategies, at least:
Publish to multiple places on the web.
Don’t just put your content in an archive. Put it in an archive, and elsewhere. This could be a domain you manage yourself, or GitHub Pages, or… I don’t know, maybe more than one archive? (That last idea is probably a bad one. See @JROSESLA’s comment below.) Wherever your mirrors live, it’s worth checking periodically that they’re still up; see the sketch below.
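As a rough illustration, here’s a minimal mirror-checking script using only the Python standard library. The URLs are hypothetical placeholders; substitute whatever locations you actually publish to.

```python
from urllib.request import Request, urlopen

# Hypothetical mirror URLs; substitute wherever your copies actually live.
MIRRORS = [
    "https://example.org/chechen/",         # a domain you manage yourself
    "https://yourname.github.io/chechen/",  # GitHub Pages
]

def is_up(url: str) -> bool:
    """Return True if the mirror answers a HEAD request with a 2xx status."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=10) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # URLError, HTTPError, and timeouts are all OSError subclasses.
        return False

for url in MIRRORS:
    print(("up  " if is_up(url) else "DOWN") + f"  {url}")
```

Run on a schedule (cron, a CI job, whatever you have), this at least tells you when a mirror has quietly gone dark.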
Make licensing status clear
Publish documentation with a clear statement of how the content can be reused. If you want your work to be spread far and wide, then say so. https://creativecommons.org/ might be a good option for specifying how content can be reused.
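One lightweight way to do this is to embed the license in machine-readable metadata published alongside the data, so the reuse terms travel with the content. A minimal sketch (the title and creator are placeholders, and CC BY 4.0 is just one possible choice):

```python
import json

# Placeholder metadata for a published documentation dataset.
# The license field points at a Creative Commons deed, so both
# humans and crawlers can tell how the content may be reused.
metadata = {
    "title": "A hypothetical Chechen lexicon",
    "creator": "Your Name",
    "license": "https://creativecommons.org/licenses/by/4.0/",
}

with open("metadata.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, ensure_ascii=False, indent=2)
```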
Publish machine-readable versions of the data as well as human-readable versions.
PDFs look nice, but they are hard to parse, search, and reformat. A data file like .csv, .xml, or .json is comparatively easy to do those things with. By publishing both instantiations of the documentation, the data is more future-proof.
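As a minimal sketch of what that can look like: the same (placeholder) lexical entries written out as both CSV and JSON. The field names here are hypothetical; a real lexicon would carry far more structure.

```python
import csv
import json

# Placeholder lexical entries; in practice these would come from
# your documentation database or elicitation notes.
entries = [
    {"headword": "example-word-1", "gloss": "example gloss 1", "pos": "noun"},
    {"headword": "example-word-2", "gloss": "example gloss 2", "pos": "verb"},
]

fieldnames = ["headword", "gloss", "pos"]

# CSV: opens easily in spreadsheets and command-line tools.
with open("lexicon.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(entries)

# JSON: preserves structure and is trivial to parse programmatically.
with open("lexicon.json", "w", encoding="utf-8") as f:
    json.dump(entries, f, ensure_ascii=False, indent=2)
```

If the human-readable PDF is generated from this same source file, the two versions can never drift apart.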