Are there feeds of new archive deposits?

pathall · April 3, 2022, 5:45pm

It occurred to me that it would be interesting to see updates from language documentation archives (AILLA, ELAR, PARADISEC, etc.) when a new deposit is made available, for several reasons:

We should celebrate our colleagues’ accomplishments!
We should help bring attention to newly documented languages
We should try to learn from how recent archival repositories are put together

I know announcements of this kind go up on Twitter and blogs and mailing lists and stuff, but I figured it might be fun to try to work together to build a little list ourselves.

I’ll see if I can find a few to add below, feel free to add to this list.

pathall · April 4, 2022, 12:34am

So I spent some time on this today and honestly, I didn’t find a whole lot. The only announcements I have seen are via Twitter — some archives have blogs, but often those are used more for announcements (conferences, calls for papers, grant news, etc) than for changes to the archives themselves.

When you step back and think about it, it’s kind of weird. Wouldn’t one expect archives to be highlighting deposits? Or am I missing stuff?

joeylovestrand · April 4, 2022, 11:57am

Had to think about this for a minute, but it seems like OLAC should have this information, and if not a feed, there should be a way to get recent updates via search, as in:

http://dla.library.upenn.edu/dla/olac/search.html?sort=last_update_sort%20desc&showall=sort&fq=dcmi_type_facet%3A"Collection"

xrotwng · April 4, 2022, 1:26pm

@joeylovestrand’s solution should work. But you can also “cut out the middleman”, i.e. query the data that OLAC itself queries: an archive’s OAI-PMH data provider. OAI-PMH allows specifying a from parameter for the ListRecords verb, so all PARADISEC records added or changed since the start of 2022 are at https://catalog.paradisec.org.au/oai/item?verb=ListRecords&from=2022-01-01&metadataPrefix=olac
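
For anyone who wants to build such requests in a script, here is a minimal sketch. The helper name `listRecordsUrl` is made up for illustration; `verb`, `from`, `set` and `metadataPrefix` are standard OAI-PMH request arguments, and the base URL is the PARADISEC one quoted above. Note that a real harvester would also have to follow the `resumptionToken` that providers return for long result lists, which this sketch does not do.

```typescript
// Build an OAI-PMH ListRecords request URL for a data provider.
// `listRecordsUrl` is a hypothetical helper, not part of any library.
function listRecordsUrl(
  baseUrl: string,
  metadataPrefix: string,
  options: { from?: string; set?: string } = {},
): string {
  const url = new URL(baseUrl);
  url.searchParams.set("verb", "ListRecords");
  // `from` limits the response to records whose datestamp is on or after this date
  if (options.from) url.searchParams.set("from", options.from);
  // `set` limits the response to one named subset, if the provider defines sets
  if (options.set) url.searchParams.set("set", options.set);
  url.searchParams.set("metadataPrefix", metadataPrefix);
  return url.toString();
}

const paradisec2022 = listRecordsUrl(
  "https://catalog.paradisec.org.au/oai/item",
  "olac",
  { from: "2022-01-01" },
);
console.log(paradisec2022);
```

Since the base URL is the only archive-specific part, the same helper should work for any archive whose “Base URL” is listed on OLAC.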

xrotwng · April 4, 2022, 1:28pm

Oh and an archive’s OAI-PMH “end point” is listed as “Base URL” on OLAC’s archive details page, e.g. OLAC - Archive details

xrotwng · April 4, 2022, 1:30pm

I’d still say that an OAI-PMH data provider isn’t exactly “highlighting deposits”

pathall · April 4, 2022, 1:40pm

Huh, interesting, thanks @joeylovestrand!

FWIW I did find a feed link in there (RSS):

http://dla.library.upenn.edu/dla/olac/feeds/search.rss?sort=last_update_sort%20desc&showall=sort&fq=dcmi_type_facet%3A"Collection"&

Nice. That makes it pretty easy to generate an HTML page like the one on OLAC dynamically. I wrote a crude little Deno script to do that:

Script to convert OLAC feed into HTML

import { DOMParser, Element } from "https://deno.land/x/deno_dom/deno-dom-wasm.ts";


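// RSS feed of OLAC collection records, most recently updated first
const url = `http://dla.library.upenn.edu/dla/olac/feeds/search.rss?sort=last_update_sort%20desc&showall=sort&fq=dcmi_type_facet%3A%22Collection%22&`

const response = await fetch(url)
if (!response.ok) throw new Error(`Feed request failed: ${response.status}`)
const xml = await response.text()

// The RSS is parsed with the HTML parser here, which is good enough for this purpose
const dom = new DOMParser().parseFromString(xml, 'text/html')
if (!dom) throw new Error('Could not parse feed')

// Collect link, title and description from each <item>
const links = Array.from(dom.querySelectorAll('item'))
.map(node => {
  const item = node as Element
  const linkElement = item.querySelector('link')
  // An HTML parser may treat <link> as a void element, leaving the URL in
  // the text node that follows it, so fall back to that sibling
  const link = (linkElement?.textContent || linkElement?.nextSibling?.textContent || "").trim()
  const title = (item.querySelector('title')?.textContent ?? "").trim()
  const description = (item.querySelector('description')?.textContent ?? "").trim()

  return {link, title, description}
})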

let page = `<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Recent Language Archive Deposits</title>
</head>
<body>
<h1>Recent Language Archive Deposits</h1>
  
<ul>

${links.map(link  => `<li><a href="${link.link}">${link.title}</a> ${link.description}</li>`)
.join('\n')}
</ul>

</body>
</html>` 

Deno.writeTextFileSync('archive-feed.html', page)

Crude, but it does what it says on the tin:

http://docling.net/archive-feed.html

Obviously this is sort of pointless given that the page is already online with that exact information; but XML is much easier to parse than HTML. Maybe, for instance, we could figure out a way to publish this feed to this forum automatically.
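
One caveat with the script above: titles and descriptions are interpolated into the page as-is, so a title containing `<` or `&` could break the markup. A small sketch of how the list items could be escaped first; `escapeHtml` and `renderItem` are names I made up for illustration, and the item shape simply mirrors the `{link, title, description}` objects in the script.

```typescript
// Replace the characters that have special meaning in HTML text and attributes.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Same shape as the objects the script collects from the feed.
interface FeedLink {
  link: string;
  title: string;
  description: string;
}

// Render one feed item as a list item, escaping every interpolated value.
function renderItem(item: FeedLink): string {
  return `<li><a href="${escapeHtml(item.link)}">${escapeHtml(item.title)}</a> ${escapeHtml(item.description)}</li>`;
}

const example = renderItem({
  link: "http://example.org/?a=1&b=2",
  title: "Maewo 'Prodigal Son' <draft>",
  description: "",
});
console.log(example);
```

In the script, `links.map(renderItem)` could then stand in for the inline template in the `<ul>`.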

pathall · April 4, 2022, 1:53pm

Man, it’s exciting to have so much expertise in the room.

I confess I have never dug into the OLAC docs, and I should have — the URL you link provides more granular data, which could be useful. Considering just the first record:

<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:paradisec.org.au:AC1-220</identifier>
    <datestamp>2022-02-09T22:26:10Z</datestamp>
  </header>
  <metadata>
    <olac:olac xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns:dcterms="http://purl.org/dc/terms/"
      xmlns:olac="http://www.language-archives.org/OLAC/1.1/" xsi:schemaLocation="&#xA;          http://www.openarchives.org/OAI/2.0/oai_dc/&#xA;          http://www.openarchives.org/OAI/2.0/oai_dc.xsd&#xA;          http://purl.org/dc/elements/1.1/&#xA;          http://dublincore.org/schemas/xmls/qdc/2006/01/06/dc.xsd&#xA;          http://purl.org/dc/terms/&#xA;          http://www.language-archives.org/OLAC/1.1/dcterms.xsd&#xA;          http://www.language-archives.org/OLAC/1.1/&#xA;          http://www.language-archives.org/OLAC/1.1/olac.xsd&#xA;        ">
      <dc:title>Revepe (Holvanua), Maewo 'Prodigal Son'; Baiap (Ambrym) Word List.</dc:title>
      <dc:identifier>AC1-220</dc:identifier>
      <dc:identifier xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220</dc:identifier>
      <dc:subject xsi:type="olac:linguistic-field" olac:code="language_documentation"/>
      <dcterms:created xsi:type="dcterms:W3CDTF">1970-01-01</dcterms:created>
      <dc:date xsi:type="dcterms:W3CDTF">1970-01-01</dc:date>
      <dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-IMG_01.tif</dcterms:tableOfContents>
      <dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-IMG_01.jpg</dcterms:tableOfContents>
      <dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-IMG_03.tif</dcterms:tableOfContents>
      <dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-IMG_03.jpg</dcterms:tableOfContents>
      <dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-IMG_05.tif</dcterms:tableOfContents>
      <dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-IMG_05.jpg</dcterms:tableOfContents>
      <dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-IMG_04.tif</dcterms:tableOfContents>
      <dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-IMG_04.jpg</dcterms:tableOfContents>
      <dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-IMG_02.tif</dcterms:tableOfContents>
      <dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-IMG_02.jpg</dcterms:tableOfContents>
      <dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-A.wav</dcterms:tableOfContents>
      <dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-A.mp3</dcterms:tableOfContents>
      <dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-A.eaf</dcterms:tableOfContents>
      <dc:contributor xsi:type="olac:role" olac:code="compiler">Arthur Capell</dc:contributor>
      <dc:contributor xsi:type="olac:role" olac:code="recorder">Arthur Capell</dc:contributor>
      <dc:subject xsi:type="olac:language" olac:code="bpa"/>
      <dc:subject xsi:type="olac:language" olac:code="mwo"/>
      <dc:subject xsi:type="olac:language" olac:code="pgk"/>
      <dc:language xsi:type="olac:language" olac:code="bpa"/>
      <dc:language xsi:type="olac:language" olac:code="mwo"/>
      <dc:language xsi:type="olac:language" olac:code="pgk"/>
      <dc:format>Digitised: yes
Media: LR Audio-tape Type 961. Plastic spool. No tape lead-in. Good condition.
Audio Notes: Operator: Nicholas Fowler-Gilmore 
Tape Machine: StuderA810
Soundcard: RME HDSPe AIO 
A/D Converter: DAD2402 
File: 24bit96kHz, Stereo 
Speed: 3.75ips 
Listening Quality: Good. </dc:format>
      <dc:coverage xsi:type="dcterms:ISO3166">VU</dc:coverage>
      <dc:coverage xsi:type="dcterms:Box">northlimit=-15.026; southlimit=-16.312; westlimit=167.614; eastlimit=168.165</dc:coverage>
      <dc:type xsi:type="olac:linguistic-type" olac:code="primary_text"/>
      <dc:subject xsi:type="olac:linguistic-field" olac:code="text_and_corpus_linguistics"/>
      <dc:type xsi:type="dcterms:DCMIType">Sound</dc:type>
      <dcterms:accessRights>Open (subject to agreeing to PDSC access conditions)</dcterms:accessRights>
      <dc:rights>Open (subject to agreeing to PDSC access conditions)</dc:rights>
      <dcterms:bibliographicCitation>Arthur Capell (collector), Arthur Capell (recorder), 1970. Revepe (Holvanua), Maewo 'Prodigal Son'; Baiap (Ambrym) Word List.. TIFF/JPEG/X-WAV/MPEG/XML.  AC1-220 at catalog.paradisec.org.au. https://dx.doi.org/10.4225/72/56E97D93249EF</dcterms:bibliographicCitation>
      <dc:description>Audit of file (20220210) suggests only two languages on this recording, perhaps Rerep (Malekula) and Baiap (at 26:34) . Marked Side 1/2 on box, but on tape, side 1. is identified as side 2. -- Side 1: Revepe (Holvanua), Maewo 'Prodigal Son' - The first is Retep or Pangkumu, an Austronesian dialect of East Malekula, Vanuatu;  Maewo is an island much further north. --  Side 2: Baiap (Ambrym) Word List - Dialect of the Ambryn Island Austronesian language Dakaka, Central Vanuatu.
(no side b). Language as given: Revepe (Holvanua), Maewo, Baiap (Ambrym)</dc:description>
    </olac:olac>
  </metadata>
</record>

So from there we can get to this bit:

Revepe (Holvanua), Maewo 'Prodigal Son'; Baiap (Ambrym) Word List.

Which is informative but unfortunately not really structured: it’s not clear to me what this means — presumably Revepe is a speaker, and Holvanua a… place? Or is Maewo Revepe a person’s name, maybe? Etc.

Still, it would be useful to someone who is a specialist in this area to be informed of this data.
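
The title may be free text, but the same record does carry machine-readable fields, such as the datestamp and the language codes. Here is a rough sketch of pulling those out. It uses regular expressions rather than a real XML parser, and it assumes the attribute order seen in the record above, so treat it as a shortcut for exploration only; `summarizeRecord` and `RecordSummary` are hypothetical names.

```typescript
interface RecordSummary {
  identifier: string | null;
  datestamp: string | null;
  title: string | null;
  languageCodes: string[];
}

// Return the first capture group of `pattern` in `xml`, or null if there is no match.
function firstMatch(xml: string, pattern: RegExp): string | null {
  const match = pattern.exec(xml);
  return match?.[1]?.trim() ?? null;
}

// Summarize one OAI-PMH <record> carrying OLAC metadata.
function summarizeRecord(xml: string): RecordSummary {
  const codes = new Set<string>();
  // Language codes appear on both dc:subject and dc:language; the Set removes duplicates.
  const languagePattern =
    /<dc:(?:subject|language)\s+xsi:type="olac:language"\s+olac:code="([^"]+)"/g;
  let match: RegExpExecArray | null;
  while ((match = languagePattern.exec(xml)) !== null) {
    const code = match[1];
    if (code) codes.add(code);
  }
  return {
    identifier: firstMatch(xml, /<identifier>([^<]*)<\/identifier>/),
    datestamp: firstMatch(xml, /<datestamp>([^<]*)<\/datestamp>/),
    title: firstMatch(xml, /<dc:title>([^<]*)<\/dc:title>/),
    languageCodes: Array.from(codes),
  };
}

// A shortened copy of the PARADISEC record shown above.
const sample = `
<record>
  <header>
    <identifier>oai:paradisec.org.au:AC1-220</identifier>
    <datestamp>2022-02-09T22:26:10Z</datestamp>
  </header>
  <metadata>
    <dc:title>Revepe (Holvanua), Maewo 'Prodigal Son'; Baiap (Ambrym) Word List.</dc:title>
    <dc:subject xsi:type="olac:language" olac:code="bpa"/>
    <dc:subject xsi:type="olac:language" olac:code="mwo"/>
    <dc:subject xsi:type="olac:language" olac:code="pgk"/>
    <dc:language xsi:type="olac:language" olac:code="bpa"/>
  </metadata>
</record>`;

const summary = summarizeRecord(sample);
console.log(summary.languageCodes.join(", ")); // bpa, mwo, pgk
```

With the codes and the datestamp in hand, a feed could be filtered by language for specialists, independently of how the title is worded.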

xrotwng · April 4, 2022, 2:13pm

As in many other cases with linguistic data there seems to be a lack of transparent re-use cases. OLAC doesn’t seem to be used very systematically by many, and the OAI-PMH data from archives is probably only used by OLAC, so there’s not much feedback on its usability either.

But as I said elsewhere, more people in linguistics in both roles - data creators and data users (also data of others) - could be the way out of this dilemma.

pathall · April 4, 2022, 2:14pm

Neat. The <select> on that page turns up something that’s interesting in its own right: a listing of language archives. Putting it here for the heck of it…

Aboriginal Studies Electronic Data Archive (ASEDA)
Academia Sinica Collections
AfBo: A world-wide survey of affix borrowing
African Language Materials Archive
Alaska Native Language Archive
APiCS Online
Archive of the Indigenous Languages of Latin America (AILLA)
BAS Repository
C’ek’aedi Hwnax Ahtna Regional Linguistic and Ethnographic Archive
California Language Archive
Central Institute of Indian Languages: Publications
CHILDES Data repository
COllections de COrpus Oraux Numeriques (CoCoON ex-CRDO)
Comparative Corpus of Spoken Portuguese
The Crúbadán Project
Dictionaria
A Digital Archive of Research Papers in Computational Linguistics
ELRA Catalogue of Language Resources
Endangered Languages Archive
Ethnologue: Languages of the World
Eurac Research CLARIN Centre
Glottolog 4.5
Graduate Institute of Applied Linguistics Library
ILC-CNR for CLARIN-IT repository hosted at Institute for Computational Linguistics “A. Zampolli”, National Research Council, in Pisa
IULA UPF OAI Archive
Kaipuleohone
The Language Archive
Language Commons Language Corpora
Language Documentation and Conservation
Language resources at the Text Laboratory
LAPSyD
The LDC Corpus Catalog
LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
The LINGUIST List Language Resources
Living Archive of Aboriginal Languages
Lund University Humanities Lab corpusserver
Magoria Books’ Carib and Romani Archive
Multimodal Learning and teaching Corpora Exchange
The Natural Language Software Registry
ODIN - The Online Database of Interlinear Text
Oxford Text Archive
Pacific And Regional Archive for Digital Sources in Endangered Cultures (PARADISEC)
Pacific Collection at the University of Hawai’i at Mānoa Hamilton Library
PHOIBLE 2.0
POLLEX-Online
The Rosetta Project: A Long Now Foundation Library of Human Language
SAILS Online
SIL Language and Culture Archives
Slovenian language resource repository CLARIN.SI
The Sociolinguistic Archive and Analysis Project (SLAAP)
Speech and Language Data Repository (SLDR/ORTOLANG)
Surrey Morphology Group Databases
TALKBANK Data repository
Tibetan and Himalayan Digital Library
transnewguinea.org
TST-Centrale
The Typological Database Project
U Bielefeld Language Archive
WALS Online
WALS Online RefDB
Webonary Sites
WOLD

xrotwng · April 4, 2022, 2:16pm

Before you get too excited, also check the last column in the table here Open Language Archives Community and the “Current as of” date on details pages like OLAC - Archive details

pathall · April 4, 2022, 2:18pm

Oh I’m never excited, I’m very blasé.

You mean the fact that so many archives are inactive?

Wait, I do get excited.

pathall · April 4, 2022, 2:25pm

Random note:

Ian Maddieson & company’s LAPSyD phonological typology database has a “recent updates” sidebar (though it’s not a feed).

But that does turn up recent work — Dahalo was updated a few days ago.

xrotwng · April 4, 2022, 2:26pm

I don’t know if the archives are inactive. Often, I’d guess, the OAI-PMH interface may just be neglected - possibly because OLAC is considered irrelevant? Unfortunately, OAI-PMH is a somewhat cumbersome protocol - but there’s a cheap way to support it called a static repository gateway, which I’d guess a couple of the archives are using. You basically just put a file somewhere on a server. That’s cheap, and … easy to forget about.

I know that I’m in charge of roughly 25% of the listed archives that could be crawled successfully. And with the exception of Glottolog, data in the others rarely changes, and if you wanted to know, you’d rather check CLDF Datasets · GitHub for activity …

xrotwng · April 4, 2022, 2:31pm

Oh, speaking of CLDF Datasets · GitHub : Released versions of these datasets are archived with Zenodo and appear in its cldf-datasets community - which has an OAI-PMH feed: https://zenodo.org/oai2d?verb=ListRecords&set=user-cldf-datasets&metadataPrefix=oai_dc

hp3 · June 22, 2022, 2:18pm

@pathall, your question is not well formed, in my opinion. Contrary to your assertion, the data is well formed (structured): it is in a documented XML format. What is not well formed in this case is the content of the title element. The creation of titles for collections and individual works is under the auspices of individual archives. If an archive allows a crazy title, then OLAC displays a crazy title. I agree with you that the title in this case is not great. But that is not OLAC’s fault; it is PARADISEC’s fault for allowing uninformative titles to be created. Archivists (outside of linguistics) have ample documentation and rules for how collections should be titled. One that I think is really logical is DACS (Describing Archives: A Content Standard).

hp3 · June 22, 2022, 2:20pm

@xrotwng Can you help me better understand what you mean by “OLAC doesn’t seem to be used very systematically by many”?

xrotwng · June 22, 2022, 3:41pm

Ok, I hope the “seem to be used” made it clear that there may be an observer bias involved here. Anyway, I’d say I know a lot of linguists - in particular of the diversity linguistics kind - and haven’t heard of a single one who is using OLAC to find language data. It may still be the case that OLAC data is aggregated elsewhere as well - e.g. in the Virtual Language Observatory (VLO) from CLARIN ERIC - but then, I don’t know of many users of that either.

hp3 · June 22, 2022, 5:46pm

In your experience, where are people going to find their data? Google? What does the discovery path look like? For example, what are the tools and data stores along the following assumed user path?

Question → discovery tool → search results → filtered results → investigated results → acquired data → used data → published answer to question.

xrotwng · June 22, 2022, 6:04pm

Well, I think your assumed user path isn’t traveled too much. Many linguists just work with “their own data” - no need for discovery here. Many NLP people wouldn’t know about OLAC - and aren’t too picky about data quality, maybe? For lexical data, for example, they may just go to Wiktionary. At Glottolog, we sometimes get questions about where to find data. So, there are many different paths and not too many users, which makes it difficult for aggregators like OLAC to establish a use case / business model or whatever you want to call it.