Are there feeds of new archive deposits?

It occurred to me that it would be interesting to see updates from language documentation archives (AILLA, ELAR, PARADISEC, etc.) when a new deposit is made available, for several reasons:

:raised_hands: We should celebrate our colleagues’ accomplishments!
:mega: We should help bring attention to newly documented languages
:classical_building: We should try to learn from how recent archival repositories are put together

I know announcements of this kind go up on Twitter and blogs and mailing lists and stuff, but I figured it might be fun to try to work together to build a little list ourselves.

I’ll see if I can find a few to add below, feel free to add to this list.

1 Like

So I spent some time on this today and honestly, I didn’t find a whole lot. The only announcements I have seen are via Twitter — some archives have blogs, but often those are used more for announcements (conferences, calls for papers, grant news, etc) than for changes to the archives themselves.

When you step back and think about it, it’s kind of weird. Wouldn’t one expect archives to be highlighting deposits? Or am I missing stuff?

Had to think about this for a minute, but it seems like OLAC should have this information, and if there is no feed, there should still be a way to get recent updates via search, as in:

http://dla.library.upenn.edu/dla/olac/search.html?sort=last_update_sort%20desc&showall=sort&fq=dcmi_type_facet%3A"Collection"

2 Likes

@joeylovestrand's solution should work. But you can also “cut out the middleman”, i.e. query the same data that OLAC itself queries: an archive’s OAI-PMH data provider. OAI-PMH allows specifying a from parameter for the ListRecords verb, so all PARADISEC records added or changed since the start of 2022 are at https://catalog.paradisec.org.au/oai/item?verb=ListRecords&from=2022-01-01&metadataPrefix=olac
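To make the shape of such a request concrete, here is a minimal sketch that assembles a ListRecords URL from a base URL and a few of the protocol's arguments. The function and interface names (`listRecordsUrl`, `ListRecordsOptions`) are made up for illustration; only the argument names (`verb`, `from`, `until`, `set`, `metadataPrefix`) come from OAI-PMH itself.

```typescript
// Arguments for an OAI-PMH ListRecords request.
// The interface name is hypothetical; the field names are the protocol's own.
interface ListRecordsOptions {
  metadataPrefix: string; // e.g. "olac" or "oai_dc"
  from?: string;          // earliest datestamp to include, e.g. "2022-01-01"
  until?: string;         // latest datestamp to include
  set?: string;           // optional set name, if the archive defines sets
}

// Build a ListRecords URL for the given OAI-PMH base URL.
function listRecordsUrl(baseUrl: string, options: ListRecordsOptions): string {
  const params = new URLSearchParams({ verb: "ListRecords" });
  if (options.from) params.set("from", options.from);
  if (options.until) params.set("until", options.until);
  if (options.set) params.set("set", options.set);
  params.set("metadataPrefix", options.metadataPrefix);
  return `${baseUrl}?${params.toString()}`;
}

// Reproduces the PARADISEC query from this post.
const paradisecUrl = listRecordsUrl("https://catalog.paradisec.org.au/oai/item", {
  from: "2022-01-01",
  metadataPrefix: "olac",
});
console.log(paradisecUrl);
```

Swapping in another archive's base URL should be all it takes to point the same query elsewhere, assuming that archive supports the `olac` metadata prefix.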

2 Likes

Oh and an archive’s OAI-PMH “end point” is listed as “Base URL” on OLAC’s archive details page, e.g. OLAC - Archive details

1 Like

I’d still say that an OAI-PMH data provider isn’t exactly “highlighting deposits” :slight_smile:

1 Like

Huh, interesting, thanks @joeylovestrand!

FWIW I did find a feed link in there (RSS):

http://dla.library.upenn.edu/dla/olac/feeds/search.rss?sort=last_update_sort%20desc&showall=sort&fq=dcmi_type_facet%3A"Collection"&

Nice. That makes it pretty easy to generate an HTML page like the one on OLAC dynamically. I wrote a crude little Deno script to do that:

Script to convert OLAC feed into HTML
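import { DOMParser, Element } from "https://deno.land/x/deno_dom/deno-dom-wasm.ts";

// OLAC search feed: collections, most recently updated first
const url = "http://dla.library.upenn.edu/dla/olac/feeds/search.rss?sort=last_update_sort%20desc&showall=sort&fq=dcmi_type_facet%3A%22Collection%22&";

const response = await fetch(url);
if (!response.ok) throw new Error(`Feed request failed: ${response.status}`);
const xml = await response.text();

// deno_dom only offers an HTML parser, so the RSS is read as HTML
const dom = new DOMParser().parseFromString(xml, "text/html");
if (!dom) throw new Error("Could not parse feed");

// Escape text before interpolating it into the page
const escapeHtml = (s: string) =>
  s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;").replace(/"/g, "&quot;");

// Trimmed text of the first matching child element, or "" if there is none
const text = (item: Element, selector: string) =>
  item.querySelector(selector)?.textContent?.trim() ?? "";

const links = Array.from(dom.querySelectorAll("item")).map((node) => {
  const item = node as Element;
  // An HTML parser treats <link> as a void element; if that leaves it empty,
  // the URL is in the text node right after it
  const linkEl = item.querySelector("link");
  const link = (linkEl?.textContent || linkEl?.nextSibling?.textContent || "").trim();
  return { link, title: text(item, "title"), description: text(item, "description") };
});

const page = `<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Recent Language Archive Deposits</title>
</head>
<body>
<h1>Recent Language Archive Deposits</h1>

<ul>
${links
  .map((l) => `<li><a href="${escapeHtml(l.link)}">${escapeHtml(l.title)}</a> ${escapeHtml(l.description)}</li>`)
  .join("\n")}
</ul>

</body>
</html>`;

Deno.writeTextFileSync("archive-feed.html", page);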

Crude, but it does what it says on the tin:

http://docling.net/archive-feed.html

Obviously this is sort of pointless given that the page is already online with that exact information; but XML is much easier to parse than HTML. Maybe, for instance, we could figure out a way to publish this feed to this forum automatically.
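One piece that automatic publishing would need is a way to tell which feed items are new since the last run. Here is a rough sketch of that step only, assuming items shaped like the `{link, title, description}` objects from the script above and assuming the link is a stable identifier for an item. The names `FeedItem` and `unseenItems` and the example.org links are made up for illustration, and actually posting to the forum is left out.

```typescript
// Same shape as the objects the script above builds from each <item>.
interface FeedItem {
  link: string;
  title: string;
  description: string;
}

// Return the items whose link has not been seen yet, and remember them.
// `seen` would have to be persisted between runs (e.g. in a file); that part is not shown.
function unseenItems(items: FeedItem[], seen: Set<string>): FeedItem[] {
  const fresh: FeedItem[] = [];
  for (const item of items) {
    if (seen.has(item.link)) continue;
    seen.add(item.link);
    fresh.push(item);
  }
  return fresh;
}

// Made-up data: one item already announced, one new.
const seenLinks = new Set<string>(["https://example.org/collection/1"]);
const latest: FeedItem[] = [
  { link: "https://example.org/collection/1", title: "Old collection", description: "" },
  { link: "https://example.org/collection/2", title: "New collection", description: "" },
];
const toAnnounce = unseenItems(latest, seenLinks);
console.log(toAnnounce.map((item) => item.title));
```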

Man, it’s exciting to have so much expertise in the room. :rocket:

I confess I have never dug into the OLAC docs, and I should have — the URL you link provides more granular data, which could be useful. Considering just the first record:

<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:paradisec.org.au:AC1-220</identifier>
    <datestamp>2022-02-09T22:26:10Z</datestamp>
  </header>
  <metadata>
    <olac:olac xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns:dcterms="http://purl.org/dc/terms/"
      xmlns:olac="http://www.language-archives.org/OLAC/1.1/" xsi:schemaLocation="&#xA;          http://www.openarchives.org/OAI/2.0/oai_dc/&#xA;          http://www.openarchives.org/OAI/2.0/oai_dc.xsd&#xA;          http://purl.org/dc/elements/1.1/&#xA;          http://dublincore.org/schemas/xmls/qdc/2006/01/06/dc.xsd&#xA;          http://purl.org/dc/terms/&#xA;          http://www.language-archives.org/OLAC/1.1/dcterms.xsd&#xA;          http://www.language-archives.org/OLAC/1.1/&#xA;          http://www.language-archives.org/OLAC/1.1/olac.xsd&#xA;        ">
      <dc:title>Revepe (Holvanua), Maewo 'Prodigal Son'; Baiap (Ambrym) Word List.</dc:title>
      <dc:identifier>AC1-220</dc:identifier>
      <dc:identifier xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220</dc:identifier>
      <dc:subject xsi:type="olac:linguistic-field" olac:code="language_documentation"/>
      <dcterms:created xsi:type="dcterms:W3CDTF">1970-01-01</dcterms:created>
      <dc:date xsi:type="dcterms:W3CDTF">1970-01-01</dc:date>
      <dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-IMG_01.tif</dcterms:tableOfContents>
      <dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-IMG_01.jpg</dcterms:tableOfContents>
      <dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-IMG_03.tif</dcterms:tableOfContents>
      <dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-IMG_03.jpg</dcterms:tableOfContents>
      <dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-IMG_05.tif</dcterms:tableOfContents>
      <dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-IMG_05.jpg</dcterms:tableOfContents>
      <dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-IMG_04.tif</dcterms:tableOfContents>
      <dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-IMG_04.jpg</dcterms:tableOfContents>
      <dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-IMG_02.tif</dcterms:tableOfContents>
      <dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-IMG_02.jpg</dcterms:tableOfContents>
      <dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-A.wav</dcterms:tableOfContents>
      <dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-A.mp3</dcterms:tableOfContents>
      <dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-A.eaf</dcterms:tableOfContents>
      <dc:contributor xsi:type="olac:role" olac:code="compiler">Arthur Capell</dc:contributor>
      <dc:contributor xsi:type="olac:role" olac:code="recorder">Arthur Capell</dc:contributor>
      <dc:subject xsi:type="olac:language" olac:code="bpa"/>
      <dc:subject xsi:type="olac:language" olac:code="mwo"/>
      <dc:subject xsi:type="olac:language" olac:code="pgk"/>
      <dc:language xsi:type="olac:language" olac:code="bpa"/>
      <dc:language xsi:type="olac:language" olac:code="mwo"/>
      <dc:language xsi:type="olac:language" olac:code="pgk"/>
      <dc:format>Digitised: yes
Media: LR Audio-tape Type 961. Plastic spool. No tape lead-in. Good condition.
Audio Notes: Operator: Nicholas Fowler-Gilmore 
Tape Machine: StuderA810
Soundcard: RME HDSPe AIO 
A/D Converter: DAD2402 
File: 24bit96kHz, Stereo 
Speed: 3.75ips 
Listening Quality: Good. </dc:format>
      <dc:coverage xsi:type="dcterms:ISO3166">VU</dc:coverage>
      <dc:coverage xsi:type="dcterms:Box">northlimit=-15.026; southlimit=-16.312; westlimit=167.614; eastlimit=168.165</dc:coverage>
      <dc:type xsi:type="olac:linguistic-type" olac:code="primary_text"/>
      <dc:subject xsi:type="olac:linguistic-field" olac:code="text_and_corpus_linguistics"/>
      <dc:type xsi:type="dcterms:DCMIType">Sound</dc:type>
      <dcterms:accessRights>Open (subject to agreeing to PDSC access conditions)</dcterms:accessRights>
      <dc:rights>Open (subject to agreeing to PDSC access conditions)</dc:rights>
      <dcterms:bibliographicCitation>Arthur Capell (collector), Arthur Capell (recorder), 1970. Revepe (Holvanua), Maewo 'Prodigal Son'; Baiap (Ambrym) Word List.. TIFF/JPEG/X-WAV/MPEG/XML.  AC1-220 at catalog.paradisec.org.au. https://dx.doi.org/10.4225/72/56E97D93249EF</dcterms:bibliographicCitation>
      <dc:description>Audit of file (20220210) suggests only two languages on this recording, perhaps Rerep (Malekula) and Baiap (at 26:34) . Marked Side 1/2 on box, but on tape, side 1. is identified as side 2. -- Side 1: Revepe (Holvanua), Maewo 'Prodigal Son' - The first is Retep or Pangkumu, an Austronesian dialect of East Malekula, Vanuatu;  Maewo is an island much further north. --  Side 2: Baiap (Ambrym) Word List - Dialect of the Ambryn Island Austronesian language Dakaka, Central Vanuatu.
(no side b). Language as given: Revepe (Holvanua), Maewo, Baiap (Ambrym)</dc:description>
    </olac:olac>
  </metadata>
</record>

So from there we can get to this bit:

Revepe (Holvanua), Maewo 'Prodigal Son'; Baiap (Ambrym) Word List.

Which is informative but unfortunately not really structured: it’s not clear to me what this means — presumably Revepe is a speaker, and Holvanua a place? Or is Maewo Revepe a person’s name, maybe? Etc.

Still, it would be useful to someone who is a specialist in this area to be informed of this data.
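The record does carry more structured fields alongside the title, such as the datestamp and the `olac:language` codes. As a rough sketch of how one might pull those out, here is a regex-based extractor run on an abridged copy of the record above (namespace declarations and most elements omitted). The names `RecordSummary`, `firstMatch` and `summarizeRecord` are made up, and regexes are a shortcut that assumes the attribute order shown in this record; a real XML parser would be more robust.

```typescript
// A few fields of interest from one OAI-PMH <record>.
interface RecordSummary {
  identifier: string | null;
  datestamp: string | null;
  title: string | null;
  languages: string[]; // olac:language codes, each listed once
}

// First capture group of `pattern` in `xml`, trimmed, or null if there is no match.
function firstMatch(xml: string, pattern: RegExp): string | null {
  const match = pattern.exec(xml);
  return match ? match[1].trim() : null;
}

function summarizeRecord(xml: string): RecordSummary {
  // <dc:subject> and <dc:language> both carry language codes; collect each code once.
  const languagePattern = /<dc:(?:subject|language)\s+xsi:type="olac:language"\s+olac:code="([^"]+)"/g;
  const languages: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = languagePattern.exec(xml)) !== null) {
    if (languages.indexOf(match[1]) === -1) languages.push(match[1]);
  }
  return {
    identifier: firstMatch(xml, /<identifier>([^<]*)<\/identifier>/),
    datestamp: firstMatch(xml, /<datestamp>([^<]*)<\/datestamp>/),
    title: firstMatch(xml, /<dc:title>([^<]*)<\/dc:title>/),
    languages,
  };
}

// Abridged version of the PARADISEC record quoted above.
const sample = `<record>
  <header>
    <identifier>oai:paradisec.org.au:AC1-220</identifier>
    <datestamp>2022-02-09T22:26:10Z</datestamp>
  </header>
  <metadata>
    <olac:olac>
      <dc:title>Revepe (Holvanua), Maewo 'Prodigal Son'; Baiap (Ambrym) Word List.</dc:title>
      <dc:subject xsi:type="olac:language" olac:code="bpa"/>
      <dc:subject xsi:type="olac:language" olac:code="mwo"/>
      <dc:language xsi:type="olac:language" olac:code="bpa"/>
    </olac:olac>
  </metadata>
</record>`;

const summary = summarizeRecord(sample);
console.log(summary);
```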

As in many other cases with linguistic data there seems to be a lack of transparent re-use cases. OLAC doesn’t seem to be used very systematically by many, and the OAI-PMH data from archives is probably only used by OLAC, so there’s not much feedback on its usability either.

But as I said elsewhere, having more people in linguistics in both roles, as data creators and as data users (including users of other people’s data), could be the way out of this dilemma.

1 Like

Neat. The <select> on that page turns up something that’s interesting in its own right: a listing of language archives. Putting it here for the heck of it.


  1. Aboriginal Studies Electronic Data Archive (ASEDA)
  2. Academia Sinica Collections
  3. AfBo: A world-wide survey of affix borrowing
  4. African Language Materials Archive
  5. Alaska Native Language Archive
  6. APiCS Online
  7. Archive of the Indigenous Languages of Latin America (AILLA)
  8. BAS Repository
  9. C’ek’aedi Hwnax Ahtna Regional Linguistic and Ethnographic Archive
  10. California Language Archive
  11. Central Institute of Indian Languages: Publications
  12. CHILDES Data repository
  13. COllections de COrpus Oraux Numeriques (CoCoON ex-CRDO)
  14. Comparative Corpus of Spoken Portuguese
  15. The Crúbadán Project
  16. Dictionaria
  17. A Digital Archive of Research Papers in Computational Linguistics
  18. ELRA Catalogue of Language Resources
  19. Endangered Languages Archive
  20. Ethnologue: Languages of the World
  21. Eurac Research CLARIN Centre
  22. Glottolog 4.5
  23. Graduate Institute of Applied Linguistics Library
  24. ILC-CNR for CLARIN-IT repository hosted at Institute for Computational Linguistics “A. Zampolli”, National Research Council, in Pisa
  25. IULA UPF OAI Archive
  26. Kaipuleohone
  27. The Language Archive
  28. Language Commons Language Corpora
  29. Language Documentation and Conservation
  30. Language resources at the Text Laboratory
  31. LAPSyD
  32. The LDC Corpus Catalog
  33. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
  34. The LINGUIST List Language Resources
  35. Living Archive of Aboriginal Languages
  36. Lund University Humanities Lab corpusserver
  37. Magoria Books’ Carib and Romani Archive
  38. Multimodal Learning and teaching Corpora Exchange
  39. The Natural Language Software Registry
  40. ODIN - The Online Database of Interlinear Text
  41. Oxford Text Archive
  42. Pacific And Regional Archive for Digital Sources in Endangered Cultures (PARADISEC)
  43. Pacific Collection at the University of Hawai’i at Mānoa Hamilton Library
  44. PHOIBLE 2.0
  45. POLLEX-Online
  46. The Rosetta Project: A Long Now Foundation Library of Human Language
  47. SAILS Online
  48. SIL Language and Culture Archives
  49. Slovenian language resource repository CLARIN.SI
  50. The Sociolinguistic Archive and Analysis Project (SLAAP)
  51. Speech and Language Data Repository (SLDR/ORTOLANG)
  52. Surrey Morphology Group Databases
  53. TALKBANK Data repository
  54. Tibetan and Himalayan Digital Library
  55. transnewguinea.org
  56. TST-Centrale
  57. The Typological Database Project
  58. U Bielefeld Language Archive
  59. WALS Online
  60. WALS Online RefDB
  61. Webonary Sites
  62. WOLD

Before you get too excited, also check the last column in the table here Open Language Archives Community and the “Current as of” date on details pages like OLAC - Archive details

1 Like

Oh I’m never excited, I’m very blasé. :expressionless:

You mean the fact that so many archives are inactive?

Wait, I do get excited. :crazy_face:

Random note:

Ian Maddieson & company’s LAPSyD phonological typology database has a “recent updates” sidebar (though it’s not a feed).

But that does turn up recent work — Dahalo was updated a few days ago:

I don’t know if the archives are inactive. Often, I’d guess, the OAI-PMH interface may just be neglected, possibly because OLAC is considered irrelevant? Unfortunately, OAI-PMH is a somewhat cumbersome protocol, but there’s a cheap way to support it called a static repository gateway, which I’d guess a couple of the archives are using. You basically just put a file somewhere on a server. That’s cheap, and easy to forget about.

I know that I’m in charge of roughly 25% of the listed archives that could be crawled successfully. And with the exception of Glottolog, the data in those rarely changes; if you wanted to know, you’d be better off checking CLDF Datasets · GitHub for activity.


2 Likes

Oh, speaking of CLDF Datasets · GitHub : Released versions of these datasets are archived with Zenodo and appear in its cldf-datasets community - which has an OAI-PMH feed: https://zenodo.org/oai2d?verb=ListRecords&set=user-cldf-datasets&metadataPrefix=oai_dc :slight_smile:

1 Like

@pathall your question is not well formed, in my opinion. Contrary to your assertion, the data is well formed (structured): it is in a documented XML format. What is not well formed in this case is the content of the title element. Titles of collections and individual works are created under the auspices of the individual archives. If an archive allows a crazy title, then OLAC displays a crazy title. I sympathize with you that the title in this case is not great. But that is not OLAC’s fault; it is PARADISEC’s fault for allowing uninformative titles to be created. Archivists (outside of linguistics) have ample documentation and rules for how collections should be titled. One standard that I think is really logical is DACS.

1 Like

@xrotwng Can you help me better understand what you mean by “OLAC doesn’t seem to be used very systematically by many”?

1 Like

Ok, I hope the “seem to be used” made it clear that there may be an observer bias involved here. Anyway, I’d say I know a lot of linguists, in particular of the diversity linguistics kind, and I haven’t heard of a single one who is using OLAC to find language data. It may still be the case that OLAC data is aggregated elsewhere as well, e.g. in Virtual Language Observatory (VLO) | CLARIN ERIC, but then I don’t know of many users of that place either.

2 Likes

In your experience, where are people going to find their data? Google? What does the discovery path look like? For example, what are the tools and data stores along the following assumed user path?

Question → discovery tool → search results → filtered results → investigated results → acquired data → used data → published answer to question.

1 Like

Well, I think your assumed user path isn’t traveled too much :slight_smile: Many linguists just work with “their own data” - no need for discovery here. Many NLP people wouldn’t know about OLAC - and aren’t too picky about data quality, maybe? For lexical data, for example, they may just go to Wiktionary. At Glottolog, we sometimes get questions about where to find data. So, there are many different paths and not too many users, which makes it difficult for aggregators like OLAC to establish a use case / business model or whatever you want to call it.

1 Like