It occurred to me that it would be interesting to see updates from language documentation archives (AILLA, ELAR, PARADISEC, etc.) when a new deposit is made available, for several reasons:
We should celebrate our colleagues' accomplishments!
We should help bring attention to newly documented languages.
We should try to learn from how recent archival repositories are put together.
I know announcements of this kind go up on Twitter and blogs and mailing lists and stuff, but I figured it might be fun to try to work together to build a little list ourselves.
I'll see if I can find a few to add below, feel free to add to this list.
So I spent some time on this today and honestly, I didn't find a whole lot. The only announcements I have seen are via Twitter. Some archives have blogs, but often those are used more for general announcements (conferences, calls for papers, grant news, etc.) than for changes to the archives themselves.
When you step back and think about it, it's kind of weird. Wouldn't one expect archives to be highlighting deposits? Or am I missing stuff?
Had to think about this for a minute, but it seems like OLAC should have this information, and if not as a feed, there should be a way to get recent updates via search, as in:
Obviously this is sort of pointless given that the page is already online with that exact information; but XML is much easier to parse than HTML. Maybe, for instance, we could figure out a way to publish this feed to this forum automatically.
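A minimal sketch of what "parsing the XML" could look like, assuming the response is an OAI-PMH `ListRecords` document. The base URL, the trimmed `SAMPLE` response, and the helper names `list_records_url` and `recent_records` are all made up for illustration; only the `verb`, `metadataPrefix` and `from` request arguments and the `header`/`identifier`/`datestamp` elements come from the protocol itself.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from urllib.parse import urlencode

OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Hypothetical, heavily trimmed response; a real one carries full metadata.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header>
      <identifier>oai:paradisec.org.au:AC1-220</identifier>
      <datestamp>2022-02-09T22:26:10Z</datestamp>
    </header></record>
    <record><header>
      <identifier>oai:example.org:OLD-1</identifier>
      <datestamp>2019-05-01T00:00:00Z</datestamp>
    </header></record>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url, since, prefix="olac"):
    """Build a ListRecords request for records changed on or after `since`."""
    query = {"verb": "ListRecords", "metadataPrefix": prefix, "from": since}
    return base_url + "?" + urlencode(query)


def recent_records(xml_text, since):
    """Return (identifier, datestamp) pairs with datestamp >= since, newest first.

    Assumes seconds-granularity datestamps (YYYY-MM-DDThh:mm:ssZ).
    """
    root = ET.fromstring(xml_text)
    found = []
    for header in root.iter(OAI + "header"):
        identifier = header.findtext(OAI + "identifier")
        stamp = datetime.strptime(
            header.findtext(OAI + "datestamp"), "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if stamp >= since:
            found.append((identifier, stamp))
    return sorted(found, key=lambda pair: pair[1], reverse=True)


# "https://example.org/oai" is a placeholder, not a real endpoint.
print(list_records_url("https://example.org/oai", "2022-01-01"))

cutoff = datetime(2022, 1, 1, tzinfo=timezone.utc)
for identifier, stamp in recent_records(SAMPLE, cutoff):
    print(stamp.date(), identifier)
```

Something like this could run on a schedule and post whatever `recent_records` returns, though how to publish to the forum is a separate question.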
Man, it's exciting to have so much expertise in the room.
I confess I have never dug into the OLAC docs, and I should have - the URL you link provides more granular data, which could be useful. Considering just the first record:
<record xmlns="http://www.openarchives.org/OAI/2.0/">
<header>
<identifier>oai:paradisec.org.au:AC1-220</identifier>
<datestamp>2022-02-09T22:26:10Z</datestamp>
</header>
<metadata>
<olac:olac xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:olac="http://www.language-archives.org/OLAC/1.1/" xsi:schemaLocation="
 http://www.openarchives.org/OAI/2.0/oai_dc/
 http://www.openarchives.org/OAI/2.0/oai_dc.xsd
 http://purl.org/dc/elements/1.1/
 http://dublincore.org/schemas/xmls/qdc/2006/01/06/dc.xsd
 http://purl.org/dc/terms/
 http://www.language-archives.org/OLAC/1.1/dcterms.xsd
 http://www.language-archives.org/OLAC/1.1/
 http://www.language-archives.org/OLAC/1.1/olac.xsd
 ">
<dc:title>Revepe (Holvanua), Maewo 'Prodigal Son'; Baiap (Ambrym) Word List.</dc:title>
<dc:identifier>AC1-220</dc:identifier>
<dc:identifier xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220</dc:identifier>
<dc:subject xsi:type="olac:linguistic-field" olac:code="language_documentation"/>
<dcterms:created xsi:type="dcterms:W3CDTF">1970-01-01</dcterms:created>
<dc:date xsi:type="dcterms:W3CDTF">1970-01-01</dc:date>
<dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-IMG_01.tif</dcterms:tableOfContents>
<dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-IMG_01.jpg</dcterms:tableOfContents>
<dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-IMG_03.tif</dcterms:tableOfContents>
<dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-IMG_03.jpg</dcterms:tableOfContents>
<dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-IMG_05.tif</dcterms:tableOfContents>
<dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-IMG_05.jpg</dcterms:tableOfContents>
<dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-IMG_04.tif</dcterms:tableOfContents>
<dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-IMG_04.jpg</dcterms:tableOfContents>
<dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-IMG_02.tif</dcterms:tableOfContents>
<dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-IMG_02.jpg</dcterms:tableOfContents>
<dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-A.wav</dcterms:tableOfContents>
<dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-A.mp3</dcterms:tableOfContents>
<dcterms:tableOfContents xsi:type="dcterms:URI">http://catalog.paradisec.org.au/repository/AC1/220/AC1-220-A.eaf</dcterms:tableOfContents>
<dc:contributor xsi:type="olac:role" olac:code="compiler">Arthur Capell</dc:contributor>
<dc:contributor xsi:type="olac:role" olac:code="recorder">Arthur Capell</dc:contributor>
<dc:subject xsi:type="olac:language" olac:code="bpa"/>
<dc:subject xsi:type="olac:language" olac:code="mwo"/>
<dc:subject xsi:type="olac:language" olac:code="pgk"/>
<dc:language xsi:type="olac:language" olac:code="bpa"/>
<dc:language xsi:type="olac:language" olac:code="mwo"/>
<dc:language xsi:type="olac:language" olac:code="pgk"/>
<dc:format>Digitised: yes
Media: LR Audio-tape Type 961. Plastic spool. No tape lead-in. Good condition.
Audio Notes: Operator: Nicholas Fowler-Gilmore
Tape Machine: StuderA810
Soundcard: RME HDSPe AIO
A/D Converter: DAD2402
File: 24bit96kHz, Stereo
Speed: 3.75ips
Listening Quality: Good. </dc:format>
<dc:coverage xsi:type="dcterms:ISO3166">VU</dc:coverage>
<dc:coverage xsi:type="dcterms:Box">northlimit=-15.026; southlimit=-16.312; westlimit=167.614; eastlimit=168.165</dc:coverage>
<dc:type xsi:type="olac:linguistic-type" olac:code="primary_text"/>
<dc:subject xsi:type="olac:linguistic-field" olac:code="text_and_corpus_linguistics"/>
<dc:type xsi:type="dcterms:DCMIType">Sound</dc:type>
<dcterms:accessRights>Open (subject to agreeing to PDSC access conditions)</dcterms:accessRights>
<dc:rights>Open (subject to agreeing to PDSC access conditions)</dc:rights>
<dcterms:bibliographicCitation>Arthur Capell (collector), Arthur Capell (recorder), 1970. Revepe (Holvanua), Maewo 'Prodigal Son'; Baiap (Ambrym) Word List.. TIFF/JPEG/X-WAV/MPEG/XML. AC1-220 at catalog.paradisec.org.au. https://dx.doi.org/10.4225/72/56E97D93249EF</dcterms:bibliographicCitation>
<dc:description>Audit of file (20220210) suggests only two languages on this recording, perhaps Rerep (Malekula) and Baiap (at 26:34) . Marked Side 1/2 on box, but on tape, side 1. is identified as side 2. -- Side 1: Revepe (Holvanua), Maewo 'Prodigal Son' - The first is Retep or Pangkumu, an Austronesian dialect of East Malekula, Vanuatu; Maewo is an island much further north. -- Side 2: Baiap (Ambrym) Word List - Dialect of the Ambryn Island Austronesian language Dakaka, Central Vanuatu.
(no side b). Language as given: Revepe (Holvanua), Maewo, Baiap (Ambrym)</dc:description>
</olac:olac>
</metadata>
</record>
So from there we can get to this bit:
Revepe (Holvanua), Maewo 'Prodigal Son'; Baiap (Ambrym) Word List.
Which is informative but unfortunately not really structured: it's not clear to me what this means. Presumably Revepe is a speaker, and Holvanua a... place? Or is Maewo Revepe a person's name, maybe? Etc.
Still, it would be useful to someone who is a specialist in this area to be informed of this data.
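For what it's worth, here is a rough sketch of pulling the title and language codes out of a record like the one above with Python's standard library. The `RECORD` string is a trimmed copy of that record and `summarize` is just an illustrative helper name; the namespaces and attribute names are copied from the record itself.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}
# ElementTree exposes namespaced attributes in {namespace}name form.
XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"
OLAC_CODE = "{http://www.language-archives.org/OLAC/1.1/}code"

# Trimmed copy of the PARADISEC record quoted above.
RECORD = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
<header>
<identifier>oai:paradisec.org.au:AC1-220</identifier>
<datestamp>2022-02-09T22:26:10Z</datestamp>
</header>
<metadata>
<olac:olac xmlns:dc="http://purl.org/dc/elements/1.1/"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xmlns:olac="http://www.language-archives.org/OLAC/1.1/">
<dc:title>Revepe (Holvanua), Maewo 'Prodigal Son'; Baiap (Ambrym) Word List.</dc:title>
<dc:subject xsi:type="olac:language" olac:code="bpa"/>
<dc:subject xsi:type="olac:language" olac:code="mwo"/>
<dc:subject xsi:type="olac:language" olac:code="pgk"/>
</olac:olac>
</metadata>
</record>"""


def summarize(record_xml):
    """Collect the fields a 'new deposit' announcement would need."""
    record = ET.fromstring(record_xml)
    # Keep only the dc:subject elements typed as OLAC language codes.
    languages = [
        el.get(OLAC_CODE)
        for el in record.iterfind(".//dc:subject", NS)
        if el.get(XSI_TYPE) == "olac:language"
    ]
    return {
        "identifier": record.findtext("oai:header/oai:identifier", namespaces=NS),
        "updated": record.findtext("oai:header/oai:datestamp", namespaces=NS),
        "title": record.findtext(".//dc:title", namespaces=NS),
        "languages": languages,
    }


print(summarize(RECORD))
```

The language codes are the structured part here, so a specialist could filter on those even when the title is hard to interpret.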
As in many other cases with linguistic data, there seems to be a lack of transparent re-use cases. OLAC doesn't seem to be used very systematically by many, and the OAI-PMH data from archives is probably only used by OLAC, so there's not much feedback on its usability either.
But as I said elsewhere, more people in linguistics taking on both roles - data creators and data users (including users of other people's data) - could be the way out of this dilemma.
Neat. The <select> on that page turns up something that's interesting in its own right: a listing of language archives. Putting it here for the heck of it...
Aboriginal Studies Electronic Data Archive (ASEDA)
Academia Sinica Collections
AfBo: A world-wide survey of affix borrowing
African Language Materials Archive
Alaska Native Language Archive
APiCS Online
Archive of the Indigenous Languages of Latin America (AILLA)
BAS Repository
C'ek'aedi Hwnax Ahtna Regional Linguistic and Ethnographic Archive
California Language Archive
Central Institute of Indian Languages: Publications
CHILDES Data repository
COllections de COrpus Oraux Numeriques (CoCoON ex-CRDO)
Comparative Corpus of Spoken Portuguese
The Crúbadán Project
Dictionaria
A Digital Archive of Research Papers in Computational Linguistics
ELRA Catalogue of Language Resources
Endangered Languages Archive
Ethnologue: Languages of the World
Eurac Research CLARIN Centre
Glottolog 4.5
Graduate Institute of Applied Linguistics Library
ILC-CNR for CLARIN-IT repository hosted at Institute for Computational Linguistics "A. Zampolli", National Research Council, in Pisa
IULA UPF OAI Archive
Kaipuleohone
The Language Archive
Language Commons Language Corpora
Language Documentation and Conservation
Language resources at the Text Laboratory
LAPSyD
The LDC Corpus Catalog
LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
The LINGUIST List Language Resources
Living Archive of Aboriginal Languages
Lund University Humanities Lab corpusserver
Magoria Books' Carib and Romani Archive
Multimodal Learning and teaching Corpora Exchange
The Natural Language Software Registry
ODIN - The Online Database of Interlinear Text
Oxford Text Archive
Pacific And Regional Archive for Digital Sources in Endangered Cultures (PARADISEC)
Pacific Collection at the University of Hawai'i at Mānoa Hamilton Library
PHOIBLE 2.0
POLLEX-Online
The Rosetta Project: A Long Now Foundation Library of Human Language
SAILS Online
SIL Language and Culture Archives
Slovenian language resource repository CLARIN.SI
The Sociolinguistic Archive and Analysis Project (SLAAP)
Speech and Language Data Repository (SLDR/ORTOLANG)
I don't know if the archives are inactive. Often, I'd guess, the OAI-PMH interface may just be neglected - possibly because OLAC is considered irrelevant? Unfortunately, OAI-PMH is a protocol that's somewhat cumbersome - but there's a cheap way to support it called a static repository gateway, which I'd guess a couple of the archives are using. You basically just put a file somewhere on a server. That's cheap, and ... easy to forget about.
I know that I'm in charge of roughly 25% of the listed archives that could be crawled successfully. And with the exception of Glottolog, the data in the others rarely changes, and if you wanted to know, you'd rather check CLDF Datasets · GitHub for activity ...
@pathall your question is not well formed in my opinion. Contrary to your assertion, the data is well formed (structured). It is in a documented XML format. What is not well formed in this case is the content of the title element. The creation of titles for collections and individual works is under the auspices of individual archives. If an archive allows a crazy title, then OLAC displays a crazy title. I have solidarity with you that the title in this case is not great. But that is not OLAC's fault, that is PARADISEC's fault for allowing uninformative titles to be created. Archivists (outside of linguistics) have ample documentation and rules for how collections should be titled. One that I think is really logical is DACS.
Ok, I hope the "seem to be used" made it clear that there may be an observer bias involved here. Anyway, I'd say I know a lot of linguists - in particular of the diversity linguistics kind - and haven't heard of a single one who is using OLAC to find language data. It may still be the case that OLAC data is aggregated elsewhere as well - e.g. in Virtual Language Observatory (VLO) | CLARIN ERIC - but then, I don't know of many users of this place either.
In your experience, where are people going to find their data? Google? What does the discovery path look like? For example, what are the tools and data stores in the following assumed user path?
Question → discovery tool → search results → filtered results → investigated results → acquired data → used data → published answer to question.
Well, I think your assumed user path isn't traveled too much. Many linguists just work with "their own data" - no need for discovery here. Many NLP people wouldn't know about OLAC - and aren't too picky about data quality, maybe? For lexical data, for example, they may just go to Wiktionary. At Glottolog, we sometimes get questions about where to find data. So, there are many different paths and not too many users, which makes it difficult for aggregators like OLAC to establish a use case / business model or whatever you want to call it.