Comparative Pahoturi River Website

katelynnlindsey · May 20, 2020, 3:47pm

I have an idea for an interactive website, but I don’t have any idea how to implement it. Maybe y’all might have done something similar or have some ideas?

One common interest that I have with many community members in Pahoturi River area is in comparative phonology of the six languages. To that end, I have collected approximately 400 words in nine different dialects, transcribed these words, and extracted individual audio files.

What I have right now is an Excel sheet that has the word by column and the variety by row and my transcription in IPA in the cells. I also have a folder of .wav files that are all labelled Variety_word.wav.

I would like to have an interactive website that allows you to filter by variety/word/IPA symbol, and let you click on the word like a button so that you can hear the audio file. I would like this to be automatically updateable as I add to the excel sheet and add to the folder of .wav files.

It seems both very simple to me and beyond my capabilities! Can this be done in HTML?
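A minimal sketch of the "filter by IPA symbol" idea described above, in Python rather than HTML. The function name `forms_containing` and the in-memory row format are assumptions made for illustration; the sample forms are borrowed from the data shared later in this thread.

```python
import unicodedata

# Hypothetical in-memory rows: one dict per gloss, one key per variety.
ROWS = [
    {"English": "head", "Ende_Kinkin_MGG_KL": "bun", "Taeme_Kinkin_MGG_KL": "bɪn"},
    {"English": "nose", "Ende_Kinkin_MGG_KL": "mɪɽɪŋ", "Taeme_Kinkin_MGG_KL": "wɪɖ"},
]
VARIETIES = ["Ende_Kinkin_MGG_KL", "Taeme_Kinkin_MGG_KL"]


def forms_containing(rows, varieties, symbol):
    """Return (gloss, variety, form) for every form that contains `symbol`."""
    # Normalize both sides so precomposed and decomposed characters compare equal.
    symbol = unicodedata.normalize("NFC", symbol)
    hits = []
    for row in rows:
        for variety in varieties:
            form = unicodedata.normalize("NFC", row[variety])
            if symbol in form:
                hits.append((row["English"], variety, row[variety]))
    return hits


print(forms_containing(ROWS, VARIETIES, "ɽ"))
```

This is a plain substring match, so searching for "t" would also find forms containing an affricate such as t͡ʃ. A real filter would probably need to split transcriptions into segments first.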

pathall · May 20, 2020, 4:46pm

Oh yes, it can be done! And it’s not beyond your capabilities, it’s just beyond your present skill set.

You have a very clearly structured data set, so building at least the basics of what you want is going to be pretty easy.

Could you draw a picture of the basic layout you have in mind? Then take a pic and upload it here? (There’s an upload button that has an up arrow.)

Next steps will be to work through your data model (which sounds like it is already pretty complete), and then figure out a repeatable way to import your data.

Once we have a basic rendering working, we can move on to search.

If we play our cards right then all the Mixtecanists in this virtual neighborhood might be interested in trying something similar! (But we can make each project look distinct using CSS.)

katelynnlindsey · May 20, 2020, 5:05pm

Thanks @pathall!

This website (https://soundcomparisons.com/#/en/Englishes/languagesXwords/Lgs_Sln/Wds_Sln) is very cool. The tab Languages x Words is what I was originally thinking as a first step.

Kate

katelynnlindsey · May 20, 2020, 10:16pm

What kind of comparative data do you have for Mixtec?

pathall · May 20, 2020, 10:41pm

I myself don’t really have much to speak of. I just work here! There are, however, lots of people on this site who work on and/or speak Mixtec variants, and a ton of work has been going on at UCSB, for instance. It was just an idea.

Anyway, back to the Pahoturi languages: one way we could proceed would be if you shared data for a few cognates. (Four or five, say.) Then we could work on something that would load that data and render it as a table like the one you linked to.
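One possible shape for the "render it as a table" step, sketched in Python that writes out HTML. Everything here is an assumption for illustration: the function name `render_table`, the `audio` folder, and the idea that each cell becomes a button that plays a file named Variety_word.wav.

```python
from html import escape
from urllib.parse import quote


def render_table(varieties, rows, audio_dir="audio"):
    """Build an HTML table: one row per gloss, one column per variety."""
    head = "".join(f"<th>{escape(v)}</th>" for v in varieties)
    out = ["<table>", f"<tr><th>English</th>{head}</tr>"]
    for row in rows:
        gloss = row["English"]
        cells = []
        for v in varieties:
            # Hypothetical path layout: <audio_dir>/<Variety>_<word>.wav
            src = f"{audio_dir}/{quote(v + '_' + gloss + '.wav')}"
            # The button reads its own data-src attribute and plays that file.
            cells.append(
                f'<td><button data-src="{escape(src)}" '
                f'onclick="new Audio(this.dataset.src).play()">'
                f"{escape(row[v])}</button></td>"
            )
        out.append(f"<tr><th>{escape(gloss)}</th>{''.join(cells)}</tr>")
    out.append("</table>")
    return "\n".join(out)


page = render_table(
    ["Ende_Kinkin_MGG_KL"],
    [{"English": "head", "Ende_Kinkin_MGG_KL": "bun"}],
)
print(page)
```

Regenerating the page from the spreadsheet each time it changes is one simple way to get the "automatically updateable" behavior without a server-side database.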

Also, I just found your Paradisec deposit on Ende and other Pahoturi languages. It’s amazing!

katelynnlindsey · May 20, 2020, 11:11pm

Here’s an example (can be .xls or tab delimited text file):

	Ende_Kinkin_MGG_KL	Taeme_Kinkin_MGG_KL	Aɡob_Kibuli_NPG_KL	Em_Bititi_SBB_KL	Kawam_Wim_GYY_KL

head YamFinder bun bɪn bun bun bun
hair (head) YamFinder kom kʷam kom kom bunkom
nose YamFinder mɪɽɪŋ wɪɖ muɽuŋ muɽuŋ muruŋ
jaw YamFinder ʈaʈ ʈaʈ teb ʈaʈ t͡ʃæt͡ʃ
mouth YamFinder bod bod ume.ʈɵp bod bod
tongue YamFinder dəɡmar dəŋmer dɵɡmer dərɡmoɽ dɪrɡmer

katelynnlindsey · May 20, 2020, 11:17pm

Hm not sure that formatted in a way you could copy that into an appropriate file.

English	Category	Ende_Kinkin_MGG_KL	Taeme_Kinkin_MGG_KL	Aɡob_Kibuli_NPG_KL	Em_Bititi_SBB_KL	Kawam_Wim_GYY_KL
head	YamFinder	bun	bɪn	bun	bun	bun
hair (head)	YamFinder	kom	kʷam	kom	kom	bunkom
nose	YamFinder	mɪɽɪŋ	wɪɖ	muɽuŋ	muɽuŋ	muruŋ
jaw	YamFinder	ʈaʈ	ʈaʈ	teb	ʈaʈ	t͡ʃæt͡ʃ
mouth	YamFinder	bod	bod	ume.ʈɵp	bod	bod
tongue	YamFinder	dəɡmar	dəŋmer	dɵɡmer	dərɡmoɽ	dɪrɡmer
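A tab-delimited table like the one above can be read with Python's standard `csv` module. This is only a sketch: the function name `load_wordlist` is made up for illustration, and it assumes that every column other than "English" and "Category" is a variety.

```python
import csv
import io

# A few rows copied from the table above, tab-delimited.
SAMPLE_TSV = (
    "English\tCategory\tEnde_Kinkin_MGG_KL\tTaeme_Kinkin_MGG_KL\n"
    "head\tYamFinder\tbun\tbɪn\n"
    "nose\tYamFinder\tmɪɽɪŋ\twɪɖ\n"
)


def load_wordlist(text):
    """Return (varieties, rows); each row maps a column name to its cell."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = list(reader)
    # Assumption: anything that is not the gloss or the category is a variety.
    varieties = [c for c in reader.fieldnames if c not in ("English", "Category")]
    return varieties, rows


varieties, rows = load_wordlist(SAMPLE_TSV)
print(varieties)
print(rows[1]["Taeme_Kinkin_MGG_KL"])
```

For a real file, the same function could be fed the contents of a text file exported from the spreadsheet and opened with `encoding="utf-8"`, so the IPA characters survive the round trip.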

katelynnlindsey · May 20, 2020, 11:19pm

Shoot, I tried to do the table the way you described but that didn’t work either. Wish I could attach a .txt file.

pathall · May 21, 2020, 1:27am

No worries! We can make this work probably. I will futz with the upload settings (my bad) tomorrow. Thank you for sharing with us and let’s get to work tomorrow!

Update: there was a tiny error in your table above that I took the liberty of fixing. Looks great now! Except for the YamFinder mystery

katelynnlindsey · May 21, 2020, 4:32pm

Fixed the table! Yamfinder makes sense now

pathall · May 21, 2020, 4:43pm

Haha great! I was in amazement of a language that had the form /jamfinder/ for so many glosses

I’ll be back later this afternoon to talk proverbial turkey.

pathall · May 21, 2020, 8:21pm

So how are the audio files named?

katelynnlindsey · May 21, 2020, 8:37pm

Agob_Kibuli_KUK_KL_canoe.wav (Variety_word.wav). I avoided putting the transcription in the file name because (a) the transcriptions change, and (b) IPA symbols.
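The Variety_word.wav convention makes it easy to go back and forth between a (variety, word) pair and a file name. A small sketch, with hypothetical function names, assuming the word part never contains an underscore (the variety part clearly does):

```python
from pathlib import Path


def audio_filename(variety, word):
    """Build a file name following the Variety_word.wav convention."""
    return f"{variety}_{word}.wav"


def split_audio_filename(filename):
    """Split a file name back into (variety, word) at the last underscore."""
    stem = Path(filename).stem  # drops the ".wav" extension
    variety, _, word = stem.rpartition("_")
    return variety, word


print(audio_filename("Agob_Kibuli_KUK_KL", "canoe"))
print(split_audio_filename("Agob_Kibuli_KUK_KL_canoe.wav"))
```

How multi-word glosses such as "hair (head)" are written in the file names is not stated in this thread, so the sketch does not try to handle them.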

pathall · May 21, 2020, 10:14pm

Indeed, makes sense, thanks.

katelynnlindsey · May 26, 2020, 7:18pm

Update: I’ve identified an RA who wants to help get this website up and running as a summer project. I’m not sure how much HTML she knows, but if we point her in the right direction, I think she could do it!

katelynnlindsey · June 7, 2021, 1:47pm

Well, it’s been a year but I’ve found new inspiration to continue this project.

The Yamfinder database that kicked off this data collection is now back up and running (http://yamfinder.com/). It’s not the exact format that I would like to see for the Pahoturi River database (e.g., Sound Comparisons...) but it would be great if I could both get my data in good shape for inclusion on the Yamfinder site and even pull from the Yamfinder site to get the data in a more useful format.
I just hired an RA for 120 hours of work to make this happen.

What I know about the Yamfinder site:

You can easily download data in a .csv file, but it doesn’t look like you can download the audio with it. I’ve written to Matt (Carroll) to see if I can get the audio pulled too.

pathall · June 7, 2021, 3:37pm

Very cool @katelynnlindsey! Was just looking at this and chatting with @meaganvigus about it. I wonder if https://www.matthewjcarroll.com/ might be interested in joining us here to talk about the project too?

katelynnlindsey · June 7, 2021, 4:08pm

I’ll send him the link

mjcarroll · June 7, 2021, 11:03pm

Hi Everybody! Thanks for inviting me to join the conversation.

Just a quick (edit: ha!) post about datasets, websites and how we serve data. After a few years working on various large data collection projects, I have become very skeptical of the usefulness of sites like soundcomparisons or the old version of Yamfinder (which preceded soundcomparisons). Not to diminish the achievements of these sites, they are beautiful sites with some amazing features.

However, I have found that once a researcher starts to really analyse the data they typically will download the data into their own workflows for the following reasons (and many others):

There are so many tools for analysing and visualising data these days that there is no point trying to replicate these on a website. You can do more with just Excel than you would ever want to include on a website, never mind the data analytics power of Python, R, Watson, SPSS, etc…
Each person has their own workflow derived from the way they think, their research questions and the types of patterns visible in the data.
Any level of automation / custom scripting will require downloading the data.

IMO this leaves these websites better suited to casual browsing of the data and to serving as public-facing points for comparative projects; consider something like the 50 words project (https://50words.online/).

The old Yamfinder site, which we started in 2012, took hundreds of hours of custom development and iteration, and in the end most people just exported the data to Excel (#facepalm). My current philosophy is that datasets are better published on places like Zenodo (https://zenodo.org/) or GitHub, where they get a DOI and you don’t have to pay for server space (unlike Yamfinder), and website databases should really just be a place to view and download the data.

I really hope this post doesn’t sound too dogmatic or arrogant. I just worry that across our discipline so many projects have spent thousands of hours and research dollars developing custom databases for each project when in 90% of cases Zenodo is a better choice.

Back to your original post:

I would like to have an interactive website that allows you to filter by variety/word/IPA symbol, and let you click on the word like a button so that you can hear the audio file. I would like this to be automatically updateable as I add to the excel sheet and add to the folder of .wav files.

In this case, you can embed sound files in excel. If you need it online, i.e. for multiple researchers, you could use google sheets or excel 365 (although it doesn’t allow embedding of sound files but you can link to files hosted on github or somewhere else).

Anyone is also welcome to use what we have done for Yamfinder for their own project. It is fairly trivial to change the data structure and the display. I would be more than happy to help; we’ll just need to double check with Wolfgang, who did most of the original coding, but I have no doubt he would be fine with that.

Sorry for the brain dump but I hope you find some of what I said useful : )

pathall · June 8, 2021, 1:17am

Hi @mjcarroll! Welcome aboard! And thanks for pointing him this way, @katelynnlindsey. Looking back, I totally implied I was going to work with you and then did nothing! Life.

Anyway, so many interesting observations here, @mjcarroll. I definitely agree on the availability of simple data being a huge plus in a project, and that all kinds of tools are useful in linguistic analysis. My opinion on software for documentation is, if it helps someone do language work of any kind — research, revitalization, pedagogy, whatever — it’s a net positive.

I don’t see it as a question of replication. There are some features of the web that essentially no other analytical tools offer: advanced layout (CSS grid, flexbox), incredible (and constantly improving) Unicode support, writing modes, and on and on.

Certainly, research patterns vary from person to person, and the web platform is not always the best home for certain kinds of research. Stats? Probably better off using R. Machine learning and stuff like that? Probably Python. And so on for several of the other tools you mention.

But those tools don’t match the accessibility of the web. Just installation alone (or cost) can be a significant barrier.

I actually would love to hear more about this history. I tried looking up Yamfinder in the Wayback Machine but couldn’t find any old versions.

Certainly neither dogmatic nor arrogant. Science needs lots of viewpoints after all. I confess I have never really dug into Zenodo, although it’s been mentioned here, here, and here; @rgriscom is a local guru on that topic.

This discussion right here… dang, this touches on so many of the issues we face as a field right now. I think the best way to start is to try to enumerate a set of desiderata — the solutions will be interrelated, but

Desiderata

Online - We want documentation to be widely available (where appropriate). Hosting is a hard problem.
Linked, playable media - it should be possible to get playback next to the transcriptions
Collaboration - several people should be able to update the content. Authentication and security are hard problems.
Searchable/filterable/interactive - Online documentation should be more useful than a print equivalent. Even beyond playable media, we want to be able to do stuff with documentation.

These things are all pretty complicated. For some problems, a shared Excel/Office 365 whatever online spreadsheet could be fine (for instance, say, historical comparison). But for making research available to a speech community, for example, or for pedagogical purposes, Excel is going to be less ideal than the kind of thing https://www.yamfinder.com/ is providing.

I hope we can continue this discussion (perhaps in a separate topic so this one can stay related to the Pahoturi River languages content), because there are many paths to meeting all these desiderata (and others). What is most important, I think, is that we embrace experimentation and variation.