Corpus & Computational Linguistics program suggestions

Hello,

I am new to this forum and looking for recommendations.

In 2010, I completed six years of fieldwork on Shangan Makhuwa, a dialect of Makhuwa spoken on the northeast coast of Mozambique. Since then, I have experienced many personal obstacles and interruptions, but I am now free to return to my work.

In addition to producing a dictionary and grammar of the Shangan dialect, I have also produced over 380 hours of transcribed data, with the aid of six field assistants. I believe the best way to analyze all of this will be via the fields of corpus and computational linguistics. However, I have never worked in these fields before, so I thought I would ask whether anyone here could recommend a program or programs that I could set up to process data on a little-studied language. It will be an evolving methodology, looking into such things as collocations, topic modeling, and semantic fields, so I would need something fairly open and flexible. I will also continue adding to my dictionary and grammar as I work through the transcripts. All of the transcripts were derived from high-quality WAV files, so I could also go back and reanalyze for, say, greater phonetic detail or time alignment, if necessary.

So far, I have come across this book:

“Natural Language Processing and Computational Linguistics: A practical guide to text analysis with Python, Gensim, spaCy and Keras,” by Bhargav Srinivasa-Desikan.

As well as the program AntConc.

If anyone has any suggestions along these lines, or can recommend someone with experience in using these kinds of programs, that’d be greatly appreciated! It would be nice to be confident of my options before committing much more time to them.

Thank you for your time and attention!
Erik


Hi, welcome to the forum. I’m not an NLP person myself, although there are several here.

One resource you might be interested in is NLTK, which has been around for a while and has a nice companion book that's free:

https://www.nltk.org/

I believe the library (it’s in Python) is still actively maintained, and if I recall correctly it includes modules for things like collocations.
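Just to give you a flavor, a basic collocation search in NLTK takes only a few lines. This is a minimal sketch from memory, not something I've tested on real field data, and the file name is just a placeholder:

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Read one transcript; "transcript.txt" is a placeholder file name.
# A plain whitespace split avoids English-specific tokenizers, which
# may be the safer default for Makhuwa text anyway.
with open("transcript.txt", encoding="utf-8") as f:
    tokens = f.read().lower().split()

# Collect bigram collocations, keep only pairs seen at least 3 times,
# and rank them by pointwise mutual information (PMI).
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)
print(finder.nbest(bigram_measures.pmi, 20))
```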

Congrats on all the work you and your colleagues have done; it sounds amazing. Feel free to share more about the research here if you like.


Thanks Pat, I appreciate your input and encouragement!

After a few more days looking into options, I’m thinking a programming newbie like me might be best off using a program that’s already set up with a user interface, like AntConc. I’ve also just heard of Sketch Engine. The latter appears to rely on online use and storage. As an older guy, I don’t know if I like the idea of doing my work online instead of on my desktop. Does anyone have an opinion about these or other programs? I came across this summary: The Best Free Discourse Analysis Tools - Speak Ai, which doesn’t really give enough insight to base a decision on. Looks like I’d have to try each one out a bit to see which makes the most intuitive sense to me.

If I go this route and start with one or more of the user-friendly options, then once I learn the basics and try some of the common NLP features on my data, I could dedicate more time to learning to code if I find I need features with greater refinement, such as those offered by NLTK. In any case, it seems the consensus is that Python is the easiest language to base a more tech-savvy approach on, right?

Over the weekend I worked my way through about half of the book I mentioned above, just to get a sense of what goes into these tools. The author uses Python, Gensim, spaCy, and Keras. Do you think it would be advantageous to familiarize myself with a more complete package like NLTK right from the start?
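For instance, if I’m reading the book right, the core of a topic model in Gensim comes down to just a few steps. Here’s a rough sketch from my notes, with made-up toy texts standing in for my transcripts, so take it with a grain of salt:

```python
from gensim import corpora
from gensim.models import LdaModel

# Toy stand-ins for tokenized transcript segments.
texts = [
    ["chief", "ruled", "border", "region"],
    ["chief", "sent", "messenger", "mountains"],
    ["rain", "thundered", "mountains", "border"],
]

# Map each word to an integer id, then convert each text to a bag of words.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Fit a small LDA topic model and print the top words for each topic.
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```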

As a new user on this forum, I’m not allowed to upload documents, but if anyone is interested, I could email them a dissertation prospectus from 13 years ago(!), in case it helps people here get a better sense of what sorts of tools/programs would be of most use to me.

Here’s a snippet of transcribed data as well:

Ph’aamantari ukhuma mmwaani’mmo’mmo Mwiinanenu waHuNnansure, … (4s) vano Hu…HuNnansure t’aamantari mpaka uYookola, mmakhuwani ţo mmyaakoni phw’aamantari’nnye

It was he who ruled from that border region, under régulo (chief) Nansure, … (4s) then chie…chief Nansure was the one who ruled as far as Yocola; first, in the interior, in the mountains, he was the one in command.

Thanks for your time!

Erik


Hi Erik,

I am one of the NLP people on this forum.

Unfortunately, I don’t know a better way to find good software for what you want to do than to try each tool out. I haven’t worked much in corpus linguistics for endangered languages, so I can’t give specific recommendations.

You could also try joining and posting your question to the Special Interest Group for Endangered Languages that is part of the Association for Computational Linguistics (SIGEL). We would like to see more non-computational folks making use of the group to their advantage (disclaimer: I’m the current president of the group).

Related to that, SIGEL has been trying to put together a “shared activity” that would bring together community members, documentary/descriptive linguists, and computational linguists, and foster their collaboration on some concrete outcome that would benefit all three communities. Building a corpus from field data has been brought up by non-computational folks more than once; corpus linguistics needs annotated data, and NLP depends on having lots of the same kind of data available. So this seems like a good candidate for the shared-activity events. We will eventually need a diverse organizing committee of people who want to see this happen, which is another reason to join SIGEL!

Thank you for the reply, Sarah! I have just submitted my SIGEL membership application and look forward to interacting with the group! I haven’t seriously thought about annotation; I could do it for a few select texts, but with over 380 hours of transcriptions, I don’t know if it would be possible to cover most of it. I was planning on using tools like concordancing, topic modeling, keywords, and cluster analysis, working with lemmas, since my dictionary is organized by lemma. For example (some of the formatting didn’t come through):

-RUM- U[15]- uruma to send, to order someone, to entrust with a task; to govern || to ring, to hear a sound, to produce a noise or sound (bell, clock): yaruma oitora when it strikes eight o’clock; inooruma itari it thundered; inaaruma itari it thunders
YO-/ZO- dvrb adn nom yoóruma order, commandment
-RUMI MU-/A- nrumi, arumi servant, serf, subject
-RUM-EL- U[15]- urumela to serve, to obey, to use something for some service
-RUM-EL-O NI-/MA- nirumelo, marumelo order, command
-RUM-EL-EL- U[15]- urumelela to answer
-RUM-EL-IH- U[15]- urumeliha to govern, to rule
RUM-Y-A MU-/A- nrumya, arumya messenger, servant, trusted envoy, apostle var. rumimaya
-RUM-(M)W- U[15]- urummwa to busy oneself on someone’s orders
redup rummwarummwa to grow restless; to busy oneself on someone’s orders
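Further down the road, I imagine even a lemma-based concordance could be scripted once I map surface forms back to roots like these. Here’s a rough sketch of what I have in mind, purely hypothetical: the lemma table, token list, and function are all made up, and a real version would be generated from my full dictionary:

```python
# Hypothetical mapping from Makhuwa surface forms to dictionary roots;
# a real table would be generated from the full lemma-organized dictionary.
LEMMA_TABLE = {
    "uruma": "-RUM-",
    "urumela": "-RUM-EL-",
    "nrumya": "RUM-Y-A",
}

def kwic(tokens, lemma_table, target_root, window=4):
    """Print keyword-in-context lines for tokens whose root matches."""
    for i, tok in enumerate(tokens):
        if lemma_table.get(tok.lower()) == target_root:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            print(f"{left:>35}  [{tok}]  {right}")

# Toy usage with an invented token list.
kwic("vano uruma mpaka uYookola".split(), LEMMA_TABLE, "-RUM-")
```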