Presenting the Wiktionary Data Preparer (WDP)

Recently, @lgessler and I worked with @Hilaria to upload her corpus data and audio recordings for the San Juan Quiahije variety of Chatino to Wiktionary (this post). After that, I did another project to upload Henrik Liljegren’s dictionary of the Palula language of Pakistan (~3k entries) to Wiktionary.

We’ve since realised that uploading a lexicon to Wiktionary is a hassle for those who do not know the strict formatting standards. So, we have been working on a Python module that can handle all the formatting for you as long as you can parse your data into a structured format!

To rehash for those who are unfamiliar, I am an admin on the English Wiktionary, which serves as a collaborative multilingual dictionary built on the model of Wikipedia. I’m interested in making our language documentation data more accessible to the public through Wiktionary.

Presenting: the Wiktionary Data Preparer!

WDP is a module for parsing structured data from a lexicon and preparing it for upload to Wiktionary. Most of the code was written by @lgessler with minor direction and edits by me. We’ve linked to the GitHub repo above so you can see the code for yourself.

In WDP, we treat each entry in a lexicon as a Word object. WDP Words have convenient methods for adding the kind of information that makes up a Wiktionary entry. Here’s a minimal example:

from wdp import Word, format_entries, export_words

# use the Word class to represent our words
apple = Word("apple")
apple.add_pronunciation("/ˈæp.əl/", notation="IPA")
apple.add_definition("A common, round fruit", "Noun")
apple.add_definition("A tree of the genus Malus", "Noun")
apple.set_etymology("Old English æppel < Proto-Germanic *ap(a)laz < PIE *ab(e)l-")

# put all our words in a list
wdp_words = [apple, ...]

# Generate Wiktionary markup from our entries
formatted_entries = format_entries(wdp_words, "en", "English")

# Perform the upload
from wdp.upload import upload_formatted_entries
upload_formatted_entries(formatted_entries, "English")

Here’s what an entry looks like once uploaded (the last line of code above performs the upload):

You can read the docs for WDP and go ahead and download it to test! It’s on pip:

$ pip install wdp

Testing WDP

I’ve been doing language documentation of Kholosi since last semester. Kholosi is a marginalised Indo-Aryan language of Iran, spoken mainly in two villages (Kholos and Gotaw), with many more speakers distributed throughout Iran; speakers number in the thousands. I’m currently working on a submission to JIPA for that :slight_smile: (And as an aside, you can see some of my data here.)

I collected a lexicon of ~400 words in the course of my elicitation. It lived in Google Sheets, and I wanted to upload it to Wiktionary, so of course I thought to use WDP as a demonstration.


The CSV lexicon.

After exporting the lexicon to an easily machine-readable CSV file, I wrote a short Python script that parses my data. You can see that code here! In only ~100 lines (written in a single evening), I was ready to upload data that would have taken days to add to Wiktionary manually. The upload itself took only a minute.
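
For a flavour of what that script does, here’s a trimmed-down sketch (the column names and the language code are placeholders rather than my actual spreadsheet headers):

import csv
from wdp import Word, format_entries

words = []
with open("kholosi_lexicon.csv", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # hypothetical columns: headword, ipa, pos, gloss, etymology
        word = Word(row["headword"])
        if row.get("ipa"):
            word.add_pronunciation(row["ipa"], notation="IPA")
        word.add_definition(row["gloss"], row["pos"])
        if row.get("etymology"):
            word.set_etymology(row["etymology"])
        words.append(word)

# "xx" and "Kholosi" stand in for whatever code and name Wiktionary uses
formatted_entries = format_entries(words, "xx", "Kholosi")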

For testing purposes, I only uploaded to my userspace on Wiktionary, not the main Wiktionary space. You can see the results in User:AryamanBot’s edit history!

The only extra step needed was a user-config.py in the same directory. The Python interface for wikis (pywikibot) has some special guidelines for those; you can see the code for that here. You also need a bot account on Wiktionary, approved through a public vote, if you want to run the upload yourself. The easier way is to export a zip using wdp.export_words() and send it to one of us to run the upload.
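
For reference, the user-config.py doesn’t need much; a minimal version looks something like this (the bot name is just an example):

# user-config.py (minimal sketch; see the pywikibot docs for the full set of options)
family = 'wiktionary'
mylang = 'en'
usernames['wiktionary']['en'] = 'AryamanBot'  # your approved bot account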

Future steps

There are still many features to add to WDP. Some of what we have in mind:

  • Automatic linking of English words in definition lines using a lemmatiser (a rough sketch of the idea follows after this list).
  • Handling of semantic relations (synonyms/antonyms/etc.).
  • Better support for inflectional data in the form of pretty tables.
  • Code-free upload from FLEx Dictionary XML or other popular formats.
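
On the first point, here’s a naive sketch of what lemmatiser-based linking could look like (WDP doesn’t do this yet; the stopword list and the choice of lemmatiser below are purely illustrative):

import re
from nltk.stem import WordNetLemmatizer  # needs the WordNet data downloaded

lemmatiser = WordNetLemmatizer()
STOPWORDS = {"a", "an", "the", "of", "or", "and"}

def link_definition(defn):
    """Wrap content words of an English definition line in [[...]] wikilinks."""
    def link(match):
        surface = match.group(0)
        if surface.lower() in STOPWORDS:
            return surface
        lemma = lemmatiser.lemmatize(surface.lower())
        # link the surface form, pointing at the lemma when they differ
        if lemma == surface.lower():
            return "[[%s]]" % surface
        return "[[%s|%s]]" % (lemma, surface)
    return re.sub(r"[A-Za-z]+", link, defn)

link_definition("A common, round fruit")
# -> 'A [[common]], [[round]] [[fruit]]'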

So, what do you think? Feel free to suggest any useful features or other thoughts! And we’re happy to collaborate with you if you want to contribute to the codebase, or have a lexicon to upload to Wiktionary :smile:


Been looking through all this coolness, thanks for the great writeup!

So this enables a much faster path to getting lexical material into Wiktionary for someone in the situation that @Hilaria or Henrik was in. Very awesome. Getting working wikitext out of a lexicon is no small problem, and making that path easier really matters.

Random thought: I wonder if it would be possible to do a JavaScript version of this for the case where someone is working with a smaller amount of data and just wants a doohickey to generate wikitext for an entry or two.

The entries look great. The way you can drill down via semantic domains is cool. Is there a way to see everything at once?

& slightly off the topic of WDP itself…

I just love the fact that these interfaces are built from data that you’re actively working on:

https://aryamanarora.github.io/kholosi/dictionary.html

https://aryamanarora.github.io/kholosi/sentences.html

This markup is :+1:. Each word gets its own node, so lines can wrap correctly.

<div class="card-text gloss--glossed" data-gloss="1">
  <div class="gloss__words">
    <div class="gloss__word">
      <p class="gloss__line gloss__line--0">deh</p>
      <p class="gloss__line gloss__line--1">village</p>
    </div>
    <div class="gloss__word">
      <p class="gloss__line gloss__line--0">sonji</p>
      <p class="gloss__line gloss__line--1"><abbr class="gloss__abbr" title="first person">1</abbr><abbr
          class="gloss__abbr" title="plural">PL</abbr>.<abbr class="gloss__abbr" title="genitive">GEN</abbr></p>
    </div>
    <div class="gloss__word">
      <p class="gloss__line gloss__line--0">mô</p>
      <p class="gloss__line gloss__line--1"><abbr class="gloss__abbr" title="locative">LOC</abbr></p>
    </div>
    <div class="gloss__word">
      <p class="gloss__line gloss__line--0">kozoro</p>
      <p class="gloss__line gloss__line--1">man</p>
    </div>
    <div class="gloss__word">
      <p class="gloss__line gloss__line--0">yu</p>
      <p class="gloss__line gloss__line--1">be.<abbr class="gloss__abbr" title="past">PST</abbr></p>
    </div>
    <div class="gloss__word">
      <p class="gloss__line gloss__line--0">jo</p>
      <p class="gloss__line gloss__line--1"><abbr class="gloss__abbr" title="relative">REL</abbr></p>
    </div>
    <div class="gloss__word">
      <p class="gloss__line gloss__line--0">nôyos</p>
      <p class="gloss__line gloss__line--1">name.?</p>
    </div>
    <div class="gloss__word">
      <p class="gloss__line gloss__line--0">ali</p>
      <p class="gloss__line gloss__line--1">Ali</p>
    </div>
    <div class="gloss__word">
      <p class="gloss__line gloss__line--0">yu</p>
      <p class="gloss__line gloss__line--1">be.<abbr class="gloss__abbr" title="past">PST</abbr></p>
    </div>
  </div>
  <p class="gloss__line--hidden">deh sonji mô kozoro yu jo nôyos ali yu</p>
  <p class="gloss__line--hidden">village 1PL.GEN LOC man be.PST REL name.? Ali be.PST</p>
  <p class="gloss__line--free gloss__line gloss__line--2">In our village there was a man whose name was Ali.</p>
</div>

How do you like using leipzig.js for handling interlinears? (That seems to be what’s being used here?)

(Though I must admit I don’t love the column-count: 3 CSS rule on the sentences page; it makes it hard to read! :grimacing: )

Anyway, congrats on this, I hope Wiktionary gets stuffed full of more language!


Ooooo this is awesome! I think a decent way to handle synonyms might be to query an API, either wikt itself or maybe another dictionary site like thefreedictionary or dictionary dot com! One thing to note is that some languages have such a consistent phoneme-grapheme correspondence that you don’t need to supply a transcription for the pronunciation, just the word itself. So I’m wondering if this could handle that, and other fields that can be filled in automatically. Spanish entries do this with pronunciations and links to the RAE (you just put a “give a reference” template in the references section and it’ll pop in a link to the RAE page).
Also, choosing which words to link in the definitions would be an interesting challenge. Malus in that example needs to be italicized to begin with (which I’m guessing can be handled in the user’s data), but it should link to Wikipedia or Wikispecies. Definition (1) probably wouldn’t want “round” linked, but yes to “fruit”. I don’t have a solution, but that’s a fun conundrum. Etymologies, on the other hand, want every etymon linked, because having red links is perfectly fine there: you can literally just say apple.set_etymology("Old English [[æppel|lang=oe]] < Proto-Germanic [[*ap(a)laz]]…"). I realize my linking syntax might be off, but I hope the idea is coming across? Like, linking can be left up to the user to an extent for definitions, but should/could be fairly necessary for etymology. I actually might be able to contribute to this repo honestly. Would be fun~
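
Something like this is the shape of what I have in mind for the auto-filled pronunciations (the orthography mapping here is completely made up, just to show the idea):

from wdp import Word

# invented grapheme-to-IPA mapping, purely for illustration
GRAPHEME_TO_IPA = {"a": "a", "e": "e", "i": "i", "o": "o", "u": "u",
                   "c": "t͡ʃ", "j": "d͡ʒ", "x": "ʃ", "'": "ʔ"}

def auto_transcribe(headword):
    return "/" + "".join(GRAPHEME_TO_IPA.get(ch, ch) for ch in headword.lower()) + "/"

w = Word("coxa")  # hypothetical headword
w.add_pronunciation(auto_transcribe("coxa"), notation="IPA")  # -> /t͡ʃoʃa/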


Yup, all of those extensions you propose (synonym retrieval, autotranscription, linking) would help a lot! Please do feel free to hack on the repo and send a PR :slight_smile:

For some of these things, there might already be solutions in place. For instance I know @aryaman has worked on a transliteration script for Devanagari (or at least, Hindi in Devanagari) that runs on Wiktionary itself. But an advantage of doing it inside wdp would be that you’d be able to preview changes before you actually make them. And having it in wdp could make it more discoverable for users. Many design decisions!


Interesting work and very empowering - does it have a reverse button for pulling this data down from Wiktionary for use in analysis?


That’s a great question, Zara! I’ll go ahead and tag @lgessler to make sure he sees it. Happy to see you here!


That’s an interesting idea—much of the low-level work that would go into functionality like that has already been done (specifically by the pywikibot package), so implementing this wouldn’t be terribly difficult. This is definitely something we can consider for the future if there’s demand for it. Tagging @aryaman

edit: by the way, if anyone who knows Python is interested in jumping in and taking this on, I’d be happy to advise you on how to approach it!
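
To give a sense of how little glue is needed, grabbing the raw wikitext of an entry with pywikibot is only a few lines (the hard part is parsing that wikitext into something structured):

import pywikibot

site = pywikibot.Site("en", "wiktionary")
page = pywikibot.Page(site, "apple")
wikitext = page.text  # raw wikitext of the whole entry, every language section included
print(wikitext[:200])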


The reverse is handled by a couple of Python modules with varying degrees of success; I’ve used wiktionaryparser before and it was decent, but the output is not well structured. I think there’s definitely room for a good general-purpose Wiktionary data-extraction module, given how much Wiktionary is used for NLP purposes (and it could build upon more focused projects like WikiPron and UniMorph, which extract structured data from Wiktionary).
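
For anyone curious, basic usage of wiktionaryparser looks roughly like this (from memory, so the exact output keys may differ slightly):

from wiktionaryparser import WiktionaryParser

parser = WiktionaryParser()
entry = parser.fetch("apple", "english")  # returns a list of dicts
print(entry[0]["etymology"])
print(entry[0]["definitions"][0]["partOfSpeech"])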


Several years ago, I attempted to extract some dictionaries from Wiktionary into a form that could be used for downstream processing. I basically failed; aryaman alludes to the same difficulty.

The problem is that Wiktionary markup is (like HTML) a visual description: it says how the page should look, rather than being a content format that says what the different pieces mean. So it’s hard for a computer to extract the content: headword, part of speech, variant forms with their inflectional features, synonyms, etymology, senses, and so forth. Much better would be a content-based system from which a display could be derived.

There are, of course, content-based formats for dictionaries, like ISO’s LMF or SIL’s LIFT.
