🔨 A little project: Extracting interlinears from Wikipedia

I did a little procrasti-project today… you know, when you have stuff you must do, and your :brain: says “:no_entry_sign:! Don’t do that sensible thing!”, and you end up putting your mental energies into something else… y’all know what I’m talking about. Might as well make something of it. :dizzy_face:

I’m sure you all spend a fair amount of time looking at linguistics and language stuff on Wikipedia. It occurred to me that there’s actually a good deal of fairly well-structured interlinear text there. Wouldn’t it be cool to try to get it all into some kind of queryable form? So I had to start looking into it.

First thing was just to poke around and try to find some. I happened to be looking at the article on the language of the Shawnee people, also called Shawnee, a Central Algonquian language:

There are maybe 30 interlinears on that page — not a huge database, but not nothing.

Should we scrape the HTML?

Scraping is the programming word for writing a program that picks apart the bits and pieces of an HTML page. Scraping can be painstaking or straightforward, depending on how the HTML is written.
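In code terms, scraping usually boils down to parsing the page and pulling text out of selected elements. A minimal sketch of what that could look like here (note that the .interlinear selector is made up; as we’re about to see, the real page gives us nothing that convenient):

// Minimal scraping sketch. Run it from a wikipedia.org tab so the
// fetch isn't blocked by CORS. The '.interlinear' class is
// hypothetical: the real markup has no such hook.
const res = await fetch('https://en.wikipedia.org/wiki/Shawnee_language');
const doc = new DOMParser().parseFromString(await res.text(), 'text/html');
const examples = [...doc.querySelectorAll('.interlinear')]
  .map(el => el.textContent.trim());
console.log(examples);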

So, let’s just pick a random bit of the page which has some significant interlinear content:

The way to get a feel for what the markup looks like for this particular bit is to “inspect” the markup. We used to use a feature called “Viewing the source”, but “Inspecting the source” is a much better way to learn about how HTML works in practice. (Here’s a tutorial on how to do that: How to try Javascript now.)

…But in this case, the markup is not super informative — you can see in the screenshot below that I’m inspecting example (1), but all the tags are just <div>s, without any classes to distinguish how the levels are set up. That’s going to be a royal pain to parse. So let’s… just not try to parse the HTML.

Wikipedia is made of wikitext

Oh boy. Wikitext. Love it, hate it… it will never go away, because the biggest human-written anything ever, Wikipedia, is inextricably bound up in its complexities. Wikitext is what you see when you edit an article. It is not… well, it’s fugly.

The pattern goes like this:

[graphviz]
strict digraph {
  rankdir=LR
  Wikitext [shape="polygon" style="filled"]
  Parser [shape="polygon" style="filled"]
  HTML [shape="polygon" style="filled"]
  Wikitext -> Parser -> HTML
}
[/graphviz]

For most practical purposes, the Parser that converts Wikitext into HTML is a black box. It’s like an evolved organism. It’s bonkers. But the input to that parser is at least kind of understandable, if you can control your rage at how bewildering it is. So rather than inspecting the HTML, let’s see what the wikitext behind the “this stranger…” example looks like. How do you do that? Well, you click edit. It’s easier to just edit a single section at a time, so we’ll click the edit link next to Demonstrative pronouns:

Which gets us:

===Demonstrative pronouns===
Refer to the examples below. 'Yaama' meaning 'this' in examples 1 and 2 refers to someone in front of the speaker. The repetition of 'yaama' in example 1 emphasizes the location of the referent in the immediate presence of the speaker.

{{interlinear|number=(1)|glossing=no abbr
|yaama- kookwe- nee -θa -yaama
|this- strange- appearing -PERSON -this
|'this stranger (the one right in front of me)'}}

{{interlinear|number=(2)|glossing=no abbr
|mata- yaama- ha'- pa-skoolii -wi   ni-oosθe' -θa
|not this TIME- go-school -AI       1-grandchild -PERSON
|'this grandchild of mine does not go to school'}}

…etc…

Okay, that’s not too bonkers. At least we can see that there are chunks, and each interlinear begins and ends with double curly brackets. And the first line looks like this:

{{interlinear|number=(1)|glossing=no abbr

Okay, so it starts with {{interlinear, makes sense. In fact, this is what’s called a Template in Wikipedia parlance. If I understand correctly (correct me @sunny!), a template is basically a kind of syntax for indicating that the content “within” should be transformed before being handed to the parser. So it goes like this:

[graphviz]
strict digraph {
  rankdir=LR
  WikitextWithTemplates [shape="polygon" style="filled"]
  Parser [shape="polygon" style="filled"]
  Wikitext [shape="polygon" style="filled"]
  HTML [shape="polygon" style="filled"]
  WikitextWithTemplates -> Parser -> Wikitext -> Parser -> HTML
}
[/graphviz]

Or, I mean, I dunno if that’s exactly how it goes down; the point is, by the time you see the rendered HTML page, the template has been transformed.
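Either way, the practical upshot for us: in the raw wikitext, before any of that machinery runs, the templates are sitting right there as plain text, which makes them easy to spot. A toy sketch of the core idea (the real extraction code is further down):

// Toy sketch: pull {{interlinear ...}} chunks out of raw wikitext.
// The lazy match stops at the first `}}`, so a nested template
// inside (like {{gcl|...}}) will cut it short. That's one reason
// the real extractor below takes a line-oriented approach instead.
const findInterlinears = wikitext =>
  wikitext.match(/\{\{interlinear[\s\S]*?\}\}/g) ?? [];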

The interlinear template

In point of fact, Wikipedia’s interlinear template is powerful. Real powerful. It can do an awful lot of stuff. Which means the “syntax” of the interlinear template gets a little hairy in its own right.

But just check out the documentation for the interlinear template:

Wouldja look at that. Pretty glosses! Small caps! You can add your own abbreviations. It lines up the words right! There is numbering! You can tweak stuff! It is, in short, pretty impressive.

But I don’t want all the other stuff, just the templates

So, maybe we just parse the wikitext and slurp out all the instances of the {{interlinear}} template? Well, yeah. That’s what I did. I wrote a little app, and stuck it here:


So what you do is, you paste all the wikitext from a page into the left panel, and it tries to slurp out the interlinear templates and turn them into plain ol’ text in the right panel.

Like this:

The “extraction” code tries to fix a few things, but I know for a fact that it does some bad things. But I’ve found that you have to start somewhere, and it’s better to try to make something than nothing at all.

A clunky workflow (but some results)

So I ended up doing this… uh… 70 times.

  1. Search Wikipedia for pages using this query (see these docs)
  2. Open a bunch of tabs with all those articles :crazy_face:
  3. Edit each one, cut and paste the content
  4. Paste it into the extractor
  5. Cut and paste the output into a file, save that.

Like I say, clunky. I’m a little obsessive about things like this, though, once I get started. And honestly I found it kind of fun, even just glancing at all those articles. The real fun, though, will be trying to do something with them all.

How could it be better?

Well, I’m going to wrap it up for tonight, but there are lots of things that could be done:

  • Try it on Wikipedias in other languages (Are template names in the wikitext the same on Wikipedias in every language?)
  • Put this stuff in a github repo instead of cramming it in articles here.
  • Figure out how to download all the articles at once and run the extraction offline (rough sketch below)
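For that last item, I’d guess the MediaWiki API gets us most of the way there. A rough, untested sketch; the insource: query is my guess at what the search looks like, and a real run would need to handle search continuation:

// Rough sketch: find pages using {{interlinear}} via the search API,
// then grab each page's raw wikitext for offline extraction.
const API = 'https://en.wikipedia.org/w/api.php';

const searchPages = async () => {
  const params = new URLSearchParams({
    action: 'query',
    list: 'search',
    srsearch: 'insource:"{{interlinear"', // my guess at the query
    srlimit: '50',                        // first page of hits only
    format: 'json',
    origin: '*',                          // CORS, if run from a browser
  });
  const res = await fetch(`${API}?${params}`);
  const data = await res.json();
  return data.query.search.map(hit => hit.title);
};

const fetchWikitext = async title => {
  const url = `https://en.wikipedia.org/w/index.php?title=${encodeURIComponent(title)}&action=raw`;
  return (await fetch(url)).text();
};

for (const title of await searchPages()) {
  const wikitext = await fetchWikitext(title);
  // feed wikitext to the extractor and write the result to a file…
}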

So that’s really it.

When ideas for low-hanging fruit like this get into my head, I feel pretty much compelled to hack something together. I’m curious to know if anyone else finds this project interesting, or has any ideas about how to improve the workflow or make use of the output.

G’night, friends.


Yeah, this rocks! And I don’t have any important corrections on the template explanation, and I don’t think I have much in the way of productive discussion to add, just some rambling:
Like you said, there are a lot of directions for this kind of thing, but it’s a solvable problem with a variety of solutions that can all feed into different use cases. I slapped in some examples from the Austronesian alignment page and I got this:

So of course there’s stuff going wonky here, and it largely has to do with additional formatting. Just on intuition alone, I might presume it would be fairly easy to ditch that in a line or two of JS; something like the sketch below. But do we even want examples like that, with no gloss and only a free translation? idk, it depends on the end goal. But if we’re automating scraping, it’ll be important to know that many of the examples, even if they use {{template:interlinear}}, might not actually use it for morpheme-by-morpheme glossing.
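(Totally untested, and just guessing that a gloss-less example shows up as a template with fewer |-initial tiers:)

// Untested guess: a morpheme-glossed example has at least three
// |-initial tiers (source, gloss, free translation), while a
// translation-only example has fewer. Filter on that:
const hasMorphemeGloss = block =>
  block.split('\n').filter(line => line.trim().startsWith('|')).length >= 3;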
I actually only recently realized it was even a thing! Here’s me getting excited about it in late 2020: x.com (and Ryan Sullivant chimes in as a treat!)

So yeah, a neat thing about this template is that you have these predefined Leipzig keywords that are recognized and explained in the hover text (e.g. over OBJ you’d see “Object(ive)”), as in the article discussed in the tweet:


But also, as I mention in that tweet, it’s not super duper widespread in its implementation quite yet, and it takes some (albeit minor) finessing to get more niche glosses to have the hover text. It’s a fun extra feature :slight_smile:


I just want to know why your Wikipedia screenshots look like a De Gruyter publication when mine appear in Times New Roman.


You can modify your default typeface by editing

https://meta.wikimedia.org/wiki/User:<yourusername>/global.css

Mine contains this:

@import url(https://fonts.googleapis.com/css?family=Fira+Sans:300,300italic,400,400italic,500,500italic,700,700italic&subset=cyrillic,cyrillic-ext,greek,latin,latin-ext);

p, table, td, th, h1, h2, h3, h4, h5, h6, blockquote, a, em, strong {
	font-family: 'Fira Sans' !important;
}

I am a fan of Fira Sans; I find it very readable online (I wonder if @skalyan likes it :stuck_out_tongue:).


Yep. The output is currently messy for sure. And like you I didn’t even realize that people were going to cram non-interlinears into the {{interlinear}} template. People do all the things when technology allows for it… :man_shrugging:

The module that actually does the text processing is super block-headed… it’s one function:

export let extractInterlinears = mediawiki => mediawiki
  // make sure we break before the template begins
  .replaceAll(`{{interlinear`, `\n{{interlinear`)
  // this mess deals with the custom abbreviations
  // see https://en.wikipedia.org/wiki/Template:Interlinear#Glossing_abbreviations
  .replaceAll(/\{\{gcl\|([^|]+)\|(.*?)\}\}/g, `$1`) // custom gloss label
  // now split the wikitext into blocks
  .split`\n\n`
  // and keep only the ones containing {{interlinear
  .filter(x => x.includes('{{interlinear'))
  // these are often morphological examples (derivation and such), remove them
  .filter(x => !x.includes('→'))
  // now back to a big string so we can do crazy find-and-replaces
  .join('\n\n')
  // strip stray HTML markup
  .replaceAll('<u>', '')
  .replaceAll('</u>', '')
  .replaceAll('<sub>', '')
  .replaceAll('</sub>', '')
  .replaceAll(/\|top=/g, '\n')
  .replaceAll(`<nowiki>`, '')
  .replaceAll(`</nowiki>`, '')
  // drop the template header lines themselves
  .split`\n`
  .map(line => line.trim())
  .filter(l => !l.includes('interlinear'))
  .join('\n')
  // and finally strip the leftover wikitext syntax
  .replaceAll('|', '')
  .replaceAll(`'''`, '')
  .replaceAll('{', '')
  .replaceAll('}', '')
  .replaceAll(/<ref.*/g, '')

I mean, bleh. But it does something.
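If anyone wants to poke at it, usage is just this (the module path is hypothetical; point the import wherever you saved the function):

import { extractInterlinears } from './extract.js'; // hypothetical path

const wikitext = '…paste a whole article of wikitext here…';
console.log(extractInterlinears(wikitext));
// prints the plain-text interlinears as blank-line-separated blocks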

Yep! In case anyone is interested, this is done by this kind of syntax inside the {{interlinear}} template:

{{gcl|OBJ|Object(ive)}}

I made a couple of examples on a scratch page under my profile (In Slightly Outer Zborkizian) here.

Interestingly, I happened across several instances of interlinears on Wikipedia where an unknown-label error message was mucking up pages. It’s pretty easy to fix if you track down the abbreviation expansions in the sources. Here’s a diff where I did a couple on Sabanê.

Oh dang, this was fun:

(I moved it to this better name.)

Managed to get two out of three done:

:negative_squared_cross_mark: Put this stuff in a github repo instead of cramming it in articles here.
:negative_squared_cross_mark: Figure out how to download all the articles at once and run the extraction offline
:white_large_square: Try it on Wikipedias in other languages (Are template names in the wikitext the same on Wikipedias in every language?)

The Github link above will take you to a directory containing 518 (:exclamation:) extracted files.

Bug reports welcome!

I do! Not as much as I like Noto Sans, but it definitely looks print-quality (as @joeylovestrand observes).

I’ve been using Fira Code in RStudio since forever, because of the programming ligatures.

P.S. Happy birthday, @pathall!
