Anyone here have experience with Wikidata?

I’ve just spent way too much time messing around with Wikidata.org:

https://wikidata.org

This project is… big. But it’s interesting to think about how it might be helpful for linguists.

Wikidata uses a query language called SPARQL, which has quite a learning curve. I had to follow along with several tutorials before I felt like I had any inkling of what was going on. I think the best first step is just to try running some example queries. There are plenty available under the Examples button at the top of the query interface:

https://query.wikidata.org/

  • A lot of cat pictures
  • A rather interesting dynamic dictionary (there’s a whole category of example searches having to do with “lexemes”, definitely worth more investigation; I come back to them below)
  • Another interesting tool that could be used for fieldwork: a dynamic picture dictionary. The trick here would be to figure out how to intersect this kind of query with something else (say, wildlife of a particular region); see the sketch just below this list.

And so forth.
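
As a first stab at that intersection idea, here’s an untested sketch. I’m assuming “taxon” (Q16521) and “endemic to” (P183) are the right item and property, with Nigeria (Q1033) standing in for the region:

#defaultView:ImageGrid
SELECT ?item ?itemLabel ?pic
WHERE {
    ?item wdt:P31 wd:Q16521 .   # is a taxon
    ?item wdt:P183 wd:Q1033 .   # endemic to Nigeria (swap in any region)
    ?item wdt:P18 ?pic .        # has an image
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}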

Here’s a query I cobbled together from the examples. It maps the birthplaces of female linguists in Wikidata:

#Find the birthplaces of all female linguists
#defaultView:Map
SELECT ?item ?itemLabel ?place ?coord
WHERE {
    ?item wdt:P31 wd:Q5 .         # is a human
    ?item wdt:P106 wd:Q14467526 . # is a linguist
    ?item wdt:P21 wd:Q6581072 .   # gender: female
    ?item wdt:P19 ?place .        # place of birth
    ?place wdt:P625 ?coord .      # coordinates of the birthplace
    
    SERVICE wikibase:label { bd:serviceParam wikibase:language "nl,en,fr,de,es,it,no" }
}

You can try running it yourself if you like.

Like I said, the query language is not user-friendly at all. But the output is!

There’s an obvious and disappointing bias there, but still, 1681 female linguists is a lot. And of course, once you master SPARQL, you can do a zillion other things.
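
If you want to reproduce that count rather than eyeballing the map, the same pattern works as an aggregate query (a quick sketch; the number will presumably have grown by the time you run it):

SELECT (COUNT(DISTINCT ?item) AS ?count)
WHERE {
    ?item wdt:P31 wd:Q5 .         # is a human
    ?item wdt:P106 wd:Q14467526 . # is a linguist
    ?item wdt:P21 wd:Q6581072 .   # gender: female
    ?item wdt:P19 ?place .        # with a known birthplace...
    ?place wdt:P625 ?coord .      # ...that has coordinates (i.e. appears on the map)
}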

One more quick example:

SELECT ?language ?languageLabel ?speakers WHERE {
  ?language wdt:P31 wd:Q34770 ;   # is a language
            wdt:P17 wd:Q1033 ;    # country: Nigeria
            wdt:P1098 ?speakers . # number of speakers
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY DESC(?speakers)

try it here

This lists the languages of Nigeria, ordered by number of speakers.
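
If you want an actual plot rather than a table, you can add one of the query service’s view hints at the top, like the #defaultView:Map line in the first query. For example, the same query as a bubble chart:

#defaultView:BubbleChart
SELECT ?languageLabel ?speakers WHERE {
  ?language wdt:P31 wd:Q34770 ;   # is a language
            wdt:P17 wd:Q1033 ;    # country: Nigeria
            wdt:P1098 ?speakers . # number of speakers
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY DESC(?speakers)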

Is it reliable? Well, as reliable as Wikipedia is, I guess. But where else could one get such data so immediately, and in such a convenient form? (You can download a JSON or CSV file with one click!)
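
And since I mentioned lexemes above: queries over lexemes use a slightly different vocabulary from regular items. Here’s a minimal sketch that pulls English lemmas (I’m assuming Q1860 is the item for English; the ontolex and dct prefixes come predefined in the query service):

SELECT ?lexeme ?lemma WHERE {
  ?lexeme a ontolex:LexicalEntry ;  # it's a lexeme
          dct:language wd:Q1860 ;   # in English
          wikibase:lemma ?lemma .   # get its lemma
}
LIMIT 50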

Anyway, I thought I’d start this topic so we could use it as a place to figure out more about this resource.

2 Likes

This is cool! The only thing I know about Wikidata is to be careful about what you add to it, as it’s effectively impossible to remove data once it’s in there. With living people and the links it can create, it’s easy to unintentionally make private information about someone far too discoverable.

I use Wikidata to retrieve Wikidata (and Wikipedia) URLs for Glottolog languages; see pyglottolog/wikidata.py at master · glottolog/pyglottolog · GitHub

But although I like SQL a lot, wrapping my head around SPARQL hurts.
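
Roughly, the lookup boils down to asking for items that carry a Glottocode (P1394) and, optionally, their Wikipedia sitelinks. A simplified sketch (not the exact query from the script):

SELECT ?item ?glottocode ?article WHERE {
  ?item wdt:P1394 ?glottocode .    # has a Glottolog code
  OPTIONAL {
    ?article schema:about ?item ;  # a page about this item...
             schema:isPartOf <https://en.wikipedia.org/> .  # ...on English Wikipedia
  }
}
LIMIT 100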

1 Like

I feel that! :face_with_head_bandage:

All I have ever done with Wikidata is go through tutorials on YouTube and try things out in the query interface. I get all excited about it, and then do other stuff, and come back to it later and realize I have forgotten :100:% of how to do anything. There’s just something about the syntax that refuses to stick in my head. I suppose the only answer is more practice.

As long as we’re on the topic, it was cool to see that it’s pretty straightforward to run the SPARQL query from your linked code in the query editor:

Generate Glottolog data

If anyone would like to try it, just click the link and hit :arrow_forward:

(Incidentally, I like that LinkProvider base class you use. JavaScript does that iteration business too (yield and all that), but I have yet to wrap my head around it.)

Yes, yield, yield from, itertools and collections are definitely worth looking into when diving into Python. And JavaScript seems to get saner over time, too :slight_smile: To catch up on that, I try to make sure to look things up on MDN Web Docs rather than going with whatever comes up first on Stack Overflow.

1 Like

I love WikiData, but they really fumbled the bag with the recent Lexeme service (a structured multilingual dictionary, queryable through WikiData). It’s a huge waste of effort: they should be scraping/parsing material from existing wikis rather than building from scratch. This is also a general WikiData problem: it is not connected in any way to infoboxes and other existing structured data on Wikipedia.

1 Like

@aryaman oh, I didn’t know about the disconnect between Wikipedia’s structured data and WikiData. I thought that’s where WikiData gets the Glottocodes from.

1 Like

It’s… complicated. There are ways to call WikiData to generate e.g. infoboxes on Wikipedia, but the English Wikipedia has really lagged in adopting them, and in moving its information over into WikiData’s structured format. Other language editions benefit from it quite a bit, though.

1 Like

So does this mean the Glottocodes in WikiData may have been seeded from English Wikipedia a long time ago, but changes in Wikipedia won’t make it back into WikiData? That would be a problem, because as far as I know, the main person adding Glottocodes does so in Wikipedia, not WikiData.

1 Like

Precisely, it’s (mostly) uncoordinated.

1 Like