Adventures with bookmarklets and transliteration

In the old days, before Unicode, people did some crazy stuff to try to get complex text to appear on the web.

I have been studying a language called Pali, and this website contains a text called the Dhammapada. The text is traditionally divided into numbered, 4- or 8-line stanzas called gathas, and each of those gets its own page on the site.

So Gatha 1 looks like this:

http://buddhism.lib.ntu.edu.tw/BDLM/lesson/pali/reading/gatha1.htm

There’s a ton of syntactic, morphological and lexical information in these pages, but there is a serious problem: the content is weirdly encoded. In this post I would like to explain what that problem is, and how a rather kooky work-around called a “bookmarklet” can be used to help. Creating bookmarklets is a bit of a hassle, even if you know Javascript, but nonetheless I think discussing what bookmarklets do could potentially be enlightening about some aspects of the web we all use all day.

Pali

Pali is a pretty neat language for several reasons, but I’m going to try to stay on topic! For the purposes of this discussion, all we need to keep in mind is that there is a Romanized orthography that is widely used (and used on the site linked above), and that that orthography has several diacritics. Vowels can be marked with a macron (ā, ī, ū), and like many Indo-Aryan languages, retroflex consonants are written with a dot below: ṭ, ḍ, etc. (and there are a few more; you can see a full table here).

Anyway, the text at Gatha 1 looks fine:

Manasā ce paduṭṭhena

Lovely little diacritics.

But things get complicated later on. For instance, the second line of Gatha 100 appears like this:

ekaj atthapadaj seyyo yaj sutva upasammati

This is all kinds of not right. Pali doesn’t have final j, ever. What gives?

It’s in the HTML

I right-clicked (command-clicked, actually, in my case, as I use a Mac) and then ran inspect element on that second line (try it on the page!) and boy howdy, did I see some… special HTML.

If you do any work in HTML you will know that this is… gross. <center> is obsolete, as is, emphatically, <font>. So that’s the first clue we’re dealing with old stuff here.

It’s not hard to find other versions of this text (it’s quite famous), and then we can compare what we have with what we expect.

So a correct rendering of the line is:

Ekaṃ atthapadaṃ seyyo, yaṃ sutvā upasammati.

So, aligning the two:

ekaj atthapadaj seyyo yaj sutva upasammati
ekaṃ atthapadaṃ seyyo, yaṃ sutvā upasammati.

So the «j» corresponds to an … «ṃ»?

And the «a» corresponds to an … «ā»??

Waaait a second… both «a» and «ā» exist in Pali!

In technical terms, WTF?

I did some digging and figured out the whole “transliteration” table:

crazy   not crazy
a       ā
j       ṃ
n       ṇ
v       ṅ
i       ī
b       ñ
t       ṭ

How in the sam hell they came up with that scheme I do not know.
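Part of that digging can be automated. Here is a minimal sketch, assuming the garbled line and the correct line match up character for character once you lowercase them and strip the punctuation (true for the Gatha 100 line, not guaranteed elsewhere). The name `deriveCorrespondences` is made up for this example.

```javascript
// Hypothetical helper: line up the garbled and the correct text character by
// character and record every position where the two differ.
function deriveCorrespondences(garbled, correct) {
  // Lowercase, drop punctuation, and normalize to NFC so that each letter,
  // together with its diacritic, is a single code point.
  const clean = s =>
    Array.from(s.normalize("NFC").toLowerCase().replace(/[.,;]/g, ""));
  const g = clean(garbled);
  const c = clean(correct);
  if (g.length !== c.length) {
    throw new Error("lines do not align character by character");
  }
  const pairs = new Map();
  g.forEach((ch, i) => {
    if (ch !== c[i]) pairs.set(ch, c[i]);
  });
  return pairs;
}

const pairs = deriveCorrespondences(
  "ekaj atthapadaj seyyo yaj sutva upasammati",
  "Ekaṃ atthapadaṃ seyyo, yaṃ sutvā upasammati."
);
console.log([...pairs]); // the mismatches found in this one line
```

One line only surfaces the pairs that happen to occur in it, so filling out the whole table means running more stanzas through the same comparison.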

So fix it dear Henry

The information in these pages is so useful to me that I was motivated to figure out how to fix it, somehow. What were my options?

  1. I could download every page from the site, and write a program to fix it so I would have a local mirror.
  2. I could email the people who run the site and ask them to fix it. I’m much too impatient for that.
  3. I could try to fix it when I’m on the site, using Javascript.

Naturally I went with option 3.

Man, this post is getting long. :man_shrugging:

So taking a closer look at the HTML, we see this:

<center>
<h1>
<font size="+2">eka<font face="Foreign1">j</font> atthapada<font face="Foreign1">j</font>
seyyo ya<font face="Foreign1">j</font> sutv<font face="Foreign1">a</font>
upasammati</font></h1></center>

ekaj atthapadaj seyyo yaj sutva upasammati

So let’s forget about the <center>, <h1>, and <font size="+2"> (blech!) for a minute, and focus on the <font face="Foreign1"> bits. In particular, compare these two words: yaṃ and sutvā. The first one has a short «a», and the second has a long «ā».

But look at the HTML for those two words:

 ya<font face="Foreign1">j</font> 
sutv<font face="Foreign1">a</font>

The mystery is solved

This crazy HTML is using an equally crazy font where letters pretend to be other letters!

:thinking:

That’s right kids, for this page to “look” right, you have to have the font known as Foreign1 on your system (you have to install it just for this site!). Whatever that font is, there’s no way to know from just reading the actual page.

And in that crazy font (which I very much do not want anywhere near my system!), any instance of a letter from the “crazy” column is told to render as its equivalent in the “not crazy” column. Inside of <font face="Foreign1">, «a» means «ā». Anywhere else, «a» is «a».

This means that if you have this crazy font, in order to search for the word yaṃ, you have to type yaj in your search box.
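To make that concrete, here is a small sketch that flips the correspondences around and works out what you would have to type. The names `unfixes` and `toSearchTerm` are invented for this example, and it assumes the seven pairs used in this post are the whole scheme.

```javascript
// The crazy -> not crazy correspondences used in this post.
const fixes = { a: "ā", j: "ṃ", n: "ṇ", v: "ṅ", i: "ī", b: "ñ", t: "ṭ" };

// Flip the table: not crazy -> crazy (keys normalized to NFC).
const unfixes = Object.fromEntries(
  Object.entries(fixes).map(([crazy, sane]) => [sane.normalize("NFC"), crazy])
);

// Hypothetical helper: the string you would have to search for on the page
// in order to find a properly spelled word.
function toSearchTerm(word) {
  return Array.from(word.normalize("NFC"))
    .map(ch => unfixes[ch] ?? ch)
    .join("");
}

console.log(toSearchTerm("yaṃ"));   // → yaj
console.log(toSearchTerm("sutvā")); // → sutva
```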

This is not the way to enlightenment.

Henry you have not fixed it yet

Phew. So now we need to fix it.

I’m running out of steam so I’m not going to run through the code I used to fix it in too much detail, but in general terms it goes like this:

  1. Store the crazy/notcrazy correspondences
  2. Find all the <font face=Foreign1> tags
  3. Replace every crazy character inside those tags with its corresponding not crazy character.

Done.

Here’s one way to write that in Javascript:


let fix = () => {
  // (1) crazy/not crazy correspondences
  let fixes = {
    "a": "ā",
    "j": "ṃ",
    "n": "ṇ",
    "v": "ṅ",
    "i": "ī",
    "b": "ñ",
    "t": "ṭ"
  }

  // (2) the offending tags
  let fontTags = document.querySelectorAll('font[face=Foreign1]')

  // (3) the complicated bit that carries out the substitution:
  // each <font> tag is replaced outright by its corrected text
  fontTags.forEach(fontTag => {
    let content = fontTag.textContent.trim()
    let crazies = Array.from(content)
    fontTag.outerHTML = crazies
      .map(crazy => fixes[crazy] ?? crazy)
      .join("")
  })
}

// (4) run the function.
fix()

So if you open the console and paste in that code and run it on a page like Gatha 100, you will see the crazy go away, presto-change-o.

I am easily amused.
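For comparison, option 1 from the list above (fixing downloaded copies) could start from something like the sketch below. It works on the HTML as a plain string, so it runs outside the browser too. `fixHtml` is a made-up name, and it assumes the Foreign1 tags only ever contain plain text, which is what the pages I looked at seem to do. Picking HTML apart with a regular expression is fragile in general.

```javascript
// The crazy -> not crazy correspondences used in this post.
const fixes = { a: "ā", j: "ṃ", n: "ṇ", v: "ṅ", i: "ī", b: "ñ", t: "ṭ" };

// Hypothetical helper for a downloaded page: swap each
// <font face="Foreign1">...</font> run for its corrected text.
// Assumes the tag contains plain text only (no nested tags).
function fixHtml(html) {
  const foreign = /<font\s+face=["']?Foreign1["']?\s*>([^<]*)<\/font>/gi;
  return html.replace(foreign, (_match, text) =>
    Array.from(text).map(ch => fixes[ch] ?? ch).join("")
  );
}

const sample =
  'ya<font face="Foreign1">j</font> sutv<font face="Foreign1">a</font>';
console.log(fixHtml(sample)); // → yaṃ sutvā
```

Other <font> tags, like the size="+2" one, are left alone because the pattern only matches face=Foreign1.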

One more thing: a bookmarklet

Well, great. But cutting and pasting code, opening the console, and then running it is an awful lot of work. If I’m (hah!) going to read all 423 of these things, do I have to do that 423 times?

That would suck.

Well, we can at least get it down to one-click-per-page using a weird thing called a bookmarklet. Again, there are some weird aspects (even for a Javascript programmer) that I won’t go into here about how one converts random JS code into a bookmarklet, but the basic idea is that you’re creating a link (an <a>) whose value is not a URL, but some code.
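Roughly, the conversion goes like this. This is a sketch of one common recipe, not necessarily how the bookmarklets linked in this post were built, and `toBookmarklet` is a name I made up: wrap the code in a function that runs immediately, percent-encode it, and stick javascript: on the front.

```javascript
// Hypothetical helper: wrap a snippet in an immediately invoked function
// (so its variables stay out of the page's globals), percent-encode it, and
// prefix the javascript: scheme so the browser runs it instead of navigating.
function toBookmarklet(code) {
  const wrapped = `(() => {${code}})();`;
  return "javascript:" + encodeURIComponent(wrapped);
}

const href = toBookmarklet(`alert("hello from a bookmarklet")`);
console.log(href);

// The link you would put on a page for people to drag to their toolbar.
console.log(`<a href="${href}">say hello</a>`);
```

The encoding step is what lets quotes, spaces and newlines in the code survive inside an href attribute.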

One way to install a bookmarklet is just to drag one from a page into the bookmarks toolbar of your browser (here’s a venerable collection (some of which no longer work, but you get the idea)).

So I set one of those up and stuck it on the web here:

https://ruphus.com/fix-ntu.html

You can drag that to your bookmarks toolbar and go to a page like this one, and then click it. And remember not to get too angry at your computer.

Does this have anything to do with language documentation?

Well, sort of. In cases where you are working with some web resource a lot, a bookmarklet can sometimes save you tons of time or at least make your life easier. For instance, the UCLA Phonetics Archive has an astounding amount of content. But there are some features that are super annoying: one is that all the audio files are only linked, not shown as <audio> tags. This means that they cannot be played while you are browsing the page.

That sucks.

So, you could write a thing that creates an inline play button, like this:

document.querySelectorAll('a[href$=wav]')
  .forEach(a => {
    a.innerHTML = `▶️ PLAY`
    a.addEventListener('click', e => {
      // don’t follow the link
      e.preventDefault()

      // only add one player per link, however often it is clicked
      if (a.nextElementSibling?.tagName === 'AUDIO') return

      // create an audio tag
      let audio = document.createElement('audio')
      audio.src = a.href
      audio.controls = true
      a.after(audio)
    })
  })

A bookmarklet that does this could conceivably be useful on a page with a ton of recordings, like this one on Zulu.

Here’s a bookmarklet you can try if so inclined:

https://ruphus.com/play-ucla-bookmarklet.html


It occurs to me that this whole bookmarklet business doesn’t work on phones.


This is so wild?? The idea of bookmarklets is pretty intuitive and feels kinda like the inverse of (or perhaps an extension of) js template literals, which have always been pretty neat anyway!
The HTML snippet you showed from the original (with the font and center) smells a lot like LaTeX, which I find kind of hilarious given how simple/accessible HTML is supposed to be (especially for such a basic use case like this). I thought personalized Unicode code blocks (or whatever they’re called…like the very end of it where Unicode will never define characters so people can use them for themselves) were weird and tough to slog through, but this…yeesh.
Nothing productive to say, just a neat post ^.^


Hi Sunny!

Bookmarklets can be kind of useful but they are weird in that they aren’t really “in” the web, you know? They are like this weird side-thing; I put them in the same box as browser extensions/add-ons, which can be useful but I haven’t ever really bothered to learn. (Come to think of it, an extension would probably be a better option in this circumstance anyway.)

Good to hear from you!

I remember using Tampermonkey (browser extension on Chrome and Firefox) for running scripts like these automatically on certain websites. That would get rid of the need to even click the bookmarklet without having to make a full-blown extension :slight_smile:


Huh, I haven’t heard of Tampermonkey, but I definitely remember Greasemonkey, which sounds similar. There used to be something of a cottage industry in “userscripts”. Looks like userscripts.org is now available only as a static mirror. (Although what became of the original is pretty hilarious!)
