Of Levenshtein distance

pathall · January 11, 2021, 8:12pm

In our writing group today, @rgriscom mentioned something called the Levenshtein distance, and @SarahDopierala, @Andrew_Harvey, and I talked about it. I thought it would be fun to start a topic.

Here’s the Wikipedia article:

Basically the idea is to do with what is sometimes called “fuzzy matching” or “string edit distance”.

I found this cool demo which will give you an idea of how it works:

https://phiresky.github.io/levenshtein-demo/

In this case the distance between elephant and relevant is 3. That’s because it takes 3 changes (“edits”) to turn one into the other:

Insert initial r
Replace p with v
Delete h

This is the basis for fuzzy matching a search string to all the strings in a list. Basically you calculate the Levenshtein distance for the search string vs each word in the list, and sort. The ones with small Levenshteins might be typos, or variants, or related words, or whatever. (In @rgriscom’s case the goal was to search for duplicate names in a big list of speakers.)

“You’re welcome!” — Vladimir Levenshtein

rgriscom · January 12, 2021, 11:23am

Thanks for this background info, Pat! Here is a quick colab notebook demonstrating how to calculate the Levenshtein distance between two strings in Python:

rgriscom · January 14, 2021, 8:56am

Just taking it a step further…

Here is a second version of the script which allows for multiple languages (or varieties). It displays a table of the distances for each pair and calculates the languages which have the lowest distance and greatest distance. Then I added on some hierarchical clustering code I gleaned from the interwebz to make a nice-looking dendrogram.

Some potential next steps:

Tweak the script so that it accepts a .csv file as input with multiple cognates for multiple languages, and calculates the average distance across all cognates for each language pair.
Try using a different distance calculation method

rgriscom · January 14, 2021, 10:50am

Try using LingPy

aryaman · January 27, 2021, 6:48pm

Are there any good distance calculation methods between two IPA strings? Wondering if you can use distance between phones in terms of place of articulation, manner, etc. to have a finer picture than just Levenshtein.

pathall · January 27, 2021, 8:43pm

Yeah this is an interesting idea, @rgriscom and @Andrew_Harvey and I were talking about it a little a while back. The distance can be calculated for strings of arrays just like for strings (at least for some implementations):

(Found a random implementation).

let levenshtein = ( first, second ) => {
	let d	 = []
	let flen = first.length
	let slen = second.length
	
	for( i = 0; i <= flen; i++ ){
		d[i] = d[i] ? d[i] : []
		d[i][0] = i
	}
	
	for( j = 0; j <= slen; j++ ){
		d[0][j] = d[0][j] ? d[0][j] : [];
		d[0][j] = j
	}
	
	for( j = 1; j <= slen; j++ ){
		for( i = 1; i <= flen; i++ ){
			if( first[i - 1] == second[j - 1] ){
				d[i][j] = d[i-1][j-1]
			} else {
				d[i][j] = Math.min( 
					d[i-1][j]	+ 1, 
					d[i][j-1]	+ 1, 
					d[i-1][j-1] + 2 
				)
			}
		}
	}
	
	return d[flen][slen]
}


let a = "houses"
let b = "louse"

levenshtein(a,b) === levenshtein(a.split(""), b.split(""))

Given that, it seems like you could factor in a function that weights the distance with a phonetic distance in terms of features, so like maybe p and b a represented as

let p = {"manner": "plosive", "place": "bilabial", "voicing":"voiceless", "ipa": "p"}
let b = {"manner": "plosive", "place": "bilabial", "voicing":"voiced", "ipa": "b"}
// and so forth…

and then

distance(p, b) // this returns 1 or something

It could get a little hairy — is distance(p,b) less than or equal to distance(p,t)? But it seems doable.

This one seems to review the topic (though it’s old):

Kondrak, Grzegorz. 2003. Phonetic Alignment and Similarity. Computers and the Humanities 37(3). 273–291. doi:10.1023/A:1025071200644.

And this is throws some cold water on the idea for cognate similarity:

Greenhill, Simon J. 2011. Levenshtein Distances Fail to Identify Language Relationships Accurately. Computational Linguistics 37(4). 689–698.

rgriscom · February 8, 2021, 8:11am

LingPy has some built-in Sound Class Models which I think cover some of what Pat is talking about. Here is one of the relevant papers: SCA: Phonetic Alignment Based on Sound Classes | SpringerLink