Distance and Checksum Algorithms on Lists
Related to the programming competition, I've started thinking about algorithms for:
- measuring the distance between two orderings of the same list (I guess number of swaps required to go from one to the other; even better if the algorithm gave you the swaps required)
- order-sensitive checksums on lists of integers (as above, the lists always contain the same elements but the goal is to have a checksum that distinguishes different orderings with minimal clashes)
Any suggestions for approaches to these? Working Python code would be even better :-)
Comments (8)
James Tauber on Nov. 10, 2007:
I know Damerau-Levenshtein adds transposition. Jaro-Winkler might also be relevant.
But I haven't yet come across anything that *only* does reordering.
Hamming distance clearly won't work as it would give [1,2,3,4,5] and [1,5,2,3,4] a distance of 4 instead of 1.
Mind you, just writing that example is making me think Levenshtein might be work fine!
Doug Napoleone on Nov. 10, 2007:
Levenshtein would give a distance of 2 for the example you gave (one insertion, one deletion, assuming 1:1:1 weights). There are many nemesis cases for Levenshtein [1, 2, 3, 4] [4, 3, 2, 1] -> 4 subs (when in truth it's 2 swaps). Using 0.5:0.5:0.5 weights could work. Will need to think harder about that.
It feels like a decode is the right fit for this problem (but then everything looks like a decode to me after a while).
James Tauber on Nov. 10, 2007:
Regarding Levenshtein, what if I weighted it to make substitutions (effectively) infinitely expensive so it would end up with just insertions and deletions. Halving this would then give me the number of swaps. (right?)
I have started coming across some work by Shapira and Storer on edit distance based on substring moves which sounds relevant.
I don't know what you mean by decode.
Paddy3118 on Nov. 11, 2007:
http://en.wikipedia.org/wiki/Cyclic_redundancy_check
- Paddy.
Hadi on Nov. 11, 2007:
The problem is: http://acm.tju.edu.cn/toj/showp2754.html
The solution sketch for that problem: http://ace.delos.com/TESTDATA/FEB07.csort.htm
Isn't this what you want? this is the weighted version of your problem, but I believe the weightless problem can also be solved in the same way.
The only thing that makes it un-general is that elements of list must be unique. I haven't thought about the case where the elements aren't unique yet. Any ideas?
James Tauber on Nov. 11, 2007:
In my case the elements of the list are unique.
Doug Napoleone on Nov. 12, 2007:
I was thinking along the lines of a Viterbi decode or something HMM based where the initial is the known input and the second is the hidden cause. (decode = dynamic programming decoder).
Then again something more along the lines of the LCSS (longest common subsequence) decoder [E. Myers] might be a better approach. I have some PDF's of the papers here, but can't seem to find them online. I did find a publication:
http://www.springerlink.com/content/f306g3p440371j77/
There is potential there for doing more than a common subsequence, and I have seen the basic principal used to do tree matches (find the best match in paths on a tree). You could treat the Levenshtein grid as a tree of all possible swaps, and then apply the LCSS walk using either input as the oracle result. This assumes that the least number of swaps is the same going from either input as the start. I forget the name of that proof.
Add a Comment
Last Modified: Nov. 10, 2007
Author: jtauber
Samat Jain on Nov. 10, 2007: