James Tauber

journeyman of some

blog > 2007 > 11 > 10 >

Distance and Checksum Algorithms on Lists

Related to the programming competition, I've started thinking about algorithms for:

Any suggestions for approaches to these? Working Python code would be even better :-)

Comments (8)

Samat Jain on Nov. 10, 2007:

For measuring distance between two orderings of the same list--can't you use the Levenshtein edit distance algorithm?

James Tauber on Nov. 10, 2007:

I did consider Levenshtein but I'm not sure that's really the best measure given that the operations are completely different.

I know Damerau-Levenshtein adds transposition. Jaro-Winkler might also be relevant.

But I haven't yet come across anything that *only* does reordering.

Hamming distance clearly won't work as it would give [1,2,3,4,5] and [1,5,2,3,4] a distance of 4 instead of 1.

Mind you, just writing that example is making me think Levenshtein might be work fine!

Doug Napoleone on Nov. 10, 2007:

I am a little unclear on the checksum, I assume you mean a checksum per element. For that a sleazy way to do it would be to do it would be to use a position in the list as the seed. I can't think of a nemesis collision case which would not apply equally to other systems.

Levenshtein would give a distance of 2 for the example you gave (one insertion, one deletion, assuming 1:1:1 weights). There are many nemesis cases for Levenshtein [1, 2, 3, 4] [4, 3, 2, 1] -> 4 subs (when in truth it's 2 swaps). Using 0.5:0.5:0.5 weights could work. Will need to think harder about that.

It feels like a decode is the right fit for this problem (but then everything looks like a decode to me after a while).

James Tauber on Nov. 10, 2007:

For the checksum, I wanted to generate a single checksum for the whole list but it would be nice if it could be recalculated easily for a swap. I did consider something involving the position in the list, along with something like XOR which is reversible. To be honest, hash() is probably fast enough I could not worry about incrementally updating and just do a hash(tuple(x)) every time the list x changes.

Regarding Levenshtein, what if I weighted it to make substitutions (effectively) infinitely expensive so it would end up with just insertions and deletions. Halving this would then give me the number of swaps. (right?)

I have started coming across some work by Shapira and Storer on edit distance based on substring moves which sounds relevant.

I don't know what you mean by decode.

Paddy3118 on Nov. 11, 2007:

Doesn't a CRC take position into account?
http://en.wikipedia.org/wiki/Cyclic_redundancy_check

- Paddy.

Hadi on Nov. 11, 2007:

About the distance problem: There was a USACO problem similar to what you want.

The problem is: http://acm.tju.edu.cn/toj/showp2754.html
The solution sketch for that problem: http://ace.delos.com/TESTDATA/FEB07.csort.htm

Isn't this what you want? this is the weighted version of your problem, but I believe the weightless problem can also be solved in the same way.

The only thing that makes it un-general is that elements of list must be unique. I haven't thought about the case where the elements aren't unique yet. Any ideas?

James Tauber on Nov. 11, 2007:

Hadi, thanks for the links.

In my case the elements of the list are unique.

Doug Napoleone on Nov. 12, 2007:

James,

I was thinking along the lines of a Viterbi decode or something HMM based where the initial is the known input and the second is the hidden cause. (decode = dynamic programming decoder).

Then again something more along the lines of the LCSS (longest common subsequence) decoder [E. Myers] might be a better approach. I have some PDF's of the papers here, but can't seem to find them online. I did find a publication:
http://www.springerlink.com/content/f306g3p440371j77/

There is potential there for doing more than a common subsequence, and I have seen the basic principal used to do tree matches (find the best match in paths on a tree). You could treat the Levenshtein grid as a tree of all possible swaps, and then apply the LCSS walk using either input as the oracle result. This assumes that the least number of swaps is the same going from either input as the start. I forget the name of that proof.


Add a Comment

Created: Nov. 10, 2007
Last Modified: Nov. 10, 2007
Author: jtauber