Sorting in Python with Identical Comparison Keys


Back in November last year, I wrote about treating the ordering of vocabulary learning as a travelling salesman problem.

On the plane back to Perth, I dug up my implementation and started cleaning it up ready to make it available here.

But then I started noticing some odd behaviour sorting in Python 2.4 when more than one item share the same comparison key.

Here is my code:

def freq_order(self): """ return list of atoms in order of frequency they appear as prerequisites. """ atom_list = self.atoms[:]

def freq(x): return len(self.p2g[x])

atom_list.sort(key=freq, reverse=True)

self.p2g is just a dictionary mapping atoms to a set of the goals they are prerequisites for. self.atoms is just a list of atoms (same as self.p2g.keys())

Now, the odd thing is that different runs of my script result a different frequency ordering. If I run the script multiple times in a row, it will get the same result. However, if I open up a new terminal and run it, I'll get a different result. I can consistently alternative between two results also by alternating between python test.py and ./test.py

I decided to dig a little into how Python 2.4 handles multiple items with the same comparison key. It turns out that, relative to that subset, it maintains the original order of the list. This is actually a very useful characteristic but it also indicated the problem was likely in the ordering of the list even before sorting.

Here is how self.atoms comes into being:

for goal in self.goals: for prereq in goal.prereqs: p2g.setdefault(prereq, set()).add(goal)

self.atoms = p2g.keys()

where self.goals comes from:

self.goals = set()

for goal in prereqs_by_goal: self.goals.add(Goal(goal, prereqs_by_goal[goal]))

and prereqs_by_goal comes from:

for line in file(filename): goal, prereq = line.strip().split() prereqs_by_goal.setdefault(goal, set()).add(prereq)

It appears that all these sets result in a non-deterministic ordering atoms being fed into the sorter. It seems I'll have to alter my code to maintain a deterministic ordering.

The consistent python test.py versus ./test.py alternation is still odd, though.

UPDATE: I wonder if the problem is lack of a good hash on the Goal class. If it's using a memory location then that might explain the python test.py versus ./test.py alternation.

UPDATE 2: Yep, looks like it's fixed with a decent __hash__ implementation. Good to know I can still make rookie mistakes :-)

The original post was in the category: python but I'm still in the process of migrating categories over.