2011-02-18

Debugging unsortable problems in Python

Working in Python 2.6.1 on my Mac I noticed the following behaviour recently while debugging the QTI migration code:

>>> 'z'<('a','b')
True
>>> ('a','b')<u'a'
True
>>> u'a'<'z'
True

These three comparisons, between a string, a tuple and a unicode string demonstrate that it is easily possible to create an unsortable list of objects out of basic immutable objects such as might be used as keys in a dictionary.

This might look a bit esoteric but I'm only writing this blog post because I caught a bug which was caused by the incorrect assumption that lists of strings, tuples and unicode strings sort predictably.  I was representing XML attribute names using tuples if an attribute had a defined namespace.  The names were then used as keys into a dictionary.  Note that both 'a' and u'a' can be used interchangeably in Python 2.6 when looking up an entry in a dictionary so it was easy to go one step further and grab the list of keys, sort them and assume that the result would be predictable.  Not so.

The order of the keys returned by the key() method of a dictionary is not defined and the sort method will return different results depending on the initial order of the resulting list.

It took me a while to find someone else struggling with a similar problem but I took great solace in Incomparable Abominations.   This blog post deals with changes from Python version 1 to version 2.

I believe that Python 3 is doing two things to address the problem I'm having.  Firstly, the sloppy lack of distinction between strings and unicode strings is being cleaned up.  The transition will be painful (and mean more work getting the QTI migration tool working on Python 3-based systems) but it will prevent the type of comparison loop above.  Comparisons are also being tightened to prevent different types comparing unpredictably, a (unicode) string and a tuple will not be comparable in future meaning I catch bugs like this one earlier.

So a better future awaits, but why do the comparisons give the results they do in Python 2?  The answer is almost poetic.  Objects of different types usually sort by their class name, the comparison of a string and a unicode string is the exception because, provided the string is 7-bit clean, it is assumed to be ascii and compared as a string of characters.  We can reveal the class names using the interpreter:

>>> 'z'.__class__.__name__
'str'
>>> ('a','b').__class__.__name__
'tuple'
>>> u'a'.__class__.__name__ 'unicode'

As you can see, the type names start with the alphabetic sequence 's','t','u'.