James Tauber's Blog 2006/01/27




Python Unicode Collation Algorithm

My preliminary attempt at a Python implementation of the Unicode Collation Algorithm (UCA) is done and available at:

http://jtauber.com/2006/01/27/pyuca.py (old version—see UPDATE below)

This only implements the simple parts of the algorithm but I have successfully tested it using the Default Unicode Collation Element Table (DUCET) to collate Ancient Greek correctly.

The core of the algorithm, which is what I have implemented, is a multi-level comparison. For example, café comes before caff because at the primary level the accent is ignored and the first word is treated as if it were cafe. The secondary level (which considers accents) only breaks ties between words that are equivalent at the primary level.
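To make the multi-level idea concrete, here is a minimal sketch. It is not how pyuca or the DUCET actually assign weights; it only assumes that Unicode decomposition (NFD) can stand in for them: base letters play the role of primary weights and combining accents play the role of secondary weights. The name toy_sort_key is made up for this illustration.

```python
import unicodedata


def toy_sort_key(word):
    """Return a (primary, secondary) key for a toy two-level comparison.

    Assumption: base letters stand in for primary weights and combining
    marks stand in for secondary weights. Real UCA weights come from a
    collation element table such as the DUCET.
    """
    decomposed = unicodedata.normalize("NFD", word)
    primary = []    # one entry per base letter
    secondary = []  # accents attached to each base letter
    for ch in decomposed:
        if unicodedata.combining(ch) and secondary:
            # Attach the accent to the preceding base letter.
            secondary[-1] += ch
        else:
            primary.append(ch)
            secondary.append("")
    # Tuples compare element by element, so the secondary level is only
    # consulted when the primary levels are equal.
    return (primary, secondary)


words = ["caff", "caf\u00e9", "cafe"]
print(sorted(words))                    # plain code point order
print(sorted(words, key=toy_sort_key))  # toy two-level order
```

With plain code point order café lands after caff, because é has a higher code point than f. With the two-level key the order is cafe, café, caff: the accent only matters once the primary levels tie.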

The UCA (and my code) also supports contraction and expansion. Contraction is where multiple letters are treated as a single unit: in traditional Spanish ordering, ch is treated as a letter coming between c and d so that, for example, words beginning with ch sort after all other words beginning with c. Expansion is where a single letter is treated as though it were multiple letters: in German phone-book ordering, ä is sorted as if it were ae, i.e. after ad but before af.
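Contraction and expansion can both be sketched as lookups in a table that maps strings to lists of weights, trying the longest match first. The table below is entirely hypothetical (weights of 10, 20, ... for a to z, and Spanish and German rules mixed together purely for illustration); a real collator would read its weights from allkeys.txt or a tailored version of it.

```python
import string

# Hypothetical primary weights: a=10, b=20, ..., z=260.
TOY_WEIGHTS = {ch: [10 * (i + 1)] for i, ch in enumerate(string.ascii_lowercase)}

# Contraction: "ch" is one unit that sorts between c (30) and d (40).
TOY_WEIGHTS["ch"] = [35]

# Expansion: a-umlaut (U+00E4) gets the weights of "a" followed by "e".
TOY_WEIGHTS["\u00e4"] = TOY_WEIGHTS["a"] + TOY_WEIGHTS["e"]

MAX_ENTRY_LEN = max(len(entry) for entry in TOY_WEIGHTS)


def toy_key(word):
    """Turn a word into a list of weights using longest-match lookup."""
    weights = []
    i = 0
    while i < len(word):
        # Try the longest possible table entry first so that "ch"
        # wins over "c" followed by "h".
        for length in range(MAX_ENTRY_LEN, 0, -1):
            chunk = word[i:i + length]
            if chunk in TOY_WEIGHTS:
                weights.extend(TOY_WEIGHTS[chunk])
                i += len(chunk)
                break
        else:
            raise KeyError(f"no toy weight for {word[i]!r}")
    return weights


print(sorted(["chico", "cuna", "dama", "casa"], key=toy_key))
print(sorted(["baf", "b\u00e4r", "bad"], key=toy_key))
```

The first call puts chico after cuna but before dama, and the second puts bär between bad and baf, matching the ch and ä examples above.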

Here is how to use the pyuca module:

from pyuca import Collator
c = Collator("allkeys.txt")

# words is any list of Unicode strings
sorted_words = sorted(words, key=c.sort_key)

allkeys.txt (1 MB) is available at

http://www.unicode.org/Public/UCA/latest/allkeys.txt

but you can always subset this to just the characters you are dealing with (and you will need to do this if any language-specific tailoring is needed).
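One way to do the subsetting is to filter the file line by line. This is only a sketch, assuming the usual allkeys.txt layout: each entry starts with one or more hexadecimal code points, then a semicolon, then the collation elements, while comment lines start with # and directive lines start with @. The function names subset_allkeys and write_subset are made up for this example.

```python
def subset_allkeys(lines, wanted_chars):
    """Yield only the table lines whose code points are all wanted.

    Assumed line format: "<hex code points> ; <collation elements> # comment".
    Blank lines, comments (#) and directives (@) are passed through.
    """
    wanted = {ord(ch) for ch in wanted_chars}
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith(("#", "@")):
            yield line
            continue
        code_points, sep, _ = stripped.partition(";")
        if not sep:
            # Not an entry in the assumed format; keep it untouched.
            yield line
            continue
        cps = [int(cp, 16) for cp in code_points.split()]
        # Keep contractions only if every character in them is wanted.
        if cps and all(cp in wanted for cp in cps):
            yield line


def write_subset(src_path, dst_path, wanted_chars):
    """Write a filtered copy of an allkeys-style file."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(subset_allkeys(src, wanted_chars))
```

For example, write_subset("allkeys.txt", "allkeys_subset.txt", wanted_chars) would write the smaller table, where wanted_chars is a string of every character you care about. Depending on how the collator normalizes its input, the subset may also need the combining marks that accented letters decompose into.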

UPDATE (2006-02-13): Now see bug fix

UPDATE (2012-06-21): Now see https://github.com/jtauber/pyuca

by : Created on Jan. 27, 2006 : Last modified June 21, 2012 : (permalink)


Mozart 250th

Today is the 250th anniversary of the birth of Wolfgang Amadeus Mozart.

It's hard to overstate just how much of an influence Mozart has been on me as a composer.

I think at some point around the age of 13, I decided that I liked Mozart more than Beethoven, and while my esteem for Beethoven and (especially) Bach has increased over the years, Mozart dominated my teens.

I taught myself to compose almost entirely by studying scores of Mozart. The Western Australia State Library building had something called the Central Music Library which was the only part of the library you could borrow directly from (as opposed to via inter-library loan at a local library).

I think there was a period of my life where every couple of weekends my mum (whose birthday it also is today) would drive me to the Central Music Library to borrow scores of Mozart symphonies and concerti. I would mark on my calendar every Mozart piece scheduled to be played on the national classical music radio station and tape many of them. Over half the CDs I bought in high school were probably Mozart.

I'd listen to the music, reading along in the score, marking sections I liked and then going back and analyzing them. Studying his scores is how I learnt classical form, orchestration and harmony. Even when I wanted to learn to write fugues, my first model was the Kyrie from his Requiem rather than something from Bach's Well-Tempered Clavier or Art of Fugue.

Thank you, Mozart.

by : Created on Jan. 27, 2006 : Last modified Jan. 27, 2006 : (permalink)