[EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
   ...
> > > Maybe I'm missing something fundamental here, but if I have a list of
> > > Unicode strings, and I want to sort these alphabetically, then it
> > > places those that begin with unicode characters at the bottom.
   ...
> Anyway, I know _why_ it does this, but I really do need it to sort
> them correctly based on how humans would look at it.
Depending on the nationality of those humans, you may need very different sorting criteria; indeed, in some countries, different sorting criteria apply to different use cases (such as sorting surnames versus sorting book titles, etc; sorry, I don't recall specific examples, but if you delve into sites about i18n issues you'll find some).

In both Swedish and Danish, I believe, A-with-ring sorts AFTER the letter Z in the alphabet; so, having Aaland (where I'm using Aa for A-with-ring, since this newsreader has some problem in letting me enter non-ascii characters;-) sort "right at the bottom", while it "doesn't look right" to YOU (maybe an English-speaker?), may look right to the inhabitants of that locality (be they Danes or Swedes -- but I believe Norwegian may also work similarly in terms of sorting).

The Unicode consortium does define a standard collation algorithm (UCA) and table (DUCET) to use when you need a locale-independent ordering; at <http://jtauber.com/blog/2006/01/27/python_unicode_collation_algorithm> you'll be able to obtain James Tauber's Python implementation of UCA, to work with the DUCET found at <http://jtauber.com/blog/2006/01/27/python_unicode_collation_algorithm>. I suspect you won't like the collation order you obtain this way, but you might start from there, subsetting and tweaking the DUCET into an OUCET (Oliver Unicode Collation Element Table;-) that suits you better.

A simpler, rougher approach, if you think the "right" collation is obtained by ignoring accents, diacritics, etc (even though the speakers of many languages that include diacritics, &c, disagree;-), is to use the key=coll argument in your sorting call, passing a function coll that maps any Unicode string to what you _think_ it should be like for sorting purposes. The .translate method of Unicode string objects may help there: it takes a dict mapping Unicode ordinals to ordinals or strings (or None for characters you want to delete as part of the translation).
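As a minimal sketch of that .translate-as-sort-key idea (written in Python 3 spelling, where the method lives on str rather than unicode; the three-entry mapping and the sample words are hand-picked assumptions for illustration, not a usable collation table):

```python
# Hypothetical, deliberately tiny translation table: keys are Unicode
# ordinals, values are an ordinal, a replacement string, or None.
td = {
    ord('\u00e9'): ord('e'),   # e-acute -> e   (ordinal to ordinal)
    ord('\u00df'): 'ss',       # sharp s -> ss  (ordinal to string)
    ord('\u00b7'): None,       # middle dot     (deleted)
}

def coll(us, td=td):
    # Map a string to the form we *think* it should have for sorting.
    return us.translate(td)

words = ['\u00e9clair', 'zebra', 'eagle', 'stra\u00dfe', 'strand']
plain = sorted(words)             # code-point order: e-acute lands last
result = sorted(words, key=coll)  # e-acute now sorts among the e's
```

Only the sort key is translated; the strings in the resulting list are left unchanged.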
For example, suppose that what we want is the following somewhat silly collation: we only care about ISO-8859-1 characters, and want to ignore for sorting purposes any accent (be it grave, acute or circumflex), umlauts, slashes through letters, tildes, cedillas. htmlentitydefs has a useful dict called codepoint2name that helps us identify those "weirdly decorated foreign characters".

    def make_transdict():
        import htmlentitydefs
        # codepoint2name maps ordinals to entity names, e.g. 233 -> 'eacute'
        cp2n = htmlentitydefs.codepoint2name
        suffixes = 'acute grave circ uml slash tilde cedil'.split()
        td = {}
        for x in range(128, 256):
            if x not in cp2n: continue
            n = cp2n[x]
            for s in suffixes:
                if n.endswith(s):
                    # strip the suffix, keeping the base letter(s) of the name
                    td[x] = unicode(n[:-len(s)])
                    break
        return td

    def coll(us, td=make_transdict()):
        return us.translate(td)

    listofus.sort(key=coll)

I haven't tested this code, but it should be reasonably easy to fix any problems it might have, as well as making make_transdict "richer" to meet your goals. Just be aware that the resulting collation (e.g., sorting a-ring just as if it was a plain a) will be ABSOLUTELY WEIRD to anybody who knows something about Scandinavian languages...!!!-)

Alex
-- 
http://mail.python.org/mailman/listinfo/python-list
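For readers on Python 3, the make_transdict idea above might be sketched as follows. This is an adaptation under stated assumptions, not the code from the post: htmlentitydefs is spelled html.entities there, unicode is simply str, and the sample list is made up for illustration.

```python
from html.entities import codepoint2name  # Python 3 home of htmlentitydefs

def make_transdict():
    # Entity-name suffixes that mark a "decorated" Latin-1 letter.
    suffixes = 'acute grave circ uml slash tilde cedil'.split()
    td = {}
    for x in range(128, 256):
        n = codepoint2name.get(x)
        if n is None:
            continue
        for s in suffixes:
            if n.endswith(s):
                # 'eacute' -> 'e', 'Oslash' -> 'O'; a bare 'uml' -> ''
                td[x] = n[:-len(s)]
                break
    return td

def coll(us, td=make_transdict()):
    return us.translate(td)

# Hypothetical sample data.
listofus = ['zebra', '\u00e9clair', 'apple', '\u00d8rsted', 'Oslo']
listofus.sort(key=coll)
```

Note that a-ring (entity name 'aring') matches none of the listed suffixes, so it is left untouched unless 'ring' is added to the list; and since the key is still compared by code point, uppercase letters continue to sort ahead of lowercase ones.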