Python has a nice unicodedata module for working with Unicode character data.

# -*- coding: utf-8 -*-
# python
from unicodedata import lookup, name

# each unicode char has a unique name.
# one can use the "lookup" func to find it

mychar=lookup('greek cApital letter sIgma') # note letter case doesn't matter
print mychar.encode('utf-8')

m=lookup('CJK UNIFIED IDEOGRAPH-5929') # for some reason, case must be right here.
print m.encode('utf-8')

# to find a char's name, use the "name" function
print name(u'å')

Basically, in Unicode, each char has a number of attributes (called properties) besides its name. These attributes provide the info needed to form letters and words, and to do processing such as sorting and capitalization, for various human scripts. For example, the Latin alphabet has upper case and lower case forms. Korean letters are stacked together into syllable blocks. Many symbols correspond to numbers, and there are also combining forms, used for example to put a bar over any letter or character. Also, some writing systems are directional. In order to form these symbols for display, or to process them for computing, this info is needed for each char.

The rest of the functions in unicodedata return these attributes.

See the unicodedata doc:
http://python.org/doc/2.4/lib/module-unicodedata.html

Official word on Unicode character properties:
http://www.unicode.org/uni2book/ch04.pdf

--

I don't know the state of Perl's Unicode support. Is there something similar?

--
this post is archived at
http://xahlee.org/perl-python/unicodedata_module.html

Xah
http://xahlee.org/PageTwo_dir/more.html
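
Here is a rough sketch of how the other functions in unicodedata report the properties described in this post (general category, combining class, direction, numeric value, decomposition). It is written for Python 3, unlike the Python 2 code earlier in the post. The helper name describe and the sample characters are my own picks for illustration and are not part of the module.

```python
import unicodedata

def describe(ch):
    """Collect a few Unicode properties of a single character."""
    return {
        'name': unicodedata.name(ch, '<unnamed>'),
        'category': unicodedata.category(ch),            # e.g. upper case letter, mark, number
        'combining': unicodedata.combining(ch),          # 0 means not a combining char
        'bidirectional': unicodedata.bidirectional(ch),  # writing direction class
        'numeric': unicodedata.numeric(ch, None),        # None if the char has no numeric value
        'decomposition': unicodedata.decomposition(ch),  # '' if there is no mapping
    }

samples = [
    'A',        # Latin upper case letter
    '\u03a3',   # GREEK CAPITAL LETTER SIGMA
    '\u00bd',   # VULGAR FRACTION ONE HALF, a symbol that stands for a number
    '\u0305',   # COMBINING OVERLINE, puts a bar over the preceding char
    '\u05d0',   # HEBREW LETTER ALEF, from a right-to-left script
    '\ud55c',   # HANGUL SYLLABLE HAN, a Korean syllable block
]

for ch in samples:
    print(describe(ch))
```

Returning a plain dict keeps the sketch short; for the exact meaning of each value, the unicodedata doc and the Unicode chapter on character properties are the places to look.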