This post contains some notes on, and corrections to, an online article about Unicode and Python.
--------------

By happenstance i was reading:

Unicode HOWTO
http://www.amk.ca/python/howto/unicode

Here are some problems i see:

・ No conspicuous authorship. (However, oddly, it has a conspicuous list of acknowledged names.) (This problem is an indirect consequence of the communist fanaticism ushered in by the OpenSource movement.) (Originally i was just going to write to the author with some corrections.)

・ "It's very wasteful of space. In most texts, the majority of the code points are less than 127, or less than 255, so a lot of space is occupied by zero bytes."

Not true. In Asia, most chars have unicode numbers above 255. Considered globally, *possibly* today there are more computer files in Chinese than in all latin-alphabet based langs.

・ "Many Internet standards are defined in terms of textual data, and can't handle content with embedded zero bytes."

Not sure what he means by "can't handle content with embedded zero bytes". Overall i think this sentence is silly, and he's probably thinking in unix/linux.

・ "Encodings don't have to handle every possible Unicode character, ...."

This is inane. An encoding, by definition, turns numbers into binary numbers (in our context, it means a Unicode encoding handles all unicode chars by definition). What he really meant to say is something like this: "Practically speaking, most computer languages in western society don't need to support unicode with respect to the language's source file".

・ "UTF-8 has several convenient properties: 1. It can handle any Unicode code point. ..."

As mentioned before, by definition, any Unicode encoding encodes the whole unicode char set. Mentioning the above as a "convenient property" is inane.

・ "4. UTF-8 is fairly compact; the majority of code points are turned into two bytes, and values less than 128 occupy only a single byte."

Note here that utf-8 is relatively compact only if most of your text is in the latin alphabet. If you are not an occidental man and you write Chinese, utf-8 is comparatively inefficient.
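To put rough numbers on the size claims above, here is a minimal sketch. It assumes Python 3, where every str is a unicode string; the sample strings and the helper name encoded_sizes are made up for illustration and are not from the HOWTO. It also shows where the "embedded zero bytes" come from: utf-16 of ASCII-range text contains them, utf-8 does not.

```python
# Compare how many bytes the same text takes in utf-8 and in utf-16.
# Assumption: Python 3. Sample strings are arbitrary illustrations.

def encoded_sizes(text):
    """Return the char count and the encoded byte counts of text."""
    return {
        "chars": len(text),
        "utf-8": len(text.encode("utf-8")),
        # "utf-16-le" is used so that no byte order mark is added
        "utf-16": len(text.encode("utf-16-le")),
    }

samples = {
    "english": "Unicode in Python",
    "chinese": "\u4e2d\u6587\u5b57\u7b26",  # four common Chinese chars
}

for name, text in samples.items():
    print(name, encoded_sizes(text))

# Zero bytes show up when ASCII-range text is stored as utf-16,
# but never when it is stored as utf-8.
ascii_as_utf16 = "abc".encode("utf-16-le")
ascii_as_utf8 = "abc".encode("utf-8")
print(b"\x00" in ascii_as_utf16, b"\x00" in ascii_as_utf8)
```

For the four Chinese chars, utf-8 needs 3 bytes per char and utf-16 needs 2; for the English sample it is the other way around (1 byte versus 2).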
(utf-8, as one of the Unicode encodings, is probably comparatively inefficient for Japanese, Korean, Arabic, or any non-latin-alphabet based langs.)

Also note, the article overly focuses on utf-8. Microsoft's Windows NT is probably the first major operating system that supported unicode thoroughly, and it uses utf-16. For much of America and Europe, which are currently roughly the leaders in computing, utf-8 is more efficient in some sense (e.g. at least in disk space requirements). But considering global computing, in particular Chinese & Japanese, utf-16 is overall superior to utf-8.

Part of the reason that utf-8 is favored in this article has to do with Linux (and unix). The reason unixes in general have chosen utf-8 instead of utf-16 is largely that unix is such a mess that it is impossible to support utf-16 without scrapping a large chunk of unix things.

PS I did not read the article in detail, but only roughly, to see how Python handles unicode, because i was often confused by python's encode/decode/unicode methods and functions.

... am gonna continue reading that article about Python specific issues...

Also note, this post is posted thru groups.google.com, and it contains the double angled quotation mark chars. As of 2 weeks ago, these quotation marks seem to be deleted in the process of posting, i.e. unicode names: "LEFT-POINTING DOUBLE ANGLE QUOTATION MARK" and "RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK". Here, i enclose the double-angled quotation marks inside double curly quotes: “« »”. If inside the double curly quotes you see only spaces, then that means google groups messed up.
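As a small aid for the encode/decode confusion and for the quotation mark test above, here is a sketch, again assuming Python 3. It looks up the two chars by code point with the standard unicodedata module, then checks that a string containing them survives an encode/decode round trip. The helper names survives_round_trip and has_angle_quotes are hypothetical, made up for this illustration.

```python
import unicodedata

# The two double angle quotation marks, written as escapes so that
# the source itself does not depend on how it was transmitted.
left, right = "\u00ab", "\u00bb"

for ch in (left, right):
    # Print the code point and the official Unicode name of each char.
    print("U+%04X" % ord(ch), unicodedata.name(ch))

def survives_round_trip(text, encoding="utf-8"):
    """True if encoding text to bytes and decoding back gives the same text."""
    return text.encode(encoding).decode(encoding) == text

def has_angle_quotes(text):
    """Crude check for the 'deleted in posting' symptom."""
    return left in text and right in text

# Curly quotes around the pair, as in the test string of this post.
quoted = "\u201c" + left + " " + right + "\u201d"
print(survives_round_trip(quoted), has_angle_quotes(quoted))
```

In short: str.encode goes from text to bytes, bytes.decode goes from bytes back to text, and the encoding name has to be the same in both directions.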
References and further readings:

・ Unicode in Perl & Python
  http://xahlee.org/perl-python/unicode.html

・ The Journey of a Foreign Character thru Internet
  http://xahlee.org/Periodic_dosage_dir/t2/non-ascii_journey.html

・ Unicode Characters Example
  http://xahlee.org/Periodic_dosage_dir/t1/20040505_unicode.html

・ Python's unicodedata module
  http://xahlee.org/perl-python/unicodedata_module.html

・ Emacs and Unicode Tips
  http://xahlee.org/emacs/emacs_n_unicode.html

・ Java Tutorial: Unicode in Java
  http://xahlee.org/java-a-day/unicode_in_java.html

・ Character Sets and Encoding in HTML
  http://xahlee.org/js/html_chars.html

  Xah
∑ http://xahlee.org/