John Salerno wrote:
> So as it turns out, Unicode and UTF-8 are not the same thing?

Well yes. UTF-8 is one scheme in which the whole Unicode character repertoire can be represented as bytes.

Confusion arises because Windows uses the name 'Unicode' in character encoding lists to mean UTF-16_LE, which is another encoding that can store the whole Unicode character repertoire as bytes. However, UTF-16_LE is not any more definitively 'Unicode' than UTF-8 is.

Further confusion arises because the encoding 'UTF-16' can actually mean two things that are deceptively different:

- Unicode characters stored natively in 16-bit units (using two 16-bit units, a surrogate pair, to represent each character outside of the Basic Multilingual Plane)

- Either of the 8-bit encodings UTF-16_LE and UTF-16_BE, detected automatically using a Byte Order Mark when loaded, or chosen arbitrarily when saving

Yet more confusion arises because UTF-32 (which can reference any Unicode character directly) has the same problem. And though wide-unicode builds of Python understand the first meaning (unicode() strings are stored natively as UTF-32), they don't support the 8-bit encodings UTF-32_LE and UTF-32_BE.

Phew! To summarise: confusion.

> Am I right to say that UTF-8 stores the first 128 Unicode code points
> in a single byte, and then stores higher code points in however many
> bytes they may need?

That is correct.

To answer the original question, we're always going to need byte strings. They're a fundamental part of computing and the need to process them isn't going to go away. However, as Unicode text manipulation becomes more common than byte string processing, it makes sense to change the default kind of string you get when you type a literal.

Personally I would like to see byte strings available under an easy syntax like b'...' and UTF-32 strings available as w'...', or something like that. Currently, having u'...' mean either UTF-16 or UTF-32 depending on compile-time options is very, very annoying for the few kinds of programs that really do need to know the difference.
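As a rough illustration of the byte counts and the Byte Order Mark behaviour described above, here is a small sketch. It assumes a present-day Python 3 interpreter (which postdates this message), where every str is Unicode and the codecs 'utf-16-le' and 'utf-32-le' are available. The helper name encoded_lengths and the sample characters are made up for the example.

```python
import codecs

# Sample code points: ASCII, Latin-1 range, rest of the BMP, outside the BMP.
samples = ["A", "\u00e9", "\u20ac", "\U0001d11e"]

def encoded_lengths(ch):
    """Return the byte length of ch in UTF-8, UTF-16-LE and UTF-32-LE."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return tuple(len(ch.encode(enc)) for enc in encodings)

for ch in samples:
    print("U+%04X" % ord(ch), encoded_lengths(ch))
# U+0041 (1, 2, 4)
# U+00E9 (2, 2, 4)
# U+20AC (3, 2, 4)
# U+1D11E (4, 4, 4)   <- UTF-16 needs a surrogate pair here

# The generic 'utf-16' codec writes a Byte Order Mark when encoding
# and uses it to pick the byte order when decoding.
with_bom = "A".encode("utf-16")
print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))  # True
print(with_bom.decode("utf-16"))  # A
```

The first 128 code points take one byte in UTF-8, while anything outside the Basic Multilingual Plane takes four bytes in all three encodings shown.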
But whatever is chosen, it's all tasty Python 3000 future-soup and not worth worrying about for the moment.

-- 
And Clover
http://www.doxdesk.com/