On Tue, Oct 19, 2010 at 12:02 PM, Tobiah <t...@rcsreg.com> wrote:
> I've been reading about the Unicode today.
> I'm only vaguely understanding what it is
> and how it works.

Petite Abeille already pointed to Joel's excellent primer on the subject; I can only second their endorsement of his article.

> Please correct my understanding where it is lacking.
<snip>
> Now for the mysterious encodings. There is the UTF-{8,16,32}
> which only seem to indicate what the binary representation
> of the unicode character points is going to be. Then there
> are 100 or so other encoding, many of which are language
> specific. ASCII encoding happens to be a 1-1 mapping up
> to 127, but then there are others for various languages etc.
> I was thinking maybe this special case and the others were lookup
> mappings, where a
> particular language user could work with characters perhaps
> in the range of 0-255 like we do for ASCII, but then when
> decoding, to share with others, the plain unicode representation
> would be shared?

There is no such thing as "plain Unicode representation". The closest thing would be an abstract sequence of Unicode codepoints (à la Python's `unicode` type), but this is way too abstract to be used for sharing/interchange, because storing anything in a file or sending it over a network ultimately involves serialization to binary, which is not directly defined for such an abstract representation. Indeed, this is exactly what encodings are: mappings between abstract codepoints and concrete binary; the problem is, there's more than one of them.

Python's `unicode` type (and analogous types in other languages) is a nice abstraction, but at the C level it's actually using some (implementation-defined, IIRC) encoding to represent itself in memory; and so when you leave Python, you also leave this implicit, hidden choice of encoding behind and must instead be quite explicit.

> Why can't we just say "unicode is unicode"
> and just share files the way ASCII users do.

Because just "Unicode" itself is not a scheme for encoding characters as a stream of binary.
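
To make that concrete, here is a minimal sketch. It is written for Python 3, where `str` is the counterpart of Python 2's `unicode`; the sample string and the variable names are assumptions picked purely for illustration. The same abstract codepoints serialize to different bytes under each encoding, and decoding with the wrong one silently yields garbage rather than an error:

```python
# One abstract sequence of four codepoints: c, a, f, U+00E9 (e with acute).
text = "caf\u00e9"

# Four different concrete serializations of the very same text.
utf8 = text.encode("utf-8")        # 5 bytes: U+00E9 takes two bytes
utf16 = text.encode("utf-16-le")   # 8 bytes: two per codepoint here
utf32 = text.encode("utf-32-le")   # 16 bytes: four per codepoint
latin1 = text.encode("latin-1")    # 4 bytes: one per codepoint

for name, data in [("utf-8", utf8), ("utf-16-le", utf16),
                   ("utf-32-le", utf32), ("latin-1", latin1)]:
    print(name, len(data), data)

# Decoding with the encoding that produced the bytes round-trips.
assert utf8.decode("utf-8") == text

# Decoding with a different encoding does not raise here; it just
# produces different (wrong) codepoints, so the receiver has to be told
# which encoding was used.
wrong = utf8.decode("latin-1")
print(repr(wrong))
```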
Unicode /does/ define many encodings, and these encodings are such schemes; /but/ none of them is *THE* One True Unambiguous Canonical "Unicode" encoding scheme. Hence, one must be specific and specify "UTF-8", or "UTF-32", or whatever.

Cheers,
Chris
--
http://blog.rebertia.com