On 10/19/2010 4:31 PM, Tobiah wrote:
>> There is no such thing as "plain Unicode representation". The closest
>> thing would be an abstract sequence of Unicode codepoints (a la Python's
>> `unicode` type), but this is way too abstract to be used for
>> sharing/interchange, because storing anything in a file or sending it
>> over a network ultimately involves serialization to binary, which is not
>> directly defined for such an abstract representation. (Indeed, this is
>> exactly what encodings are: mappings between abstract codepoints and
>> concrete binary; the problem is, there's more than one of them.)
> Ok, so the encoding is just the binary representation scheme for
> a conceptual list of Unicode code points. So why so many? I get that
> someone might want big-endian, and I see the various virtues of
> the UTF strains, but why isn't a handful of these representations
> enough? Languages may vary widely, but as far as I know computers
> really don't vary that much. Big/little endian is the only problem I
> can think of. A byte is a byte. So why so many encoding schemes?
> Do some provide advantages to certain human languages?
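
To make the quoted point concrete, a minimal sketch in Python 3 (where
str plays the role the Python 2 `unicode` type plays in the quote): the
same abstract sequence of codepoints serializes to different bytes under
different encodings, and the bytes alone do not say which mapping
produced them.

# The same abstract text, serialized under three different encodings.
text = "caf\u00e9"          # four codepoints; the last is U+00E9

for encoding in ("utf-8", "utf-16-le", "latin-1"):
    data = text.encode(encoding)
    print(encoding, data, len(data), "bytes")

# Decoding with the wrong mapping garbles the text (or raises
# UnicodeDecodeError), because nothing in the bytes identifies the
# encoding that produced them.
print(text.encode("utf-8").decode("latin-1"))    # mojibake: 'cafÃ©'
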
The hundred or so language-specific encodings all pre-date Unicode and
are *not* Unicode encodings. They are still used because of inertia and
local optimization.
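
As an illustration of that "local optimization" (a sketch assuming only
the codecs that ship with CPython): a legacy single-byte encoding such
as koi8-r covers its own script in one byte per character, where UTF-8
needs two, but it cannot represent anything outside its small repertoire.

russian = "\u043f\u0440\u0438\u0432\u0435\u0442"   # Cyrillic "privet", 6 codepoints

print(len(russian.encode("koi8-r")))   # 6 bytes -- one per character
print(len(russian.encode("utf-8")))    # 12 bytes -- two per character

# The price of compactness: codepoints outside the legacy repertoire
# simply cannot be encoded.
try:
    "\u4e2d".encode("koi8-r")          # CJK ideograph U+4E2D
except UnicodeEncodeError as err:
    print(err)
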
There are currently about 100,000 assigned Unicode codepoints, with space
for about 1,000,000. The Unicode standard specifies exactly two internal
representations of codepoints, using either 16- or 32-bit words. The
latter uses one word per codepoint; the former usually uses one word but
has to use two for codepoints above 2**16-1. The standard also specifies
about seven byte-oriented transfer formats, UTF-8, -16, and -32, with
big- and little-endian variations. As far as I know, these (and a few
other variations) are the only encodings that encode all Unicode chars
(codepoints).
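
A short sketch of those size and byte-order differences, again in
Python 3 with the standard codecs: UTF-32 spends one 32-bit unit on
every codepoint, UTF-16 needs a surrogate pair for anything above
2**16-1, and the BE/LE transfer formats differ only in byte order
(the plain "utf-16" codec prepends a byte order mark instead).

bmp = "\u20ac"          # EURO SIGN, U+20AC, fits in one 16-bit unit
astral = "\U0001d11e"   # MUSICAL SYMBOL G CLEF, U+1D11E, above 2**16-1

for ch in (bmp, astral):
    units16 = len(ch.encode("utf-16-le")) // 2   # number of 16-bit words
    units32 = len(ch.encode("utf-32-le")) // 4   # number of 32-bit words
    print(hex(ord(ch)), units16, "x 16-bit,", units32, "x 32-bit")

# Same codepoint, opposite byte order:
print(bmp.encode("utf-16-be"))   # high byte first
print(bmp.encode("utf-16-le"))   # low byte first
print(bmp.encode("utf-16"))      # BOM, then the platform's byte order
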
--
Terry Jan Reedy