On 10/19/2010 4:31 PM, Tobiah wrote:
>> There is no such thing as "plain Unicode representation". The closest
>> thing would be an abstract sequence of Unicode codepoints (a la Python's
>> `unicode` type), but this is way too abstract to be used for
>> sharing/interchange, because storing anything in a file or sending it
>> over a network ultimately involves serialization to binary, which is not
>> directly defined for such an abstract representation. (Indeed, this is
>> exactly what encodings are: mappings between abstract codepoints and
>> concrete binary; the problem is, there's more than one of them.)
> Ok, so the encoding is just the binary representation scheme for
> a conceptual list of Unicode code points. So why so many? I get that
> someone might want big-endian, and I see the various virtues of
> the UTF strains, but why isn't a handful of these representations
> enough? Languages may vary widely, but as far as I know computers
> really don't vary that much. Big/little endian is the only problem I
> can think of. A byte is a byte. So why so many encoding schemes?
> Do some provide advantages to certain human languages?
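
To make the quoted point concrete, a minimal sketch in Python 3 (where
str plays the role the Python 2 `unicode` type plays in the quote): the
same abstract sequence of codepoints serializes to different bytes under
different encodings, and the bytes alone do not say which mapping
produced them.

# The same abstract text, serialized under three different encodings.
text = "caf\u00e9"          # four codepoints; the last is U+00E9

for encoding in ("utf-8", "utf-16-le", "latin-1"):
    data = text.encode(encoding)
    print(encoding, data, len(data), "bytes")

# Decoding with the wrong mapping garbles the text (or raises
# UnicodeDecodeError), because nothing in the bytes identifies the
# encoding that produced them.
print(text.encode("utf-8").decode("latin-1"))    # mojibake: 'cafÃ©'
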
The hundred or so language-specific encodings all pre-date Unicode and
are *not* Unicode encodings. They are still used because of inertia and
local optimization.
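
As an illustration of that "local optimization" (a sketch assuming only
the codecs that ship with CPython): a legacy single-byte encoding such
as koi8-r covers its own script in one byte per character, where UTF-8
needs two, but it cannot represent anything outside its small repertoire.

russian = "\u043f\u0440\u0438\u0432\u0435\u0442"   # Cyrillic "privet", 6 codepoints

print(len(russian.encode("koi8-r")))   # 6 bytes -- one per character
print(len(russian.encode("utf-8")))    # 12 bytes -- two per character

# The price of compactness: codepoints outside the legacy repertoire
# simply cannot be encoded.
try:
    "\u4e2d".encode("koi8-r")          # CJK ideograph U+4E2D
except UnicodeEncodeError as err:
    print(err)
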
There are currently about 100,000 assigned Unicode codepoints, with space
for about 1,000,000. The Unicode standard specifies exactly two internal
representations of codepoints, using either 16- or 32-bit words. The
latter uses one word per codepoint; the former usually uses one word but
has to use two for codepoints above 2**16-1. The standard also specifies
about seven byte-oriented transfer formats, UTF-8, -16, and -32, with
big- and little-endian variations. As far as I know, these (and a few
other variations) are the only encodings that encode all Unicode chars
(codepoints).
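
A short sketch of those size and byte-order differences, again in
Python 3 with the standard codecs: UTF-32 spends one 32-bit unit on
every codepoint, UTF-16 needs a surrogate pair for anything above
2**16-1, and the BE/LE transfer formats differ only in byte order
(the plain "utf-16" codec prepends a byte order mark instead).

bmp = "\u20ac"          # EURO SIGN, U+20AC, fits in one 16-bit unit
astral = "\U0001d11e"   # MUSICAL SYMBOL G CLEF, U+1D11E, above 2**16-1

for ch in (bmp, astral):
    units16 = len(ch.encode("utf-16-le")) // 2   # number of 16-bit words
    units32 = len(ch.encode("utf-32-le")) // 4   # number of 32-bit words
    print(hex(ord(ch)), units16, "x 16-bit,", units32, "x 32-bit")

# Same codepoint, opposite byte order:
print(bmp.encode("utf-16-be"))   # high byte first
print(bmp.encode("utf-16-le"))   # low byte first
print(bmp.encode("utf-16"))      # BOM, then the platform's byte order
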
--
Terry Jan Reedy