Re: Grapheme clusters, a.k.a. real characters

Terry Reedy Fri, 14 Jul 2017 14:15:53 -0700

On 7/14/2017 10:30 AM, Michael Torrie wrote:

On 07/14/2017 07:31 AM, Marko Rauhamaa wrote:

Of course, UTF-8 in a bytes object doesn't make the situation any
better, but does it make it any worse?


As it stands, we have

    è --[encode>-- Unicode --[reencode>-- UTF-8

Why is one encoding format better than the other?

All digital data are ultimately bits, usually collected together in groups of 8, called bytes. The point of Python 3 is that text should normally be instances of a text class, separate from the raw bytes class, with a defined internal encoding. The actual internal encoding is secondary. And it changed in 3.3.
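A minimal sketch of that separation, assuming Python 3.3 or later. The sample character is just an illustration; the point is that the str holds one character, while the bytes object holds whatever a chosen encoding produces for it.

```python
# A str is text: a sequence of code points, whatever the internal storage is.
text = "\u00e8"                 # the single character è (U+00E8)

# A bytes object is raw data: here, the UTF-8 encoding of that text.
data = text.encode("utf-8")

print(type(text).__name__, len(text))   # str 1
print(type(data).__name__, len(data))   # bytes 2
print(data)                             # b'\xc3\xa8'

# Decoding with the same codec recovers the original text.
print(data.decode("utf-8") == text)     # True
```

Nothing here depends on how the interpreter stores the str internally, which is why the storage could change in 3.3 without changing this output.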

Python ints are encoded bytes, and so are floats and everything else. When one prints a float, one certainly does not see a representation of the raw bytes in the float object. Instead, one sees a representation of the value it represents. There is a proposal to change the internal encoding of int, at least on 64-bit machines, which are now standard. However, because print(87987282738472387429748) prints 87987282738472387429748 and not the internal bytes, the change in the internal bytes will not affect the user view of ints.
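A small illustration of the difference between a value and one byte-level encoding of it. The struct and to_bytes calls below produce standard serialized forms (an IEEE 754 double and a big-endian integer); they are stand-ins I chose for the example, not a dump of the interpreter's actual in-memory object layout.

```python
import struct

x = 1.5
# One byte-level encoding of the float: IEEE 754 double, little-endian.
raw_float = struct.pack("<d", x)

print(x)                # 1.5  (the value, not the bytes)
print(raw_float.hex())  # 000000000000f83f

n = 87987282738472387429748
# One possible byte encoding of the int; NOT the interpreter's internal layout.
raw_int = n.to_bytes((n.bit_length() + 7) // 8, "big")

print(n)                                # 87987282738472387429748
print(int.from_bytes(raw_int, "big") == n)  # True
```

Whatever bytes sit behind n, print(n) shows the same digits, so a change of internal encoding is invisible at this level.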

This is precisely the logic behind Google using UTF-8 for strings in Go,
rather than having some O(1) abstract type like Python has.  And many
other languages do the same.  The argument is that because of the very
issues that you mention, having O(1) lookup in a string isn't that
important, since looking up a particular index in a unicode string is
rarely the right thing to do, so UTF-8 is just fine as a native,
in-memory type.
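To make the indexing trade-off concrete, here is a rough sketch in Python. The helper nth_code_point is hypothetical, written only for this illustration; it is not how Go or any particular runtime implements strings. It shows that finding the n-th code point in UTF-8 bytes means scanning for lead bytes, whereas indexing a Python str is direct.

```python
# Four code points that need 1, 2, 3, and 4 bytes in UTF-8:
# 'a', U+00E8 (e with grave), U+20AC (euro sign), U+1D11E (musical G clef).
s = "a\u00e8\u20ac\U0001d11e"
encoded = s.encode("utf-8")

print(len(s), len(encoded))   # 4 10

# Byte offset at which each code point starts in the UTF-8 data.
offsets = []
pos = 0
for ch in s:
    offsets.append(pos)
    pos += len(ch.encode("utf-8"))
print(offsets)                # [0, 1, 3, 6]


def nth_code_point(data: bytes, n: int) -> str:
    """Return the n-th code point of UTF-8 data by scanning it (O(len(data)))."""
    # Continuation bytes look like 0b10xxxxxx; every other byte starts a code point.
    starts = [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]
    start = starts[n]
    end = starts[n + 1] if n + 1 < len(starts) else len(data)
    return data[start:end].decode("utf-8")


print(nth_code_point(encoded, 2) == s[2])   # True
```

Note that even s[2] returns a code point, not a grapheme cluster, which is the reason given above for indexing being rarely the right operation.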

Does Go use bytes for text, like most people did in Python 2, or a separate text string class that hides the internal encoding format and implementation? In other words, if you do the equivalent of print(s) where s is a text string with a mixture of Greek, Cyrillic, Hindi, Chinese, Japanese, and Korean chars, do you see the characters, or some representation of the internal bytes?



--
Terry Jan Reedy


--
https://mail.python.org/mailman/listinfo/python-list
