On 7/14/2017 10:30 AM, Michael Torrie wrote:
On 07/14/2017 07:31 AM, Marko Rauhamaa wrote:
Of course, UTF-8 in a bytes object doesn't make the situation any
better, but does it make it any worse?
As it stands, we have
è --[encode>-- Unicode --[reencode>-- UTF-8
Why is one encoding format better than the other?
All digital data are ultimately bits, usually collected together in
groups of 8, called bytes. The point of Python 3 is that text should
normally be instances of a text class, separate from the raw bytes
class, with a defined internal encoding. The actual internal encoding
is secondary. And it changed in 3.3.
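To make the text/bytes separation concrete, here is a minimal sketch
(the variable names are mine, purely for illustration). It shows only
the user-visible behavior and says nothing about how CPython lays the
string out in memory:

```python
# A str holds code points; a bytes object holds raw 8-bit values.
text = "\u00e8"                  # the single character 'è' (U+00E8)
data = text.encode("utf-8")      # its UTF-8 encoding: b'\xc3\xa8'

length_as_text = len(text)       # counts code points
length_as_bytes = len(data)      # counts bytes

# Decoding with the same codec recovers the original text.
round_trip = data.decode("utf-8") == text

# The two types do not mix implicitly in Python 3.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

The same statements give the same results on 3.2 and 3.3+, even though
the internal storage of str differs between them.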
Python ints are encoded bytes, so are floats, and everything else. When
one prints a float, one certainly does not see a representation of the
raw bytes in the float object. Instead, one sees a representation of
the value it represents. There is a proposal to change the internal
encoding of int, at least on 64-bit machines, which are now standard.
However, because print(87987282738472387429748) prints
87987282738472387429748 and not the internal bytes, the change in the
internal bytes will not affect the user view of ints.
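A small sketch of that distinction follows. Note the assumption: the
bytes produced here are one possible serialization (IEEE 754 binary64
via struct, and a big-endian dump via int.to_bytes), not necessarily
what CPython stores inside the objects.

```python
import struct

x = 1.5
raw_float = struct.pack("<d", x)   # 8 bytes, little-endian IEEE 754 double
shown_float = repr(x)              # what print(x) displays

n = 87987282738472387429748
# One byte-level view of the int; the width is computed from bit_length().
raw_int = n.to_bytes((n.bit_length() + 7) // 8, "big")
shown_int = str(n)                 # what print(n) displays

# The printed form depends on the value, not on the byte layout:
# a different byte order round-trips to the same int and the same text.
same_value = int.from_bytes(n.to_bytes(16, "little"), "little")
```

Here str(same_value) equals shown_int, although the two byte strings
differ in both length and order.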
This is precisely the logic behind Google using UTF-8 for strings in Go,
rather than having some O(1) abstract type like Python has. And many
other languages do the same. The argument is that because of the very
issues that you mention, having O(1) lookup in a string isn't that
important, since looking up a particular index in a unicode string is
rarely the right thing to do, so UTF-8 is just fine as a native,
in-memory type.
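For illustration only, here is a rough Python sketch of why indexing by
code point is not O(1) on UTF-8 data: the bytes have to be scanned from
the start. The helper name utf8_offset_of is hypothetical, and this is
not a description of how Go or any other runtime implements strings.

```python
def utf8_offset_of(data: bytes, index: int) -> int:
    """Return the byte offset at which code point number `index` starts.

    Assumes `data` is valid UTF-8. The scan is linear in len(data).
    """
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes have the form 0b10xxxxxx; any other byte
        # begins a new code point.
        if (byte & 0xC0) != 0x80:
            if seen == index:
                return offset
            seen += 1
    raise IndexError(index)


s = "a\u00e8\u4e2d!"          # 'a', 'è', '中', '!'
data = s.encode("utf-8")      # 1 + 2 + 3 + 1 bytes

third_char = s[2]                       # direct index on a str
third_offset = utf8_offset_of(data, 2)  # linear scan on the bytes
```

Iterating over the whole string, which is the more common operation,
costs the same either way.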
Does Go use bytes for text, like most people did in Python 2, or a separate
text string class that hides the internal encoding format and
implementation? In other words, if you do the equivalent of print(s)
where s is a text string with a mixture of Greek, Cyrillic, Hindi,
Chinese, Japanese, and Korean chars, do you see the characters, or some
representation of the internal bytes?
--
Terry Jan Reedy