Terry Reedy <tjre...@udel.edu>:

> On 7/14/2017 10:30 AM, Michael Torrie wrote:
>> On 07/14/2017 07:31 AM, Marko Rauhamaa wrote:
>>> Of course, UTF-8 in a bytes object doesn't make the situation any
>>> better, but does it make it any worse?
>>
>>>
>>> As it stands, we have
>>>
>>>    è --[encode>-- Unicode --[reencode>-- UTF-8
>>>
>>> Why is one encoding format better than the other?
>
> All digital data are ultimately bits, usually collected together in
> groups of 8, called bytes.

Naturally.

> The point of python 3 is that text should normally be instances of a
> text class, separate from the raw bytes class, with a defined internal
> encoding.

And I called its usefulness into question.

>> This is precisely the logic behind Google using UTF-8 for strings in Go,
>> rather than having some O(1) abstract type like Python has. And many
>> other languages do the same. The argument is that because of the very
>> issues that you mention, having O(1) lookup in a string isn't that
>> important, since looking up a particular index in a unicode string is
>> rarely the right thing to do, so UTF-8 is just fine as a native,
>> in-memory type.
>
> Does go use bytes for text, like most people did in Python 2,

Yes. Also, C and the GNU textutils do that.

> a separate text string class, that hides the internal encoding format
> and implementation? In other words, if you do the equivalent of
> print(s) where s is a text string with a mixture of greek, cyrillic,
> hindi, chinese, japanese, and korean chars, do you see the characters,
> or some representation of the internal bytes?

Yes, in Python2, Go, C and GNU textutils, when you print a text string
containing a mixture of languages, you see characters. Why? Because
that's what the terminal emulator chooses to do upon receiving those
bytes.


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list
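
For anyone who wants to try the terminal point in Python 3, here is a minimal sketch. The sample string and the helper name emit_raw are made up for illustration, and it assumes a terminal that interprets its input as UTF-8. It encodes a mixed-script string by hand and writes the raw bytes to standard output, bypassing the str layer; on such a terminal the characters should still show up, which suggests the rendering is done by the terminal rather than by the text class.

```python
import sys

# Sample text mixing several scripts (chosen here purely for illustration).
text = "\u00e8 αβγ где 日本語 한국어"

# Python 3 keeps text (str) and raw bytes apart, so encoding is explicit.
data = text.encode("utf-8")

# Indexing a str yields a code point; indexing bytes yields an integer.
first_char = text[0]   # the single code point U+00E8
first_byte = data[0]   # 0xC3, the first of the two bytes that encode it


def emit_raw(payload, stream):
    """Write raw bytes plus a newline to a binary stream (hypothetical helper)."""
    stream.write(payload + b"\n")
    stream.flush()


if __name__ == "__main__":
    # sys.stdout.buffer is the binary stream underneath the text-mode stdout.
    # It can be missing when stdout has been replaced, hence the getattr.
    raw_stdout = getattr(sys.stdout, "buffer", None)
    if raw_stdout is not None:
        emit_raw(data, raw_stdout)
```

On a terminal configured for some other charset, the same bytes would come out as mojibake, which is the other side of the same observation.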