On Sun, Oct 13, 2019 at 12:41:55PM -0700, Andrew Barnert via Python-ideas wrote:
> On Oct 13, 2019, at 12:02, Steve Jorgensen <[email protected]> wrote:
[...]
> > This proposal is a serious breakage of backward compatibility, so
> > would be something for Python 4.x, not 3.x.
>
> I’m pretty sure almost nobody wants a 3.0-like break again, so this
> will probably never happen.
Indeed, and Guido did rule some time ago that 4.0 would be an ordinary
transition, like 3.7 to 3.8, not a big backwards-breaking version
change.
I've taken up referring to some hypothetical future 3.0-like version as
Python 5000 (not 4000) in analogy to Python 3000, but to emphasise just
how far away it will be.
> And finally, if you want to break strings, it’s probably worth at
> least considering making UTF-8 strings first-class objects. They can’t
> be randomly accessed,
I don't see why you can't make arrays of UTF-8 indexable and provide
random access to any code point. I understand that ``str`` in
Micropython is implemented that way.
The obvious implementation means that you lose O(1) indexing (to reach
the N-th code point, you have to count from the beginning each time) but
save memory over other encodings. (At worst, a code point in UTF-8 takes
four bytes, the same as UTF-32, but most text needs only one to three
bytes per code point.) There are ways to get back O(1) indexing, but
they cost more memory.
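To make the trade-off concrete, here is a minimal sketch of O(n)
code-point indexing over raw UTF-8 bytes, in the spirit of the
MicroPython approach described above. The function name and details are
illustrative, not any real API: it scans from the start, skipping
continuation bytes (those of the form 0b10xxxxxx), so reaching the
N-th code point costs time proportional to the length of the data.

```python
def utf8_index(data: bytes, n: int) -> str:
    """Return the n-th code point of a valid UTF-8 byte string.

    O(len(data)): scan from the start, counting lead bytes.
    A lead byte is any byte that is NOT of the form 0b10xxxxxx.
    """
    count = -1
    start = None
    for i, b in enumerate(data):
        if b & 0xC0 != 0x80:         # lead byte of a new code point
            count += 1
            if count == n:
                start = i            # remember where code point n begins
            elif start is not None:  # we just passed the end of it
                return data[start:i].decode('utf-8')
    if start is not None:            # n was the last code point
        return data[start:].decode('utf-8')
    raise IndexError(n)

b = "héllo✓".encode('utf-8')
print(utf8_index(b, 1))   # é (a two-byte sequence)
print(utf8_index(b, 5))   # ✓ (a three-byte sequence)
```

The ways to "get back O(1)" mentioned above typically add an index side
table (offsets of every k-th code point), trading memory for speed.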
But why would you want an explicit UTF-8 string object? What benefit
do you get from exposing the fact that the implementation happens to be
UTF-8 rather than something else? (Not rhetorical questions.)
If the UTF-8 object operates on the basis of Unicode code points, then
it's just a str, and the implementation is just an implementation detail.
If the UTF-8 object operates on the basis of raw bytes, with no
protection against malformed UTF-8 (e.g. allowing you to insert bytes
such as 0xC0 or 0xFF, which never appear in valid UTF-8, or to split
apart a multi-byte UTF-8 sequence) then it's just a bytes object (or
bytearray) initialised with a UTF-8 sequence.
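A quick demonstration that bytes already gives you exactly this
behaviour: nothing stops a bytes object from holding malformed UTF-8,
and validity is only enforced when you decode.

```python
good = "abç".encode('utf-8')   # b'ab\xc3\xa7' -- valid UTF-8
bad = good + b'\xff'           # 0xff never appears in valid UTF-8

try:
    bad.decode('utf-8')
    valid = True
except UnicodeDecodeError:
    valid = False
print(valid)                   # False: decoding catches the bad byte

# Splitting a multi-byte sequence is equally easy at the bytes level:
truncated = "ç".encode('utf-8')[:1]   # first byte of a 2-byte sequence
```

So a "UTF-8 string" that permits arbitrary bytes is operationally
indistinguishable from bytes with a UTF-8 convention attached.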
That is, as I understand it, what languages like Go do. To paraphrase,
they offer data types they *call* UTF-8 strings, except that they can
contain arbitrary bytes and be invalid UTF-8. We can already do this,
today, without the deeply misleading name:
string.encode('utf-8')
and then work with the bytes. I think this is even quite efficient in
CPython's "Flexible string representation". For ASCII-only strings, the
UTF-8 encoding uses the same storage as the original ASCII bytes. For
others, the UTF-8 representation is cached for later use.
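For example, working with the encoded bytes directly is already
straightforward, and for ASCII-only text the UTF-8 form is byte-for-byte
the same size as the string's code-point count:

```python
s = "hello world"               # ASCII-only
b = s.encode('utf-8')
assert len(b) == len(s)         # one byte per code point for ASCII
assert b.decode('utf-8') == s   # lossless round trip

t = "naïve ✓"
print(len(t))                   # 7 code points...
print(len(t.encode('utf-8')))   # ...but 10 UTF-8 bytes
```

The cached-UTF-8 behaviour is an internal CPython optimisation (the
flexible string representation of PEP 393), so the second and later
calls to .encode('utf-8') on the same string can be cheap.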
So I don't see any advantage to this UTF-8 object. If the API works on
code points, then it's just an implementation detail of str; if the API
works on code units, that's just a fancy name for bytes. We already have
both str and bytes, so what is the purpose of this utf8 object?
--
Steven
_______________________________________________
Python-ideas mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at
https://mail.python.org/archives/list/[email protected]/message/RKY73YB2UVJMZ2PNIYJ74AFVKUAIK45K/
Code of Conduct: http://python.org/psf/codeofconduct/