On Sun, Oct 13, 2019 at 12:41:55PM -0700, Andrew Barnert via Python-ideas wrote:
> On Oct 13, 2019, at 12:02, Steve Jorgensen <[email protected]> wrote:
[...]
> > This proposal is a serious breakage of backward compatibility, so
> > would be something for Python 4.x, not 3.x.
>
> I’m pretty sure almost nobody wants a 3.0-like break again, so this
> will probably never happen.
Indeed, and Guido did rule some time ago that 4.0 would be an ordinary
transition, like 3.7 to 3.8, not a big backwards-breaking version
change.
I've taken up referring to some hypothetical future 3.0-like version as
Python 5000 (not 4000) in analogy to Python 3000, but to emphasise just
how far away it will be.
> And finally, if you want to break strings, it’s probably worth at
> least considering making UTF-8 strings first-class objects. They can’t
> be randomly accessed,
I don't see why you can't make arrays of UTF-8 indexable and provide
random access to any code point. I understand that ``str`` in
Micropython is implemented that way.
The obvious implementation means that you lose O(1) indexing (to reach
the N-th code point, you have to count from the beginning each time) but
save memory over other encodings. (At worst, a code point in UTF-8 takes
four bytes, the same as UTF-32, but most text needs only one to three
bytes per code point.) There are ways to get back O(1) indexing, but
they cost more memory.
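To make the trade-off concrete, here is a minimal sketch of O(n)
code-point indexing over raw UTF-8 bytes, in the spirit of the
MicroPython approach described above. The function name and details are
illustrative, not any real API: it scans from the start, skipping
continuation bytes (those of the form 0b10xxxxxx), so reaching the
N-th code point costs time proportional to the length of the data.

```python
def utf8_index(data: bytes, n: int) -> str:
    """Return the n-th code point of a valid UTF-8 byte string.

    O(len(data)): scan from the start, counting lead bytes.
    A lead byte is any byte that is NOT of the form 0b10xxxxxx.
    """
    count = -1
    start = None
    for i, b in enumerate(data):
        if b & 0xC0 != 0x80:         # lead byte of a new code point
            count += 1
            if count == n:
                start = i            # remember where code point n begins
            elif start is not None:  # we just passed the end of it
                return data[start:i].decode('utf-8')
    if start is not None:            # n was the last code point
        return data[start:].decode('utf-8')
    raise IndexError(n)

b = "héllo✓".encode('utf-8')
print(utf8_index(b, 1))   # é (a two-byte sequence)
print(utf8_index(b, 5))   # ✓ (a three-byte sequence)
```

The ways to "get back O(1)" mentioned above typically add an index side
table (offsets of every k-th code point), trading memory for speed.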
But why would you want an explicit UTF-8 string object? What benefit
do you get from exposing the fact that the implementation happens to be
UTF-8 rather than something else? (Not rhetorical questions.)
If the UTF-8 object operates on the basis of Unicode code points, then
it's just a str, and the implementation is just an implementation detail.
If the UTF-8 object operates on the basis of raw bytes, with no
protection against malformed UTF-8 (e.g. allowing you to insert bytes
such as 0xC0 or 0xFF, which never appear in valid UTF-8, or to split
apart a multi-byte UTF-8 sequence) then it's just a bytes object (or
bytearray) initialised with a UTF-8 sequence.
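A quick demonstration that bytes already gives you exactly this
behaviour: nothing stops a bytes object from holding malformed UTF-8,
and validity is only enforced when you decode.

```python
good = "abç".encode('utf-8')   # b'ab\xc3\xa7' -- valid UTF-8
bad = good + b'\xff'           # 0xff never appears in valid UTF-8

try:
    bad.decode('utf-8')
    valid = True
except UnicodeDecodeError:
    valid = False
print(valid)                   # False: decoding catches the bad byte

# Splitting a multi-byte sequence is equally easy at the bytes level:
truncated = "ç".encode('utf-8')[:1]   # first byte of a 2-byte sequence
```

So a "UTF-8 string" that permits arbitrary bytes is operationally
indistinguishable from bytes with a UTF-8 convention attached.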
That is, as I understand it, what languages like Go do. To paraphrase,
they offer data types they *call* UTF-8 strings, except that they can
contain arbitrary bytes and be invalid UTF-8. We can already do this,
today, without the deeply misleading name:
string.encode('utf-8')
and then work with the bytes. I think this is even quite efficient in
CPython's "Flexible string representation". For ASCII-only strings, the
UTF-8 encoding uses the same storage as the original ASCII bytes. For
others, the UTF-8 representation is cached for later use.
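For example, working with the encoded bytes directly is already
straightforward, and for ASCII-only text the UTF-8 form is byte-for-byte
the same size as the string's code-point count:

```python
s = "hello world"               # ASCII-only
b = s.encode('utf-8')
assert len(b) == len(s)         # one byte per code point for ASCII
assert b.decode('utf-8') == s   # lossless round trip

t = "naïve ✓"
print(len(t))                   # 7 code points...
print(len(t.encode('utf-8')))   # ...but 10 UTF-8 bytes
```

The cached-UTF-8 behaviour is an internal CPython optimisation (the
flexible string representation of PEP 393), so the second and later
calls to .encode('utf-8') on the same string can be cheap.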
So I don't see any advantage to this UTF-8 object. If the API works on
code points, then it's just an implementation detail of str; if the API
works on code units, that's just a fancy name for bytes. We already have
both str and bytes, so what is the purpose of this utf8 object?
--
Steven
_______________________________________________
Python-ideas mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at
https://mail.python.org/archives/list/[email protected]/message/RKY73YB2UVJMZ2PNIYJ74AFVKUAIK45K/
Code of Conduct: http://python.org/psf/codeofconduct/