On 2017-01-20 23:06, Chris Kaynor wrote:
On Fri, Jan 20, 2017 at 2:35 PM, Pete Forman <petef4+use...@gmail.com> wrote:
Can anyone point me at a rationale for PEP 393 being incorporated in
Python 3.3 over using UTF-8 as an internal string representation? I've
found good articles by Nick Coghlan, Armin Ronacher and others on the
matter. What I have not found is discussion of pros and cons of
alternatives to the old narrow or wide implementation of Unicode
strings.

The PEP itself has the rationale for the problems with the narrow/wide
idea; quoting from https://www.python.org/dev/peps/pep-0393/:
There are two classes of complaints about the current implementation
of the unicode type: on systems only supporting UTF-16, users complain
that non-BMP characters are not properly supported. On systems using
UCS-4 internally (and also sometimes on systems using UCS-2), there is
a complaint that Unicode strings take up too much memory - especially
compared to Python 2.x, where the same code would often use ASCII
strings (i.e. ASCII-encoded byte strings). With the proposed approach,
ASCII-only Unicode strings will again use only one byte per character;
while still allowing efficient indexing of strings containing non-BMP
characters (as strings containing them will use 4 bytes per
character).

Basically, narrow builds had very odd behavior with non-BMP
characters, namely that indexing into the string could easily produce
mojibake. Wide builds used quite a bit more memory, which generally
translates to reduced performance.
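
To make that concrete, here is a small sketch (mine, not from the PEP or
the original mails) of how a non-BMP character behaves under PEP 393, and
how the storage depends on the widest character in the string; exact byte
counts vary by platform:

# Requires Python 3.3+ (PEP 393 strings).
import sys

s = "a\U0001F600b"   # 'a', an emoji outside the BMP, 'b'

# On a pre-3.3 narrow build, len(s) was 4 and s[1] was a lone surrogate,
# so slicing could split the emoji in half. With PEP 393, indexing is
# per code point:
assert len(s) == 3
assert s[1] == "\U0001F600"

# PEP 393 stores ASCII-only strings with one byte per character and only
# widens the storage when the string actually contains wider characters.
print(sys.getsizeof("x" * 100))           # roughly 1 byte per character
print(sys.getsizeof("\u0394" * 100))      # roughly 2 bytes per character
print(sys.getsizeof("\U0001F600" * 100))  # roughly 4 bytes per character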

ISTM that most operations on strings are via iterators and thus agnostic
to variable or fixed width encodings. How important is it to be able to
get to part of a string with a simple index? Just because old skool
strings could be treated as a sequence of characters, is that a reason
to shoehorn the subtleties of Unicode into that model?

I think you are underestimating how much string code relies on indexing.
With a UTF-8 representation, any indexing operation on a string that
contains multi-byte characters has to start from index 0 - you can never
safely jump in anywhere else. rfind/rsplit/rindex/rstrip and the other
reverse functions would have to walk the string from start to end rather
than short-circuiting by reading from right to left. With indexing
becoming linear time, many simple algorithms have to be written with
that in mind to avoid O(n*n) behaviour. Such performance regressions
often go unnoticed by developers, who are likely to be testing with
small data, and can turn into (accidental) DoS problems on real data.
The old narrow (UTF-16) builds had exactly the same issue, except that
they did not account for it at all - indexing just returned UTF-16 code
units, which is what caused the mojibake problems. Only a UTF-32 or
PEP 393 implementation avoids this.
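
To illustrate the kind of code that bites you there (my own sketch, not
from Chris's mail): the loop below is O(n) with constant-time indexing,
as PEP 393 gives you, but would be O(n*n) if every s[i] had to decode a
UTF-8 buffer from the start:

def first_mismatch(a, b):
    """Return the index of the first differing code point, or -1 if equal."""
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] != b[i]:   # one indexing operation per iteration
            return i
    return -1 if len(a) == len(b) else n

print(first_mismatch("spam and eggs", "spam and spam"))   # -> 9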

You could implement rsplit and rstrip easily enough, but rfind and rindex return a code point index, so you'd still need to scan the string from the start to compute that.
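
Rough sketch of what I mean (illustrative only - this works on UTF-8
bytes rather than on a real alternative str implementation): the match
itself can be found right to left, but turning the byte offset into a
code point index still forces a scan of everything before it:

def utf8_rfind(data: bytes, needle: bytes) -> int:
    """rfind over UTF-8 bytes, returning a code point index like str.rfind."""
    byte_pos = data.rfind(needle)      # locating the rightmost match is fine
    if byte_pos == -1:
        return -1
    # Converting the byte offset into a code point index means counting
    # the non-continuation bytes before the match - a left-to-right scan.
    return sum(1 for b in data[:byte_pos] if b & 0xC0 != 0x80)

text = "héllo héllo"
assert utf8_rfind(text.encode("utf-8"), "é".encode("utf-8")) == text.rfind("é") == 7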

Note that from a user's point of view (and that includes most, if not
almost all, developers), PEP 393 strings can be treated as if they were
UTF-32, but with many of the benefits of UTF-8. As far as I'm aware,
only developers writing extension modules need to care - and only then
if they need maximum performance and thus cannot convert every string
they access to UTF-32 or UTF-8.

As someone who has written an extension, I can tell you that I much prefer dealing with a fixed number of bytes per codepoint to a variable number of bytes per codepoint, especially as I'm also supporting earlier versions of Python where that was the case.

--
https://mail.python.org/mailman/listinfo/python-list
