Chris Kaynor <ckay...@zindagigames.com> writes:

> On Fri, Jan 20, 2017 at 2:35 PM, Pete Forman <petef4+use...@gmail.com> wrote:
>> Can anyone point me at a rationale for PEP 393 being incorporated in
>> Python 3.3 over using UTF-8 as an internal string representation?
>> I've found good articles by Nick Coghlan, Armin Ronacher and others
>> on the matter. What I have not found is discussion of pros and cons
>> of alternatives to the old narrow or wide implementation of Unicode
>> strings.
>
> The PEP itself has the rationale for the problems with the
> narrow/wide idea; to quote from
> https://www.python.org/dev/peps/pep-0393/:
>
>   There are two classes of complaints about the current
>   implementation of the unicode type: on systems only supporting
>   UTF-16, users complain that non-BMP characters are not properly
>   supported. On systems using UCS-4 internally (and also sometimes
>   on systems using UCS-2), there is a complaint that Unicode strings
>   take up too much memory - especially compared to Python 2.x, where
>   the same code would often use ASCII strings (i.e. ASCII-encoded
>   byte strings). With the proposed approach, ASCII-only Unicode
>   strings will again use only one byte per character; while still
>   allowing efficient indexing of strings containing non-BMP
>   characters (as strings containing them will use 4 bytes per
>   character).
>
> Basically, narrow builds had very odd behavior with non-BMP
> characters, namely that indexing into the string could easily
> produce mojibake. Wide builds used quite a bit more memory, which
> generally translates to reduced performance.
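(The memory behaviour described there is easy to demonstrate on a
3.3+ build. A minimal check, assuming CPython, whose exact totals
include per-object overhead and vary by version:

    import sys

    # PEP 393 picks the storage per string: 1 byte/char if every code
    # point fits in Latin-1, 2 bytes/char within the BMP, 4 bytes/char
    # once any non-BMP character appears.
    for s in ('a' * 100,            # ASCII          -> ~1 byte/char
              '\u0424' * 100,       # BMP, Cyrillic  -> ~2 bytes/char
              '\U0001D11E' * 100):  # non-BMP        -> ~4 bytes/char
        print(len(s), sys.getsizeof(s))

Note that len() reports 100 in all three cases; a narrow build would
have reported 200 for the non-BMP string, one per surrogate half.)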
I'm taking as a given that the old way was often sub-optimal in many
scenarios. My questions were about the alternatives, and why PEP 393
was chosen over other approaches.

>> ISTM that most operations on strings are via iterators and thus
>> agnostic to variable or fixed width encodings. How important is it
>> to be able to get to part of a string with a simple index? Just
>> because old skool strings could be treated as a sequence of
>> characters, is that a reason to shoehorn the subtleties of Unicode
>> into that model?
>
> I think you are underestimating the indexing usages of strings.
> Every operation on a string using UTF8 that contains larger
> characters must be completed by starting at index 0 - you can never
> start anywhere else safely. rfind/rsplit/rindex/rstrip and the other
> related reverse functions would require walking the string from
> start to end, rather than short-circuiting by reading from right to
> left. With indexing becoming linear time, many simple algorithms
> need to be written with that in mind, to avoid n*n time. Such
> performance regressions can often go unnoticed by developers, who
> are likely to be testing with small data, and thus may cause
> (accidental) DOS attacks when used on real data. The exact same
> problems occur with the old narrow builds (UTF16; note that this was
> NOT implemented in those builds, however, which caused the mojibake
> problems) as well - only a UTF32 or PEP393 implementation can avoid
> those problems.

I was asserting that most useful operations on strings start from
index 0. The r* operations would not be slowed down that much, as
UTF-8 has the useful property that a byte in the middle of a sequence
(in the sense of a code point's byte sequence, not a Python sequence)
is instantly recognisable as such, and so is quick to skip over while
working backwards from the end. (There is a sketch of that backward
scan at the end of this message.)

The only significant use of an index dereference that I could come up
with was the result of a find() or index(). I put out this public
question so that I could be clued in on other uses. My personal
experience is that in most cases where I might consider find() I end
up using re instead, taking the result from match groups, which hold
copies of the (sub)strings that I want.

> Note that from a user's point of view (including most developers, if
> not almost all), PEP393 strings can be treated as if they were
> UTF32, but with many of the benefits of UTF8. As far as I'm aware,
> it is only developers writing extension modules that need to care -
> and only then if they need maximum performance, and thus cannot
> convert every string they access to UTF32 or UTF8.

PEP 393 already says that "the specification chooses UTF-8 as the
recommended way of exposing strings to C code".
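To make the backward scan concrete, here is a rough sketch over raw
UTF-8 bytes (the helper names are my own invention, not anything in
the stdlib, and well-formed input is assumed). A continuation byte
always matches the bit pattern 10xxxxxx, so resynchronising to the
start of a code point never takes more than three backward steps:

    # Sketch only: right-to-left search over well-formed UTF-8 bytes.
    def utf8_char_start(data: bytes, i: int) -> int:
        # Continuation bytes are 0b10xxxxxx; lead bytes are not.
        while data[i] & 0xC0 == 0x80:
            i -= 1
        return i

    def utf8_rfind(data: bytes, needle: bytes) -> int:
        # Returns a byte offset (not a code point index), or -1.
        i = len(data) - len(needle)
        while i >= 0:
            i = utf8_char_start(data, i)   # snap to a boundary
            if data[i:i + len(needle)] == needle:
                return i
            i -= 1
        return -1

    haystack = 'αβγ-αβγ'.encode('utf-8')
    print(utf8_rfind(haystack, 'β'.encode('utf-8')))   # -> 9

Each candidate position costs O(1) resynchronisation rather than a
scan from index 0, which is why I don't think the r* family is the
killer argument against UTF-8.

--
Pete Forman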