On Sat, 21 Jan 2017 09:35 am, Pete Forman wrote:

> Can anyone point me at a rationale for PEP 393 being incorporated in
> Python 3.3 over using UTF-8 as an internal string representation?
I've read over the PEP, and the email discussion, and there is very little
mention of UTF-8, and as far as I can see no counter-proposal for using
UTF-8. However, there are a few mentions of UTF-8 that suggest the
participants were aware of it as an alternative and simply didn't think it
was worth considering. I don't know why.

You can read the PEP and the mailing list discussion here:

The PEP:
https://www.python.org/dev/peps/pep-0393/

Mailing list discussion starts here:
https://mail.python.org/pipermail/python-dev/2011-January/107641.html

Stefan Behnel (author of Cython) states that UTF-8 is much harder to use:
https://mail.python.org/pipermail/python-dev/2011-January/107739.html

I see nobody challenging that claim, so perhaps there was simply broad
enough agreement that UTF-8 would have been more work, and so nobody wanted
to propose it. I'm just guessing, though.

Perhaps it would have been too big a change to adapt the CPython internals
to variable-width UTF-8 from the existing fixed-width UTF-16 and UTF-32
implementations? (I know that UTF-16 is actually variable-width, but Python
prior to PEP 393 treated it as if it were fixed.)

There was a much earlier discussion about the internal implementation of
Unicode strings:
https://mail.python.org/pipermail/python-3000/2006-September/003795.html

including some discussion of UTF-8:
https://mail.python.org/pipermail/python-3000/2006-September/003816.html

It too proposed using a three-way internal implementation, and made it
clear that O(1) indexing was a requirement. Here's a comment explicitly
pointing out that constant-time indexing is wanted, and that using UTF-8
with a two-level table destroys any space advantage UTF-8 might have:
https://mail.python.org/pipermail/python-3000/2006-September/003822.html

Ironically, Martin v.
Löwis, the author of PEP 393, originally opposed a three-way internal
representation, calling it "terrible":
https://mail.python.org/pipermail/python-3000/2006-September/003891.html

Another factor which I didn't see discussed anywhere is that Python strings
treat surrogates as normal code points. I believe that would be troublesome
for a UTF-8 implementation:

py> '\uDC37'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc37' in
position 0: surrogates not allowed

but of course with a UCS-2 or UTF-32 implementation it is trivial: you just
treat the surrogate as another code point like any other.

[...]

> ISTM that most operations on strings are via iterators and thus agnostic
> to variable or fixed width encodings.

Slicing is not:

start = text.find(":")
end = text.rfind("!")
assert end > start
chunk = text[start:end]

But even with iteration, we would still expect indexes to be consecutive:

for i, c in enumerate(text):
    assert c == text[i]

The complexity of those operations would be greatly increased with UTF-8.
Of course you can make it work, and you can even hide the fact that UTF-8
has variable-width code points. But you can't have all three of:

- simplicity;
- memory efficiency;
- O(1) operations

with UTF-8. But of course, I'd be happy for a competing Python
implementation to use UTF-8 and prove me wrong!

-- 
Steve

“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

-- 
https://mail.python.org/mailman/listinfo/python-list
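P.S. To make the indexing trade-off above concrete, here is a minimal
sketch (mine, not from the thread; `utf8_index` is a hypothetical helper,
not anything CPython actually provides) of what `text[i]` would have to do
under a naive variable-width UTF-8 representation: walk the byte buffer
from the start, because there is no way to jump straight to code point i.

```python
def utf8_index(buf: bytes, i: int) -> str:
    """Return the i-th code point of the UTF-8 encoded buffer `buf`.

    With a variable-width encoding, each code point occupies 1-4 bytes,
    so finding code point i requires an O(n) scan from the front --
    compared with the O(1) array lookup PEP 393's fixed-width
    representations guarantee.
    """
    def width(lead: int) -> int:
        # Sequence length is determined by the lead byte's high bits.
        if lead < 0x80:
            return 1          # ASCII
        elif lead < 0xE0:
            return 2          # 110xxxxx
        elif lead < 0xF0:
            return 3          # 1110xxxx
        else:
            return 4          # 11110xxx

    pos = 0
    for _ in range(i):        # skip the first i code points
        pos += width(buf[pos])
    # Decode just the one code point starting at pos.
    return buf[pos:pos + width(buf[pos])].decode('utf-8')

# 1-, 2-, 3- and 4-byte code points in one string:
buf = "aé漢🐍".encode('utf-8')
assert utf8_index(buf, 1) == "é"
assert utf8_index(buf, 3) == "🐍"
```

The two-level index table mentioned in the 2006 thread is precisely an
attempt to avoid this scan, at the cost of the extra memory that erodes
UTF-8's space advantage.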