On Thu, 25 Jul 2013 21:20:45 -0600, Ian Kelly wrote:

> On Thu, Jul 25, 2013 at 8:48 PM, Steven D'Aprano
> <steve+comp.lang.pyt...@pearwood.info> wrote:
>> UTF-8 uses a flexible representation on a character-by-character
>> basis. When parsing UTF-8, one needs to look at EVERY character to
>> decide how many bytes you need to read. In Python 3, the flexible
>> representation is on a string-by-string basis: once Python has looked
>> at the string header, it can tell whether the *entire* string takes
>> 1, 2 or 4 bytes per character, and the string is then fixed-width.
>> You can't do that with UTF-8.
>
> UTF-8 does not use a flexible representation.
I disagree, and so does Jeremy Sanders, who first pointed out the similarity between Emacs' UTF-8-based internal representation and Python's FSR. I'll quote from the Emacs documentation again:

"To conserve memory, Emacs does not hold fixed-length 22-bit numbers that are codepoints of text characters within buffers and strings. Rather, Emacs uses a variable-length internal representation of characters, that stores each character as a sequence of 1 to 5 8-bit bytes, depending on the magnitude of its codepoint. For example, any ASCII character takes up only 1 byte, a Latin-1 character takes up 2 bytes, etc."

And the Python FSR:

"To conserve memory, Python does not hold fixed-length 21-bit numbers that are codepoints of text characters within buffers and strings. Rather, Python uses a variable-length internal representation of characters, that stores each character as a sequence of 1 to 4 8-bit bytes, depending on the magnitude of the largest codepoint in the string. For example, any all-ASCII or all-Latin-1 string takes up only 1 byte per character, an all-BMP string takes up 2 bytes per character, etc."

See the similarity now? Both flexibly change the width used by code-points: UTF-8 based on the code-point itself, regardless of the rest of the string; Python based on the largest code-point in the string.

[...]

> Anyway, my point was just that Emacs is not a counter-example to jmf's
> claim about implementing text editors, because UTF-8 is not what he (or
> anybody else) is referring to when speaking of the FSR or "something
> like the FSR".

Whether JMF can see the similarities between different implementations of strings or not is beside the point; those similarities do exist. As do the differences, of course, but in this case the differences are in favour of Python's FSR. Even if your string is entirely Latin-1, a UTF-8 implementation *cannot know that*, and still has to walk the string byte-by-byte, checking whether the current code point requires 1, 2, 3 or 4 bytes, while an FSR implementation can simply record the fact that the string is pure Latin-1 at creation time, and then treat it as fixed-width from then on.

JMF claims that the FSR is "impossible" to use efficiently, and yet he supports encoding schemes which are *less* efficient. Go figure. He tells us he has no problem with any of the established UTF encodings, and yet the FSR internally uses UTF-16 and UTF-32. (Technically, it's UCS-2 rather than UTF-16, since there are no surrogate pairs, but the difference is insignificant.)

Having watched this issue from Day One, when JMF first complained about it, I believe this is entirely about denying any benefit to ASCII users. Had Python implemented a system identical to the current FSR except that it added a fourth category, "all ASCII", which used an eight-byte encoding scheme (thus making ASCII strings twice as expensive as strings including code points from the supplementary planes), JMF would be the scheme's number one champion.

I cannot see any other rational explanation for why JMF prefers broken, buggy Unicode implementations, or implementations which are equally expensive for all strings, over one which is demonstrably correct, demonstrably saves memory, and for realistic, non-contrived benchmarks is demonstrably faster, except that he wants to punish ASCII users more than he wants to support Unicode users.

--
Steven
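P.S. For anyone who wants to see the FSR's width selection for themselves, a few lines of Python will do it. This is only an illustrative sketch: the figures returned by sys.getsizeof() include a fixed per-object header and vary a little between platforms and point releases, but the bytes-per-character pattern is what matters.

import sys

samples = [
    ("all ASCII  ", "a" * 1000),           # 1 byte per character
    ("all Latin-1", "\xe9" * 1000),        # still 1 byte per character
    ("all BMP    ", "\u20ac" * 1000),      # 2 bytes per character
    ("non-BMP    ", "\U0001F600" * 1000),  # SMP code point: 4 bytes per character
]
for label, s in samples:
    print(label, sys.getsizeof(s))

On a 64-bit 3.3 build this prints roughly 1049, 1073, 2074 and 4076: about one, one, two and four bytes per character respectively, plus the constant header. The width is chosen once, when the string is created, so indexing stays O(1) from then on.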
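Contrast that with what any UTF-8 consumer has to do just to find the character boundaries. What follows is only a rough sketch, not CPython's decoder, and it doesn't validate continuation bytes, but it shows the lead-byte dispatch that no UTF-8 reader can avoid:

def utf8_char_widths(data):
    # Yield the byte length of each UTF-8 sequence in `data` (a bytes object).
    # list(utf8_char_widths("a\xe9\u20ac\U0001F600".encode("utf-8"))) -> [1, 2, 3, 4]
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            width = 1          # ASCII
        elif b >> 5 == 0b110:
            width = 2          # U+0080..U+07FF
        elif b >> 4 == 0b1110:
            width = 3          # rest of the BMP
        elif b >> 3 == 0b11110:
            width = 4          # supplementary planes
        else:
            raise ValueError("invalid lead byte at index %d" % i)
        yield width
        i += width

Every character costs a test before you even know where the next one starts, which is why "walk the string byte-by-byte" above is not hyperbole.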