On Thu, Jul 25, 2013 at 8:48 PM, Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
> UTF-8 uses a flexible representation on a character-by-character basis.
> When parsing UTF-8, one needs to look at EVERY character to decide how
> many bytes you need to read. In Python 3, the flexible representation is
> on a string-by-string basis: once Python has looked at the string header,
> it can tell whether the *entire* string takes 1, 2 or 4 bytes per
> character, and the string is then fixed-width. You can't do that with
> UTF-8.

UTF-8 does not use a flexible representation. A codec that is encoding a string in UTF-8 and examining a particular character does not have any choice of how to encode that character; there is exactly one sequence of bits that is the UTF-8 encoding for the character. Further, for any given sequence of code points there is exactly one sequence of bytes that is the UTF-8 encoding of those code points.

In contrast, with the FSR there are as many as three different sequences of bytes that encode a sequence of code points, with one of them (the shortest) being canonical. That's what makes it flexible.

Anyway, my point was just that Emacs is not a counter-example to jmf's claim about implementing text editors, because UTF-8 is not what he (or anybody else) is referring to when speaking of the FSR or "something like the FSR".
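
To make the distinction concrete, here is a rough sketch in Python 3. The helper names (utf8_widths, fsr_width, fsr_candidates) are made up for illustration, and the latin-1 / utf-16-le / utf-32-le codecs are only stand-ins for the 1-, 2- and 4-byte layouts described in PEP 393; CPython's actual in-memory representation uses native byte order and is not exposed this way.

```python
def utf8_widths(text):
    """Number of UTF-8 bytes used by each code point in text."""
    return [len(ch.encode("utf-8")) for ch in text]


def fsr_width(text):
    """Bytes per code point an FSR-style layout would pick for the whole
    string: decided once, by the largest code point present."""
    largest = max(map(ord, text), default=0)
    if largest < 0x100:
        return 1
    if largest < 0x10000:
        return 2
    return 4


def fsr_candidates(text):
    """All fixed-width byte sequences that could hold text.

    Assumption: latin-1, utf-16-le and utf-32-le approximate the 1-, 2-
    and 4-byte layouts. Widths too narrow for the string are skipped.
    """
    codecs = {1: "latin-1", 2: "utf-16-le", 4: "utf-32-le"}
    narrowest = fsr_width(text)
    return {w: text.encode(c) for w, c in codecs.items() if w >= narrowest}


# UTF-8: the width varies per character, but each character has exactly
# one possible encoding.
sample = "a\u00e9\u20ac\U0001F600"
print(utf8_widths(sample))

# FSR-style: the width is fixed per string, and a pure-ASCII string can be
# stored three different ways, of which the 1-byte form is canonical.
for width, data in fsr_candidates("abc").items():
    print(width, len(data), data)
```

For "abc" this yields three distinct byte sequences for the same code points (3, 6 and 12 bytes long), whereas "abc".encode("utf-8") can only ever produce one.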