On Thu, 25 Jul 2013 15:45:38 -0500, Ian Kelly wrote:

> On Thu, Jul 25, 2013 at 12:18 PM, Steven D'Aprano
> <steve+comp.lang.pyt...@pearwood.info> wrote:
>> On Fri, 26 Jul 2013 01:36:07 +1000, Chris Angelico wrote:
>>
>>> On Fri, Jul 26, 2013 at 1:26 AM, Steven D'Aprano
>>> <steve+comp.lang.pyt...@pearwood.info> wrote:
>>>> On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote:
>>>>> "To conserve memory, Emacs does not hold fixed-length 22-bit numbers
>>>>> that are codepoints of text characters within buffers and strings.
>>>>> Rather, Emacs uses a variable-length internal representation of
>>>>> characters, that stores each character as a sequence of 1 to 5 8-bit
>>>>> bytes, depending on the magnitude of its codepoint[1]. For example,
>>>>> any ASCII character takes up only 1 byte, a Latin-1 character takes
>>>>> up 2 bytes, etc. We call this representation of text multibyte.
>>>>
>>>> Well, you've just proven what Vim users have always suspected: Emacs
>>>> doesn't really exist.
>>>
>>> ... lolwut?
>>
>>
>> JMF has explained that it is impossible, impossible I say!, to write an
>> editor using a flexible string representation. Since Emacs uses such a
>> flexible string representation, Emacs is impossible, and therefore
>> Emacs doesn't exist.
>>
>> QED.
>
> Except that the described representation used by Emacs is a variant of
> UTF-8, not an FSR. It doesn't have three different possible encodings
> for the letter 'a' depending on what other characters happen to be in
> the string.
>
> As I understand it, jfm would be perfectly happy if Python used UTF-8
> (or presumably the Emacs variant) as its internal string representation.

UTF-8 uses a flexible representation on a character-by-character basis.
When parsing UTF-8, you need to look at EVERY character to decide how
many bytes to read. In Python 3.3, the flexible representation is on a
string-by-string basis: once Python has looked at the string header, it
can tell whether the *entire* string takes 1, 2 or 4 bytes per character,
and from then on the string is fixed-width. You can't do that with UTF-8.

To put it in terms of pseudo-code:

# Python 3.3
def parse_string(astring):
    # Decision gets made once per string.
    if astring uses 1 byte:
        count = 1
    elif astring uses 2 bytes:
        count = 2
    else:
        count = 4
    while not done:
        char = convert(next(count bytes))

# UTF-8
def parse_string(astring):
    while not done:
        b = next(1 byte)
        # Decision gets made for every single char
        if uses 1 byte:
            char = convert(b)
        elif uses 2 bytes:
            char = convert(b, next(1 byte))
        elif uses 3 bytes:
            char = convert(b, next(2 bytes))
        else:
            char = convert(b, next(3 bytes))

So UTF-8 requires much more runtime overhead than Python 3.3, and Emacs's
variation can in fact require more bytes per character than either.
(UTF-8 and Python 3.3 can require up to four bytes, Emacs up to five.)
I'm not surprised that JMF would prefer UTF-8: he is completely out of
his depth, and is a fine example of the Dunning-Kruger effect in action.
He is so sure he is right based on so little evidence.

One advantage of UTF-8 is that a BMP character never needs more than
three bytes, whereas a fixed-width string that has to accommodate astral
characters spends four bytes on every character. For transmitting data
over the wire, or storage on disk, that's potentially up to a 25%
reduction in space, which is not to be sneezed at. (In practice the
encoded size is usually smaller still, since the most common characters
are encoded to 1 or 2 bytes, not 4.) But that comes at the cost of much
more runtime overhead, which in my opinion makes UTF-8 a second-class
string representation compared to fixed-width representations.

-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list
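
For anyone who wants to run the two pseudo-code loops above, here is a
rough Python sketch. It only models the idea and is not how CPython
implements PEP 393 internally. All the function names are made up for
illustration, and the fixed-width layout (little-endian code points) is
an assumption chosen for simplicity. It also assumes well-formed UTF-8
input and does no validation of continuation bytes.

```python
def utf8_sequence_length(lead_byte):
    """Return the length of the UTF-8 sequence that starts with lead_byte."""
    if lead_byte < 0x80:
        return 1
    if 0xC2 <= lead_byte <= 0xDF:
        return 2
    if 0xE0 <= lead_byte <= 0xEF:
        return 3
    if 0xF0 <= lead_byte <= 0xF4:
        return 4
    raise ValueError("not a valid UTF-8 lead byte: %#x" % lead_byte)


def decode_utf8(data):
    """Decode UTF-8 bytes; the width decision is made for every character."""
    chars = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chars.append(data[i:i + n].decode("utf-8"))
        i += n
    return "".join(chars)


def width_for(text):
    """Pick 1, 2 or 4 bytes per character from the largest code point."""
    largest = max(map(ord, text), default=0)
    if largest < 0x100:
        return 1
    if largest < 0x10000:
        return 2
    return 4


def encode_fixed_width(text):
    """Return (width, data) with every character stored in `width` bytes."""
    width = width_for(text)
    data = b"".join(ord(c).to_bytes(width, "little") for c in text)
    return width, data


def decode_fixed_width(data, width):
    """Decode fixed-width data; the width decision was made once, up front."""
    return "".join(
        chr(int.from_bytes(data[i:i + width], "little"))
        for i in range(0, len(data), width)
    )
```

With the sample string "a\u00e9\u20ac\U0001F600" (one character from each
UTF-8 length class), the UTF-8 form takes 1 + 2 + 3 + 4 = 10 bytes, while
the fixed-width form takes 4 bytes for each of the 4 characters, 16 in
total. That is the space-versus-lookup trade-off described above.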