On Sat, 13 Jul 2013 00:56:52 -0700, wxjmfauth wrote:

> I am convinced you are not conceptually understanding utf-8 very well. I
> wrote many times, "utf-8 does not produce bytes, but Unicode Encoding
> Units".

Just because you write it many times doesn't make it correct. You are
simply wrong. UTF-8 produces bytes. That's what gets written to files and
transmitted over networks: bytes, not "Unicode Encoding Units", whatever
they are.

> A similar coding scheme: iso-6937 .
>
> Try to write an editor, a text widget, with a coding scheme like
> the Flexible String Representation. You will quickly notice, it is
> impossible (understand correctly). (You do not need a computer, just a
> sheet of paper and a pencil) Hint: what is the character at the caret
> position?

That is a simple index operation into the buffer. If the caret position is
10 characters in, you index buffer[10-1] and it will give you the
character to the left of the caret. buffer[10] will give you the character
to the right of the caret. It is simple, trivial, and easy. The buffer
itself knows whether to look ahead 10 bytes, 10*2 bytes or 10*4 bytes.

Here is an example of such a tiny buffer, implemented in Python 3.3 with
the hated Flexible String Representation. In each example, imagine the
caret is five characters from the left:

12345|more characters here...

It works regardless of whether your characters are ASCII:

py> buffer = '12345ABCD...'
py> buffer[5-1]  # character to the left of the caret
'5'
py> buffer[5]  # character to the right of the caret
'A'

Latin 1:

py> buffer = '12345áßçð...'
py> buffer[5-1]  # character to the left of the caret
'5'
py> buffer[5]  # character to the right of the caret
'á'

Other BMP characters:

py> buffer = '12345αдᚪ∞...'
py> buffer[5-1]  # character to the left of the caret
'5'
py> buffer[5]  # character to the right of the caret
'α'

And Supplementary Plane Characters:

py> buffer = ('12345'
...           '\N{ALCHEMICAL SYMBOL FOR AIR}'
...           '\N{ALCHEMICAL SYMBOL FOR FIRE}'
...           '\N{ALCHEMICAL SYMBOL FOR EARTH}'
...           '\N{ALCHEMICAL SYMBOL FOR WATER}'
...           '...')
py> buffer
'12345🜁🜂🜃🜄...'
py> len(buffer)
12
py> buffer[5-1]  # character to the left of the caret
'5'
py> buffer[5]  # character to the right of the caret
'🜁'
py> import unicodedata
py> unicodedata.name(buffer[5])
'ALCHEMICAL SYMBOL FOR AIR'

And it all Just Works in Python 3.3. So much for "impossible to tell" what
the character at the caret is. It is *trivial*.

Ah, but how about Python 3.2? We set up the same buffer:

py> buffer = ('12345'
...           '\N{ALCHEMICAL SYMBOL FOR AIR}'
...           '\N{ALCHEMICAL SYMBOL FOR FIRE}'
...           '\N{ALCHEMICAL SYMBOL FOR EARTH}'
...           '\N{ALCHEMICAL SYMBOL FOR WATER}'
...           '...')
py> buffer
'12345🜁🜂🜃🜄...'
py> len(buffer)
16

Sixteen? Sixteen? Where did the extra four characters come from? They came
from *surrogate pairs*.

py> buffer[5-1]  # character to the left of the caret
'5'
py> buffer[5]  # character to the right of the caret
'\ud83d'

Funny, that looks different.

py> import unicodedata
py> unicodedata.name(buffer[5])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name

No name? Because buffer[5] is only *half* of the surrogate pair. It is
broken, and there is really no way of fixing that breakage in Python 3.2
with a narrow build. You can fix it with a wide build, but only at the
cost of every string, every name, using double the amount of storage,
whether it needs it or not.

-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list
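
For anyone without a 3.2 narrow build to hand, the sixteen can be
reproduced on Python 3.3 or later by encoding the same buffer to UTF-16
and counting code units. This is only a rough sketch: the helper name
utf16_code_units is made up for illustration, and it assumes that a narrow
build indexes strings by UTF-16 code unit, which is what the session above
shows.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints.

    Hypothetical helper, not part of the standard library.
    """
    raw = text.encode('utf-16-le')  # two bytes per code unit, no BOM
    return [int.from_bytes(raw[i:i + 2], 'little')
            for i in range(0, len(raw), 2)]

buffer = ('12345'
          '\N{ALCHEMICAL SYMBOL FOR AIR}'
          '\N{ALCHEMICAL SYMBOL FOR FIRE}'
          '\N{ALCHEMICAL SYMBOL FOR EARTH}'
          '\N{ALCHEMICAL SYMBOL FOR WATER}'
          '...')

units = utf16_code_units(buffer)
utf8 = buffer.encode('utf-8')

print(len(buffer))     # 12 code points, what Python 3.3 indexes by
print(len(units))      # 16 code units, what a narrow build indexes by
print(hex(units[5]))   # 0xd83d, the lone high surrogate seen in 3.2
print(type(utf8).__name__, len(utf8))  # bytes 24
```

Each of the four alchemical symbols lies outside the BMP, so each takes
two UTF-16 code units (a surrogate pair), which accounts for the extra
four. The last line is the other point in passing: encoding to UTF-8 hands
back a bytes object.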