Chris Angelico wrote:

> On Fri, Nov 21, 2014 at 11:32 AM, Steven D'Aprano
> <steve+comp.lang.pyt...@pearwood.info> wrote:
>> (E.g. there are millions of existing files across the world containing
>> text which use legacy encodings that are not compatible with Unicode.)
>
> Not compatible with Unicode? There aren't many character sets out
> there that include characters not in Unicode - that was the whole
> point. Of course, there are plenty of files in unspecified eight-bit
> encodings, so you may have a problem with reliable decoding - but if
> you know what the encoding is, you ought to be able to represent each
> character in Unicode.

What I meant was that for some encodings -- namely ASCII and Latin-1 -- the
ordinals are exactly equivalent to Unicode, that is:

    # Python 3
    for i in range(128):
        assert chr(i).encode('ASCII') == bytes([i])
    for i in range(256):
        assert chr(i).encode('Latin-1') == bytes([i])

That's not quite as significant as I thought, though. What is significant is
that a pure ASCII file on disk can be read by a program assuming UTF-8:

    for i in range(128):
        assert chr(i).encode('UTF-8') == bytes([i])

although the same is not the case for Latin-1 encoded files, since code
points 128-255 each encode to two bytes in UTF-8.

> Not compatible with any of the UTFs, that's different. Plenty of that
> in the world.
>
>> You are certainly correct that in it's full generality, "text" is much
>> more than just a string of code points. Unicode strings is a primitive
>> data type. A powerful and sophisticated text processing application may
>> even find Python strings too primitive, possibly needing something like
>> ropes of graphemes rather than strings of code points.
>
> That's probably more an efficiency point, though. It should be
> possible to do a perfect two-way translation between your grapheme
> rope and a Python string; otherwise, you'll have great difficulty
> saving your file to the disk (which will normally involve representing
> the text in Unicode, then encoding that to bytes).

Well, yes. My point, agreeing with Marko, is that any time you want to do
something even vaguely related to human-readable text, "code points" are not
enough.

For example, if I give a string containing the following two code points in
this order:

    LATIN SMALL LETTER E
    COMBINING CIRCUMFLEX ACCENT

then my application should treat that as a single "character" and display it
as:

    LATIN SMALL LETTER E WITH CIRCUMFLEX

which looks like this: ê

rather than two distinct "characters" eˆ

Now, that specific example is a no-brainer, because the Unicode normalization
routines will handle the conversion. But not every combination of accented
characters has a canonical combined form.

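As a small illustration of the normalization point, here is a minimal sketch
using the standard library's unicodedata.normalize() with form NFC. The
'q' plus circumflex string is an illustrative assumption rather than something
from the thread: it relies on Unicode having no precomposed "q with
circumflex", so NFC should leave it as two code points.

```python
import unicodedata

# Base letter followed by a combining mark: two code points.
decomposed = 'e\N{COMBINING CIRCUMFLEX ACCENT}'

# NFC composes the pair into the single precomposed code point.
composed = unicodedata.normalize('NFC', decomposed)
print(len(decomposed), len(composed))
print(composed == '\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}')

# Assumed example: no precomposed form exists for q + circumflex,
# so normalization cannot reduce it to one code point.
no_precomposed = 'q\N{COMBINING CIRCUMFLEX ACCENT}'
still_two = unicodedata.normalize('NFC', no_precomposed)
print(len(still_two))
```

So normalization helps only when a precomposed code point happens to exist;
otherwise the "character" remains a sequence of code points.
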
What about something like this?

    'w\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING OGONEK}\N{COMBINING CARON}'

If I insert a character into my string, I want to be able to insert before
the w or after the caron, but not in the middle of those four code points.

--
Steven
--
https://mail.python.org/mailman/listinfo/python-list
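
The insertion constraint described above can be sketched roughly as follows.
This is only an approximation under a stated assumption: a position counts as
"safe" when the code point at that index has canonical combining class 0
according to unicodedata.combining(). The helper name safe_insertion_points
is hypothetical, and real grapheme cluster boundaries (Unicode Standard Annex
#29) involve more rules than this, for example joiners and variation
selectors.

```python
import unicodedata

def safe_insertion_points(text):
    """Return indices where inserting would not separate a base
    character from the combining marks that follow it.

    Approximation: treats any code point with a non-zero canonical
    combining class as attached to the preceding base character.
    """
    # The very start and very end of the string are always allowed.
    points = {0, len(text)}
    # Any index whose code point is not a combining mark starts a new unit.
    points.update(
        i for i, ch in enumerate(text) if unicodedata.combining(ch) == 0
    )
    return sorted(points)

s = 'w\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING OGONEK}\N{COMBINING CARON}'
print(safe_insertion_points(s))        # [0, 4]
print(safe_insertion_points('a' + s))  # [0, 1, 5]
```

For the example string, only index 0 (before the w) and index 4 (after the
caron) are reported, which matches the behaviour asked for.
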