On Fri, Nov 21, 2014 at 11:32 AM, Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
> (E.g. there are millions of existing files across the world containing text
> which use legacy encodings that are not compatible with Unicode.)

Not compatible with Unicode? There aren't many character sets out there that include characters not in Unicode - that was the whole point. Of course, there are plenty of files in unspecified eight-bit encodings, so you may have a problem with reliable decoding - but if you know what the encoding is, you ought to be able to represent each character in Unicode.

Not compatible with any of the UTFs, that's different. Plenty of that in the world.

> You are certainly correct that in it's full generality, "text" is much more
> than just a string of code points. Unicode strings is a primitive data
> type. A powerful and sophisticated text processing application may even
> find Python strings too primitive, possibly needing something like ropes of
> graphemes rather than strings of code points.

That's probably more an efficiency point, though. It should be possible to do a perfect two-way translation between your grapheme rope and a Python string; otherwise, you'll have great difficulty saving your file to the disk (which will normally involve representing the text in Unicode, then encoding that to bytes).

To be sure, a Python string is a poor representational form for a text editor. But that's largely because it's immutable, so every little edit would involve massive copying. Depending on what you're doing, it might be worth using a chunked UTF-8 byte stream (allowing for insertion at any chunk boundary), or an array of lines, or something grapheme-based... but all of those questions are performance, not correctness, issues.

> We Western and Northern European speakers -- and I don't know whether Finns
> are counted as Northern Europeans or Eastern Europeans -- are lucky in that
> our natural languages are well-covered by Unicode. All our graphemes are
> also code points, even the "funny ones with accents". As an English
> speaker. I have to remind myself that not every grapheme is a single code
> point, but Devanagari or Navajo writers will never make that mistake.

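
To make the grapheme/code point distinction concrete, here's a small Python 3 sketch using the standard unicodedata module. The sample strings are chosen purely for illustration and don't come from the thread:

```python
import unicodedata

# "é" as a single precomposed code point (U+00E9) ...
composed = "\u00e9"
# ... and as "e" followed by COMBINING ACUTE ACCENT (U+0301).
decomposed = "e\u0301"

print(len(composed), len(decomposed))   # 1 2
print(composed == decomposed)           # False: different code point sequences

# NFC normalization folds the pair into the precomposed form.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# Devanagari NA (U+0928) + VOWEL SIGN I (U+093F) reads as one
# user-perceived character, but there is no precomposed code point,
# so it stays two code points even after NFC.
ni = "\u0928\u093f"
print(len(unicodedata.normalize("NFC", ni)))  # 2
```

So len() on a Python string counts code points, not graphemes; anything grapheme-aware has to be layered on top.
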
I've been working with different languages a bit, lately. Broadly speaking, you have:

1) Languages which use the Roman alphabet, plus a handful of other characters (eg Finnish, German). These can be represented largely in ASCII, and used to be handled fairly easily with a single codepage - an eight-bit ASCII-compatible encoding.

2) Languages which use a different alphabet (eg Cyrillic - Russian, Bulgarian). You could possibly cram them into an eight-bit encoding without tipping ASCII out, but I'm not sure. In Unicode, these languages are all easily supported by the BMP, as they don't use a huge number of characters each.

3) Languages which use a non-alphabetic system (eg Korean). I think they're all still covered by the BMP, but there's no way you can fit them into eight-bit encodings - one single language will use more than 256 symbols.

4) Ancient, esoteric, or symbolic writing systems. Not fundamentally different from the above categories except that they're less used, and the BMP has finite space. These will definitely need the SMP.

But all of them are covered by Unicode. (Sadly, they are NOT all covered by all fonts, so I've been finding that certain pieces of text come out as strings of little boxes. But I can at least manipulate the text, even if I can't read it back.) I can, for example, zip lines of text like this:

English:
Let it go, let it go!
I am one with the wind and sky
Let it go, let it go!
You'll never see me cry!

Icelandic:
Þetta er nóg, þetta er nóg
Uppi í himni eins og vindablær
Þetta er nóg, komið nóg
Og tár mín enginn sér fær

Russian:
Отпусти и забудь,
Этот мир из твоих грёз.
Отпусти и забудь,
И не будет больше слёз.

Output:
Let it go, let it go!
Þetta er nóg, þetta er nóg
Отпусти и забудь,
I am one with the wind and sky
Uppi í himni eins og vindablær
Этот мир из твоих грёз.
Let it go, let it go!
Þetta er nóg, komið nóg
Отпусти и забудь,
You'll never see me cry!
Og tár mín enginn sér fær
И не будет больше слёз.

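
A minimal sketch of that kind of zip in Python 3, using just the first two lines of each version. The variable names and the interleave() helper are illustrative only, not taken from any actual script:

```python
# Each version is a plain list of str; Python 3 strings are Unicode,
# so the three scripts need no special handling.
english = ["Let it go, let it go!", "I am one with the wind and sky"]
icelandic = ["Þetta er nóg, þetta er nóg", "Uppi í himni eins og vindablær"]
russian = ["Отпусти и забудь,", "Этот мир из твоих грёз."]


def interleave(*versions):
    """Yield line 1 of every version, then line 2 of every version, etc."""
    for group in zip(*versions):
        yield from group


print("\n".join(interleave(english, icelandic, russian)))
```

Note that zip() stops at the shortest input, so this assumes every version has the same number of lines. Whether the result displays properly is then down to your terminal encoding and fonts, not to the string handling.
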
In fact, it's trivially easy to write something like this, because all this text is Unicode. ALL of these languages (and plenty more) are "well-covered by Unicode". There's still the ongoing debate of Han unification, plus the progressive work of adding characters for ancient scripts and such, but AFAIK, all writing systems currently in use are covered.

ChrisA