On Thu, Mar 10, 2016 at 6:30 PM, Mark Lawrence <breamore...@yahoo.co.uk> wrote:
>> From what I've seen, a lot of software can't get [Unicode] right anyway.
>>
>
> Are you referring to PEP393 having taken notice of the RUE?

Even with PEP 393, there's no guarantee that a Python program will get Unicode right. The bytes/text split in Python 3 is a huge help, but proper handling of the entire Unicode range implies more than simply being able to represent all characters (although that's a critical prerequisite). There are design considerations with case folding (tip: it's easiest and safest to be case sensitive), collation/sorting (tip: it's impossible to be perfect unless you know which language is involved), text directionality (you probably know that Arabic is written right-to-left, but are you aware that there are also characters with "weak" directionality, distinct from those with "neutral" directionality?) and so on, plus a bunch of relatively straightforward coding considerations (e.g. comparing two strings for equality generally requires NFC/NFD normalization, and might require NFKC/NFKD), which a number of programs still don't get right.

PEP 393 actually isn't very much about correctness; a "wide build" of pre-3.3 Python has the correct behaviour, but is wasteful with memory. By removing the temptation to conserve memory using UTF-16 (a "narrow build"), PEP 393 did improve correctness on Windows, but its main focus is on memory efficiency (and thus performance, thanks to cache locality).

But hey. Just being able to represent all characters is probably about 95% of Unicode correctness. The rest is the little stuff.

ChrisA
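
To make the normalization and case-folding points concrete, here's a minimal sketch using the standard library's unicodedata module. The helper name same_text and the sample strings are made up for illustration; which normalization form is right depends on the application, so treat this as one possible approach rather than the definitive one.

```python
import unicodedata


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same form.

    Hypothetical helper for illustration; 'form' is any form accepted
    by unicodedata.normalize ("NFC", "NFD", "NFKC", "NFKD").
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


composed = "caf\u00e9"      # e-acute as a single code point (U+00E9)
decomposed = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

print(composed == decomposed)           # False: different code point sequences
print(same_text(composed, decomposed))  # True: identical after NFC

# Compatibility normalization additionally folds things like ligatures.
ligature = "\ufb01le"       # starts with U+FB01 LATIN SMALL LIGATURE FI
print(same_text(ligature, "file"))          # False under NFC
print(same_text(ligature, "file", "NFKC"))  # True under NFKC

# Case-insensitive matching needs casefold(), not lower().
print("Straße".lower() == "strasse")     # False
print("Straße".casefold() == "strasse")  # True
```

Note that NFKC/NFKD throw away distinctions that sometimes matter (the ligature really is a different character), which is why the compatibility forms are a "might require" rather than a default.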