On Sun, Jun 7, 2015 at 9:42 PM, Steven D'Aprano <st...@pearwood.info> wrote:
> My opinion is that a programming language like Python or ECMAScript should
> operate on *code points*. If we want to call them "characters" informally,
> that should be allowed, but whenever there is ambiguity we should remember
> we're dealing with code points. The implementation shouldn't matter:
> compliant Python interpreters might choose to use UTF-8 internally, or
> UTF-16, or UTF-32, or something else, and still agree on how many
> characters a string contains. Normalisation is still an issue, of course,
> but any decent Unicode implementation will include a way to normalise or
> denormalise strings.

If by "normalise" you mean the NF[K]C/NF[K]D composition and
decomposition, then yes, any decent Unicode library will provide that.
I'm not sure it's critical to string handling itself, though; and
Python defers the operation to the unicodedata module:

>>> import unicodedata
>>> s1 = "\N{LATIN SMALL LETTER A}\N{COMBINING ACUTE ACCENT}"
>>> s2 = "\N{LATIN SMALL LETTER A WITH ACUTE}"
>>> s1 == s2
False
>>> unicodedata.normalize("NFC", s1) == s2
True

It's a useful operation to be able to do, but I would never expect
that *string comparison* or other operations should automatically
normalize. (Unless you want to say that all strings are guaranteed to
be NFC/NFD normalized, such that s1 and s2 would actually be
identical, which I suppose is plausible. I'm not sure what the
advantage would be, though. And certainly you wouldn't want to
K-normalize strings automatically.)

> The question of graphemes (what "ordinary people" consider letters and
> characters, e.g. "ch" is two letters to an English speaker but one letter
> to a Czech speaker) should be left to libraries. It's a much harder problem
> to solve in the full general case, requires localisation, and is overkill
> for many string-processing tasks.

Yeah. The basic challenge to a beginning programmer, "reverse this
string", becomes rather tricky in the presence of natural language.

>>> s1 += "e"
>>> s1
'áe'
>>> s1[::-1]
'éa'

Oops. But hey. It's easier to understand what went wrong here than,
say, if you reverse the bytes in a UTF-8 stream. Or the code units in
a UTF-16 stream. If you're lucky, those would give you instant
errors... if you're not, well, who knows.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
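
For the string-reversal problem above, here is a rough sketch of a reversal that keeps combining marks attached to their base characters. The function name reverse_keeping_marks is made up for illustration. It assumes that grouping on a non-zero unicodedata.combining() value is good enough, which is not full grapheme-cluster segmentation: joiners, Hangul jamo, emoji sequences and the like would still need a proper library.

```python
import unicodedata

def reverse_keeping_marks(s):
    """Reverse s, keeping each combining mark after its base character.

    Hypothetical helper, not a full grapheme-cluster implementation.
    """
    clusters = []
    for ch in s:
        # combining() returns the canonical combining class, 0 for non-marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch   # attach the mark to the preceding base
        else:
            clusters.append(ch)  # start a new cluster
    return "".join(reversed(clusters))

s1 = "\N{LATIN SMALL LETTER A}\N{COMBINING ACUTE ACCENT}" + "e"
# The accent stays on the "a" instead of migrating to the "e".
print(reverse_keeping_marks(s1) == "e" + "a\N{COMBINING ACUTE ACCENT}")
```

Normalizing to NFC before reversing would also handle this particular string, but only because a precomposed "a with acute" happens to exist; grouping the marks does not depend on that.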