On Sun, 7 Jun 2015 10:08 pm, Chris Angelico wrote:

> On Sun, Jun 7, 2015 at 9:42 PM, Steven D'Aprano <st...@pearwood.info>
> wrote:
>> My opinion is that a programming language like Python or ECMAScript
>> should operate on *code points*. If we want to call them "characters"
>> informally, that should be allowed, but whenever there is ambiguity we
>> should remember we're dealing with code points. The implementation
>> shouldn't matter: compliant Python interpreters might choose to use UTF-8
>> internally, or UTF-16, or UTF-32, or something else, and still agree on
>> how many characters a string contains. Normalisation is still an issue,
>> of course, but any decent Unicode implementation will include a way to
>> normalise or denormalise strings.
>
> If by "normalise" you mean the NF[K]C/NF[K]D composition and
> decomposition, then yes, any decent Unicode library will provide that.

Dat's der bunny!

> I'm not sure it's critical to string handling itself, though; and
> Python defers the operation to the unicodedata module:
>
> >>> s1 = "\N{LATIN SMALL LETTER A}\N{COMBINING ACUTE ACCENT}"
> >>> s2 = "\N{LATIN SMALL LETTER A WITH ACUTE}"
> >>> s1 == s2
> False
> >>> unicodedata.normalize("NFC", s1) == s2
> True
>
> It's a useful operation to be able to do, but I would never expect
> that *string comparison* or other operations should automatically
> normalize.

I completely agree. It might be convenient to have a string equality
method that did normalisation, but for most cases it would be
unnecessary and slow. I think that's the sort of thing which should be
left to a subclass of str, and it should normalise on construction.

> (Unless you want to say that all strings are guaranteed to
> be NFC/NFD normalized, such that s1 and s2 would actually be
> identical, which I suppose is plausible. I'm not sure what the
> advantage would be, though. And certainly you wouldn't want to
> K-normalize strings automatically.)

I believe that filenames on Apple file systems (HFS+ if I remember
correctly) are guaranteed to be both normalised and correctly encoded
as UTF-8. If you could live in a purely Apple world, you'd have far
fewer filename hassles.

-- 
Steven

-- 
https://mail.python.org/mailman/listinfo/python-list
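
P.S. To make the "subclass of str that normalises on construction" idea
concrete, here is a minimal sketch. The class name NFCStr is made up for
illustration and is not part of the standard library; the choice of NFC
(rather than NFD) is also just an assumption. Only unicodedata.normalize
is a real stdlib call.

```python
import unicodedata


class NFCStr(str):
    """Hypothetical str subclass that normalises to NFC on construction."""

    def __new__(cls, value=""):
        # Normalise once, up front, so ordinary == comparisons between
        # two NFCStr instances behave as canonical-equivalence checks.
        return super().__new__(cls, unicodedata.normalize("NFC", str(value)))


s1 = NFCStr("\N{LATIN SMALL LETTER A}\N{COMBINING ACUTE ACCENT}")
s2 = NFCStr("\N{LATIN SMALL LETTER A WITH ACUTE}")
print(s1 == s2)
print(len(s1), len(s2))
```

One caveat with this approach: str methods and operators such as + hand
back plain str objects, not NFCStr, and joining two NFC strings does not
necessarily give an NFC result, so anything built from these strings
would need to be passed through the constructor again.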