On Sun, Mar 20, 2016 at 1:56 AM, Marko Rauhamaa <ma...@pacujo.net> wrote:
> Steven D'Aprano <st...@pearwood.info>:
>
>> On Sat, 19 Mar 2016 11:42 pm, Marko Rauhamaa wrote:
>>> When glorifying Python's advanced Unicode capabilities, are we
>>> careful to emphasize the necessity of unicodedata.normalize()
>>> everywhere? Should Python normalize strings unconditionally and
>>> transparently? What does the O(1) character lookup mean under
>>> normalization?
>>>
>>> Some weeks ago I had to spend 30 minutes to debug my Python program
>>> when a user complained it didn't work. Turns out they had
>>> accidentally invoked the program using a space and a composing tilde
>>> instead of the ASCII ~. There was no visual indication of a problem
>>> on the screen, but the Python program acted up.
>>
>> We recently had somebody here who wrote capital I by pressing the
>> lower case l on the keyboard. Should a pure-ASCII program be able to
>> operate without malfunction if the user confuses 0 and O, or I l and
>> 1? What about ' and ` or possibly even '' and "?
>
> What I'm talking about is that maybe Python should treat canonically
> equivalent strings equivalently, that is, indistinguishably under any
> external inspection.
>
> Anyway, Python's Unicode support is great thing, but Unicode is a big
> can of worms. Far from being a paradise, it's more of a case of picking
> your poison.

I don't believe they should be *automatically* equivalent. A Unicode
string is not a 2D collection of pixels, so it shouldn't be compared
for equality visually; nor should the language automatically apply
other transformations. The exact form of equivalence you want is the
application's choice, and there it should remain. You would be
absolutely *horrified* if Python started stripping leading/trailing
spaces from strings before comparing them, yet I have no doubt that
you've written programs that did exactly this. (And PHP does indeed do
transformations like this, unless you use the === operator. You're out
of luck if you want to use <= or >= to order strings.)

Some applications will benefit from NFC normalization; others from
NFKC. Keep it in the application's hands, keep the language simple,
and give the power to the programmer.

Note, by the way, that the language itself does some normalization on
identifiers:

>>> exec("a\u0301 = 1234; print(\u00e1)")
1234

But programmer-controlled strings are, well, programmer-controlled.

ChrisA
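
To make the NFC/NFKC point concrete, here is a minimal sketch using the
standard unicodedata module. The helper name equivalent() is made up
for illustration and is not something from this thread; the point is
only that the application picks the normalization form, and the answer
changes with that choice:

```python
import unicodedata


def equivalent(a, b, form="NFC"):
    """Compare two strings after normalizing both to the given form.

    Hypothetical helper: the caller, not the language, decides
    which normalization form (if any) defines "equal".
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


composed = "\u00e1"     # LATIN SMALL LETTER A WITH ACUTE, one code point
decomposed = "a\u0301"  # 'a' followed by COMBINING ACUTE ACCENT

print(composed == decomposed)            # False: plain == compares code points
print(equivalent(composed, decomposed))  # True: canonically equivalent under NFC

ligature = "\ufb01"     # LATIN SMALL LIGATURE FI

print(equivalent(ligature, "fi", "NFC"))   # False: only a compatibility equivalence
print(equivalent(ligature, "fi", "NFKC"))  # True: NFKC folds the ligature
```

An application that cares about canonical equivalence would normalize
to NFC (or NFD) at its input boundary; one that also wants to fold
compatibility characters such as ligatures would choose NFKC instead.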