On Sat, Jan 24, 2015 at 6:14 AM, Marko Rauhamaa <ma...@pacujo.net> wrote:
> Well, if Python can't, then who can? Probably nobody in the world, not
> generically, anyway.
>
> Example:
>
> >>> print("re\u0301sume\u0301")
> résumé
> >>> print("r\u00e9sum\u00e9")
> résumé
> >>> print("re\u0301sume\u0301" == "r\u00e9sum\u00e9")
> False
> >>> print("\ufb01nd")
> find
> >>> print("find")
> find
> >>> print("\ufb01nd" == "find")
> False
>
> If equality can't be determined, words really can't be sorted.

Ah, that's a bit easier to deal with. Just use Unicode normalization.

>>> import unicodedata
>>> print(unicodedata.normalize("NFC", "re\u0301sume\u0301") ==
...       unicodedata.normalize("NFC", "r\u00e9sum\u00e9"))
True

It's a bit verbose, but if you're doing a lot of comparisons, you
probably want to make a key-function that folds together everything
that you want to be treated the same way, for instance:

def key(s):
    """Normalize a Unicode string for comparison purposes.

    Composes, case-folds, and trims leading and trailing whitespace.
    """
    return unicodedata.normalize("NFC", s).strip().casefold()

Then it's much tidier:

>>> print(key("re\u0301sume\u0301") == key("r\u00e9sum\u00e9"))
True
>>> print(key("\ufb01nd") == key("find"))
True

(The second one matches because casefold() expands the U+FB01 ligature
to "fi"; NFC on its own leaves compatibility characters like that
untouched.)

You may want to go further, too; for search comparisons, you'll want
to use NFKC normalization, and probably translate all strings of
Unicode whitespace into single U+0020s, or completely strip out
zero-width non-breaking spaces (and maybe zero-width breaking spaces,
too), etc, etc. It all depends on what you mean by "equality". But
certainly a basic NFC or NFD normalization is safe for general work.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
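
For what it's worth, the "go further" variant described above might look
roughly like this. It's only a sketch: search_key and _ZERO_WIDTH are
names made up for illustration, the choice of which zero-width characters
to drop (U+FEFF and U+200B) is an assumption, and re-normalizing after
case-folding is an approximation of Unicode's caseless matching, not the
full algorithm.

```python
import unicodedata

# Assumption: these are the zero-width characters worth discarding.
# U+FEFF is ZERO WIDTH NO-BREAK SPACE, U+200B is ZERO WIDTH SPACE.
# Mapping a code point to None makes str.translate() delete it.
_ZERO_WIDTH = dict.fromkeys(map(ord, "\ufeff\u200b"))


def search_key(s):
    """Hypothetical looser key for search-style comparisons.

    Drops zero-width characters, applies NFKC, collapses runs of
    Unicode whitespace into single U+0020s, and case-folds.
    """
    s = s.translate(_ZERO_WIDTH)
    s = unicodedata.normalize("NFKC", s)
    # str.split() with no arguments splits on runs of Unicode
    # whitespace and discards leading/trailing whitespace.
    s = " ".join(s.split())
    # Case folding can produce non-normalized text, so normalize again.
    return unicodedata.normalize("NFKC", s.casefold())
```

With that, search_key("\ufb01nd") and search_key("  FIND\u00a0") both come
out as "find". Whether that is the "equality" you want is still up to you;
NFKC throws away distinctions (ligatures, width variants, superscripts)
that some applications need to keep.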