Rhodri James <rho...@kynesim.co.uk>:

> On 14/07/17 15:14, Marko Rauhamaa wrote:
>> I'd like to understand this better. Maybe you have a couple of
>> examples to share?
>
> Sure.
>
> What I've mostly been looking at recently has been the Expat XML parser.
> XML chooses to deal with one of your problems by defining that it's not
> having anything to do with combining, sequences of codepoints are all
> you need to worry about when comparing strings. U+00E8 (LATIN SMALL
> LETTER E WITH GRAVE) is not the same as U+0065 (LATIN SMALL LETTER E)
> followed by U+0300 (COMBINING GRAVE ACCENT) for example.
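
To make the quoted example concrete, here is a minimal Python sketch
using only the standard unicodedata module. The variable names are
just for illustration. It shows that the two spellings differ when
compared code point by code point, and that they only compare equal
once both have been normalized to NFC:

```python
import unicodedata

precomposed = "\u00e8"   # U+00E8 LATIN SMALL LETTER E WITH GRAVE
decomposed = "e\u0300"   # U+0065 followed by U+0300 COMBINING GRAVE ACCENT

# Plain string comparison works on code points, so these differ.
print(precomposed == decomposed)            # False
print(len(precomposed), len(decomposed))    # 1 2

# Normalizing both sides to NFC composes "e" + U+0300 into U+00E8.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)                       # True
```

An XML processor that compares names this way would treat the two
spellings as different unless the content was normalized before it
was stored or exchanged.
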

Very interesting. The relevant W3C spec confirms what you said:

   5. Test the resulting sequences of code points bit-by-bit for
      identity.

   [...]

   This document therefore recommends, when possible, that all content
   be stored and exchanged in Unicode Normalization Form C (NFC).

   <URL: https://www.w3.org/TR/charmod-norm/>


Marko