Re: Grapheme clusters, a.k.a. real characters

Marko Rauhamaa Fri, 14 Jul 2017 12:08:30 -0700

Rhodri James <[email protected]>:

> On 14/07/17 15:14, Marko Rauhamaa wrote:
>> I'd like to understand this better. Maybe you have a couple of
>> examples to share?
>
> Sure.
>
> What I've mostly been looking at recently has been the Expat XML parser.
> XML chooses to deal with one of your problems by defining that it's not
> having anything to do with combining; sequences of codepoints are all
> you need to worry about when comparing strings.  U+00E8 (LATIN SMALL
> LETTER E WITH GRAVE) is not the same as U+0065 (LATIN SMALL LETTER E)
> followed by U+0300 (COMBINING GRAVE ACCENT) for example.


Very interesting. The relevant W3C spec confirms what you said:

  5. Test the resulting sequences of code points bit-by-bit for identity.

  [...]

  This document therefore recommends, when possible, that all content be
  stored and exchanged in Unicode Normalization Form C (NFC).

  <URL: https://www.w3.org/TR/charmod-norm/>
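
To make the distinction concrete, here is a minimal Python sketch of the
two comparisons. The helper names are my own, not anything from the spec
or from Expat. Python's str equality compares code point sequences, so it
stands in for the spec's identity test, and unicodedata.normalize() gives
the NFC form the document recommends:

```python
import unicodedata

# The two spellings from the example above.
precomposed = "\u00e8"        # LATIN SMALL LETTER E WITH GRAVE
decomposed = "\u0065\u0300"   # LATIN SMALL LETTER E + COMBINING GRAVE ACCENT


def same_code_points(a, b):
    """Hypothetical helper: plain code point comparison, no normalization."""
    return a == b


def same_after_nfc(a, b):
    """Hypothetical helper: compare after converting both strings to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(len(precomposed), len(decomposed))           # 1 2
print(same_code_points(precomposed, decomposed))   # False
print(same_after_nfc(precomposed, decomposed))     # True
```

So the two strings only match if somebody normalizes them before the
comparison, which is presumably why the spec asks for content to be
stored and exchanged in NFC in the first place.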


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list
