On Sun, 13 Oct 2019 17:13:28 -0700 Asmus Freytag via Unicode <unicode@unicode.org> wrote:
> On 10/13/2019 2:54 PM, Richard Wordingham via Unicode wrote:
> > Besides invalidating complexity metrics, the issue was what \p{Lu}
> > should match.  For example, with PCRE syntax, GNU grep Version 2.25
> > \p{Lu} matches U+0100 but not <A, U+0300>.  When I'm respecting
> > canonical equivalence, I want both to match [:Lu:], and that's what I
> > do.  [:Lu:] can then match a sequence of up to 4 NFD characters.
>
> Formally, wouldn't that be rewriting \p{Lu} to match \p{Lu}\p{Mn}*;
> instead of formally handling NFD, you could extend the syntax to
> handle "inherited" properties across combining sequences.
>
> Am I missing anything?

Yes.  There is no precomposed LATIN CAPITAL LETTER M WITH CIRCUMFLEX, so
[:Lu:] should not match <U+004D LATIN CAPITAL LETTER M, U+0302 COMBINING
CIRCUMFLEX ACCENT>.  Now, I could invent a string property \p{xLu} that
meant (?:\p{Lu}\p{Mn}*).

I don't entirely understand what you said; you may have missed the
distinction between "[:Lu:] can then match" and "[:Lu:] will then
match".  I think only Greek letters expand to 4 characters in NFD.

When I'm respecting canonical equivalence/working with traces, I want
[:insc=vowel_dependent:][:insc=tone_mark:] to match both <U+0E39 THAI
CHARACTER SARA UU, U+0E49 THAI CHARACTER MAI THO> and its canonical
equivalent <U+0E49, U+0E39>.  The canonical closure of that sequence can
be messy even within scripts.  Some pairs commute; others don't, usually
for good reasons.

Regards,

Richard.
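
A minimal sketch of the distinction drawn above, in Python using only the
standard unicodedata module. It is an illustration under stated
assumptions, not the matcher discussed in this thread. The function names
are hypothetical. The first function assumes that "respecting canonical
equivalence" means accepting exactly those strings that are canonically
equivalent to a single code point of category Lu, and it approximates
that test by composing to NFC (so characters subject to composition
exclusion are not handled). The second models the proposed rewrite
\p{Lu}\p{Mn}* applied to the NFD form.

```python
import unicodedata


def is_canonically_lu(s: str) -> bool:
    # Assumption: s qualifies only if it is canonically equivalent to ONE
    # code point whose General_Category is Lu. Composing to NFC finds that
    # code point when it exists (composition exclusions aside).
    nfc = unicodedata.normalize("NFC", s)
    return len(nfc) == 1 and unicodedata.category(nfc) == "Lu"


def is_lu_then_marks(s: str) -> bool:
    # Models the rewrite \p{Lu}\p{Mn}* over the NFD form: one uppercase
    # letter followed by any number of nonspacing marks.
    nfd = unicodedata.normalize("NFD", s)
    if not nfd or unicodedata.category(nfd[0]) != "Lu":
        return False
    return all(unicodedata.category(c) == "Mn" for c in nfd[1:])


def canonically_equivalent(a: str, b: str) -> bool:
    # Two strings are canonically equivalent when their NFD forms are equal.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


if __name__ == "__main__":
    samples = {
        "U+0100": "\u0100",
        "<A, U+0300>": "A\u0300",
        "<M, U+0302>": "M\u0302",
    }
    for label, text in samples.items():
        print(label, is_canonically_lu(text), is_lu_then_marks(text))

    # The Thai pair from the message: vowel + tone mark in either order.
    print(canonically_equivalent("\u0e39\u0e49", "\u0e49\u0e39"))
```

Under these assumptions both functions accept U+0100 and <A, U+0300>, but
only the \p{Lu}\p{Mn}* model accepts <M, U+0302>, because no precomposed
character exists for that sequence. The two orders of the Thai sequence
compare as canonically equivalent because the marks have different
non-zero combining classes and so reorder under normalization.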