On Mon, 14 Oct 2019 10:05:49 +0300 Eli Zaretskii via Unicode <unicode@unicode.org> wrote:
> > Date: Mon, 14 Oct 2019 01:10:45 +0100
> > From: Richard Wordingham via Unicode <unicode@unicode.org>
> >
> > They hadn't given any thought to [\p{L}&&\p{isNFD}]\p{gcb=extend}*,
> > and were expecting normalisation (even to NFC) to be a possible
> > cure. They had begun to realise that converting expressions to
> > match all or none of a set of canonical equivalents was hard; the
> > issue of non-contiguous matches wasn't mentioned.

> I think these are two separate issues: whether search should normalize
> (a.k.a. performs character folding) should be a user option. You are
> talking only about canonical equivalence, but there's also
> compatibility decomposition, so, for example, searching for "1" should
> perhaps match ¹ and ①.

HERETIC! The official position is that text that is canonically equivalent is the same. There are problem areas where traditional modes of expression require that canonically equivalent text be treated differently. For these, it is useful to have tools that treat them differently. However, the normal presumption should be that canonically equivalent text is the same.

The party line seems to be that most searching should actually be done using a 'collation', which brings with it different levels of 'folding'. In multilingual use, a collation used for searching should be quite different to one used for sorting.

Now, there is a case for being able to switch off normalisation and canonical equivalence generally, e.g. when dealing with ISO 10646 text instead of Unicode text. This of course still leaves the question of what character classes defined by Unicode properties then mean.

If one converts the regular expression so that what it matches is closed under canonical equivalence, then visibly normalising the searched text becomes irrelevant. For working with Unicode traces, I actually do both: I convert the text to NFD but report matches in terms of the original code point sequence. Working this way simplifies the conversion of the regular expression, which I do as part of its compilation.

For traces, it seems only natural to treat precomposed characters as syntactic sugar for their NFD decompositions. (They have no place in the formal theory of traces.) However, I go further and convert the decomposed text fully to NFD. (Recall that conversion to NFD can change the stored order of combining marks.) One of the simplifications I get is that straight runs of text in the regular expression then match in the middle just by converting that run and the searched strings to NFD.

For the concatenation of expressions A and B, once I am looking at the possible interleavings of two traces, I am dealing with NFA states of the form states(A) × {1..254} × states(B), so that for an element (a, n, b), a corresponds to starts of words with a match in A, b corresponds to starts of _words_ with a match in B, and n is the ccc of the last character used to advance to b. The element n blocks non-starters that can't belong to a word matching A. If I didn't (internally) convert the searched text to NFD, the element n would have to be a set of blocked canonical combining classes, changing the number of possible values from 54 to 2^54 - 1.

While aficionados of regular languages may object that converting the searched text to NFD is cheating, there is a theorem that if I have a finite automaton that recognises a family of NFD strings, then there is another finite automaton that will recognise all their canonical equivalents.

Richard.
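
P.S. For concreteness, here is a rough sketch in Python (using the standard unicodedata module; this is not my actual implementation, and the function name nfd_with_map is purely illustrative) of the bookkeeping described above: convert the searched text to NFD, including the canonical reordering of combining marks, while keeping an index map so that a match found against the NFD form can be reported in terms of the original code point sequence.

import unicodedata

def nfd_with_map(text):
    """Return (nfd_text, index_map), where index_map[i] is the index of the
    original code point whose canonical decomposition produced nfd_text[i]."""
    # Canonically decompose each code point, remembering where it came from.
    pairs = []  # (decomposed character, index of original code point)
    for i, ch in enumerate(text):
        for d in unicodedata.normalize('NFD', ch):
            pairs.append((d, i))
    # Canonical Ordering Algorithm: exchange adjacent non-starters that are
    # out of ccc order (stable, so equal classes keep their relative order),
    # carrying the origin indices along with the characters.
    ccc = unicodedata.combining
    for end in range(1, len(pairs)):
        j = end
        while (j > 0 and ccc(pairs[j][0]) != 0
               and ccc(pairs[j - 1][0]) > ccc(pairs[j][0])):
            pairs[j - 1], pairs[j] = pairs[j], pairs[j - 1]
            j -= 1
    return ''.join(c for c, _ in pairs), [i for _, i in pairs]

# Example: the cedilla (ccc 202) is reordered before the acute (ccc 230),
# and index_map records which original code point each NFD character
# came from.
nfd_text, index_map = nfd_with_map("e\u0301\u0327q")
assert nfd_text == "e\u0327\u0301q"
assert index_map == [0, 2, 1, 3]

Decomposing per code point and then reordering, rather than calling normalize() on the whole string, is what lets each character of the NFD form keep a pointer back to the precomposed character it came from.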