Andy Heninger <ahenin...@google.com> wrote:

>>> I actually had to do this because I have a dataset that has things like
>>> "undeaðlich" and "smørrebrød", and I wanted to allow the user to
>>> head-match with "undead" and "smor", respectively. There is no
>>> decomposition of "ð" that includes "d", nor any of "ø" that includes "o".
>>> But the UCA primary strengths are the same. It worked very well.
>>>
>>> It's a very useful feature, and I'm glad that tr18 includes mention of it.
>>> I just wish we could get it into our regex engines so I didn't have to
>>> do it all by hand. :)
> On Fri, Apr 15, 2011 at 8:01 AM, Mark Davis ☕ <m...@macchiato.com> wrote:
>> The biggest issue is that any transformation that changes the number of
>> characters, or rearranges them, is problematic, for the reasons outlined
>> in the PRI.
>>
>> An example might be /(a|b|c*(?=...)|...)(d|...|a)/, which for Danish
>> (under a collation transform, strength 2) should match any of {aa, aA,
>> ... å, Å, Å, ...}, as should /(a|b|c*(?=...)|...)(d|...|\x{308})/
>>
>> What *is* relatively straightforward to do is to construct a regex
>> targeted at a known transformation (like NFC), and then transform the
>> input text. There will be some difficulties in mapping between indexes
>> for grouping, however. Most regex engines can't handle discontiguous
>> groups in their API.

> I suspect a match where the fundamental atomic unit of matching was
> grapheme clusters, or combining sequences, would produce useful results.
> No discontinuous results. Results independent of normalization form, or
> lack of normalization, of the input. No ability of the match to look
> inside of, or partially match, combining sequences.
>
> I also think that we should avoid making recommendations that haven't
> been implemented and proved to be useful and practical.

I agree we should look at existing practice to see what people have come up
with, and what does, and what does not, work.

Last night I wrote, and then deleted, a great deal of text about this and
the solutions that I had found useful in practice. I decided it was too long
and threw it all away. All I really ended up saying is that UCA matches at
collation strengths 1 and 2 had proven useful for me.

There are two issues. One relates to decomposition, the other to UCA
comparisons. Consider two situations. The user will think in graphemes, so I
will, too. By grapheme, I mean a user-perceived "character".

* Case one has the user wanting to match any grapheme starting with an "a".
* Case two has the user wanting to match any grapheme starting with an "a"
  but which also has a circumflex.

The first case appears to be reasonably easy. The second probably does not.
But I believe both are harder than they look.

The obvious thing to do for case one, and the thing we've likely all done,
is to use canonical decomposition. That is "safe" because the number of code
points never changes when you take the NFD of a string.

    NFD($string) =~ /(?=a)\X/

or, for embedding, /(?:(?=a)\X)/. That's a "loose" match of an "a" that
works no matter whether the input is in NFC or NFD or something else. In
fact, with case folding (for case-insensitive matching) it even works for
ANGSTROM SIGN, because that has an NFD that turns into a regular "A". If you
pre-NFD the string, the matching engine doesn't have to account for
NFD-matching.

This breaks down in case two, though. And even setting that aside, it is not
as easy as it looks, because there are many user-perceived characters it
does not work with. Some of these do work with a compatibility
decomposition, although others do not. Even when NFKD "works", you now have
the problem of one grapheme mapping to multiple graphemes. Consider:

    ẚ 1E9A GC=Ll LATIN SMALL LETTER A WITH RIGHT HALF RING

The NFD of that is the same, because there is no combining half ring.
Instead there is a modifier letter, which is a separate grapheme. Its
compatibility decomposition, however, is two graphemes:

    LATIN SMALL LETTER A
    MODIFIER LETTER RIGHT HALF RING

Now, you cannot blame the user for not knowing whether Unicode happens to
have an NFD for that which works, versus needing an NFKD. What do you do
about that "a" match again? If you pre-NFKD it, things don't work at all.
Look what happens with the same pattern:

    NFKD($string) =~ /(?=a)\X/

Now you match only LATIN SMALL LETTER A, leaving MODIFIER LETTER RIGHT HALF
RING left over and unmatched. So you would have to do the decomposition,
even and especially an NFKD decomposition, in the matching engine itself,
not beforehand.
That's because you need to be able to group as one logical unit anything
that is produced by the decomposition.

NFKD decomposition does allow you to match these:

    ａ FF41 GC=Ll FULLWIDTH LATIN SMALL LETTER A
    ª 00AA GC=Ll FEMININE ORDINAL INDICATOR

And even stuff like this:

    ㏂ 33C2 GC=So SQUARE AM

Although again this makes you wonder what the whole match would be. If you
match a "user-visible character" that starts with "a", shouldn't you get the
rest of it, too?

But there is still stuff that the user will perceive but which even NFKD
won't do for you. That's stuff like these:

    æ 00E6 GC=Ll LATIN SMALL LETTER AE
    ꜳ A733 GC=Ll LATIN SMALL LETTER AA
    ꜵ A735 GC=Ll LATIN SMALL LETTER AO
    ꜷ A737 GC=Ll LATIN SMALL LETTER AU

The user probably wants to be able to have those count as "a" code points.
We have no decomposition that will get you there. This is unlike code points
like these, which all decompose to something with two letters:

    ĳ 0133 GC=Ll LATIN SMALL LIGATURE IJ
    ŉ 0149 GC=Ll LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
    Ǳ 01F1 GC=Lu LATIN CAPITAL LETTER DZ
    ǲ 01F2 GC=Lt LATIN CAPITAL LETTER D WITH SMALL LETTER Z
    ǳ 01F3 GC=Ll LATIN SMALL LETTER DZ

Which reminds me, this one is different:

    ʣ 02A3 GC=Ll LATIN SMALL LETTER DZ DIGRAPH

But it seems *very* unlikely that there should be a user-perceived
difference between 01F3, which decomposes to "dz", and 02A3, which does not.
Only when you use the UCA for matching does this get sorted out. For
example, these all produce the same UCA2 values as "dz" produces:

    Ǳ 01F1 GC=Lu LATIN CAPITAL LETTER DZ
    ǲ 01F2 GC=Lt LATIN CAPITAL LETTER D WITH SMALL LETTER Z
    ǳ 01F3 GC=Ll LATIN SMALL LETTER DZ
    ʣ 02A3 GC=Ll LATIN SMALL LETTER DZ DIGRAPH

and you cannot get there using NFKD.
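The decomposition behavior above is easy to verify for yourself. Here is a
short sketch, in Python with the stdlib unicodedata module rather than the
Perl used elsewhere in this note, purely for illustration:

```python
import unicodedata as ud

# U+1E9A has only a *compatibility* decomposition, and it splits the one
# grapheme into two: "a" plus U+02BE MODIFIER LETTER RIGHT HALF RING.
assert ud.normalize("NFD", "\u1E9A") == "\u1E9A"      # NFD leaves it alone
assert ud.normalize("NFKD", "\u1E9A") == "a\u02BE"    # NFKD splits it apart

# NFKD does fold fullwidth and ordinal forms down to a plain "a" ...
assert ud.normalize("NFKD", "\uFF41") == "a"  # FULLWIDTH LATIN SMALL LETTER A
assert ud.normalize("NFKD", "\u00AA") == "a"  # FEMININE ORDINAL INDICATOR

# ... and U+01F3 (dz) decomposes to "dz", but U+02A3 (dz digraph) and
# U+00E6 (ae) have no decomposition at all, so NFKD can't reach them.
assert ud.normalize("NFKD", "\u01F3") == "dz"
assert ud.normalize("NFKD", "\u02A3") == "\u02A3"
assert ud.normalize("NFKD", "\u00E6") == "\u00E6"
```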
At UCA1, you of course get all these producing the same sort keys as "dz"
produces:

    Ǆ 01C4 GC=Lu LATIN CAPITAL LETTER DZ WITH CARON
    ǅ 01C5 GC=Lt LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
    ǆ 01C6 GC=Ll LATIN SMALL LETTER DZ WITH CARON
    Ǳ 01F1 GC=Lu LATIN CAPITAL LETTER DZ
    ǲ 01F2 GC=Lt LATIN CAPITAL LETTER D WITH SMALL LETTER Z
    ǳ 01F3 GC=Ll LATIN SMALL LETTER DZ
    ʣ 02A3 GC=Ll LATIN SMALL LETTER DZ DIGRAPH

If the user wants to match "dz", any of those should match, even the last,
which you can't get at with NFKD, just with UCA1 and UCA2.

Which brings us around to case two, where we'll encounter ordering problems.

* Case two has the user wanting to match any grapheme starting with an "a"
  but which also has a circumflex.

You are going to have to use grapheme mode again, of course, but there are
troubles. You can't just say:

    NFD($string) =~ /(?=a\N{COMBINING CIRCUMFLEX ACCENT})\X/

or, more readably:

    NFD($string) =~ m{
        (?= a \N{COMBINING CIRCUMFLEX ACCENT} )
        \X
    }x

because although that will work for super simple cases like

    LATIN SMALL LETTER A WITH CIRCUMFLEX

    LATIN SMALL LETTER A
    COMBINING CIRCUMFLEX ACCENT

it doesn't work when we get something whose combining class causes an
interposition between the "a" and the circumflex. We're safe with this:

    ẫ 1EAB GC=Ll LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE

because that "fortunately" decomposes into

    LATIN SMALL LETTER A
    COMBINING CIRCUMFLEX ACCENT
    COMBINING TILDE

but what if you were looking for an "a" with a tilde? Also, there are
"infinitely" many combinations like this grapheme:

    LATIN SMALL LETTER A WITH TILDE
    COMBINING CIRCUMFLEX ACCENT

which decomposes into

    LATIN SMALL LETTER A
    COMBINING TILDE
    COMBINING CIRCUMFLEX ACCENT

And now we have a problem with our pattern. (Because COMBINING TILDE and
COMBINING CIRCUMFLEX ACCENT are of the same combining class, their ordering
matters: e.g., "a\x{303}\x{302}" is "ã̂" but "a\x{302}\x{303}" is "ẫ", which
should look different. They are not canonically equivalent.)
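That parenthetical point, that marks of the same combining class are never
reordered by normalization and so the two spellings are not canonically
equivalent, can be checked directly. An illustrative sketch, again in Python
with only the stdlib unicodedata module:

```python
import unicodedata as ud

tilde, circ = "\u0303", "\u0302"  # both have canonical combining class 230
assert ud.combining(tilde) == 230 and ud.combining(circ) == 230

# Because the classes are equal, normalization must NOT reorder them, so
# the two spellings stay distinct: they are not canonically equivalent.
a_tc = "a" + tilde + circ   # ã̂
a_ct = "a" + circ + tilde   # ẫ
assert ud.normalize("NFD", a_tc) == a_tc   # order preserved
assert ud.normalize("NFD", a_ct) == a_ct   # order preserved
assert ud.normalize("NFC", a_tc) != ud.normalize("NFC", a_ct)
```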
In fact, this can happen even if you start with a single code point, like

    ậ 1EAD GC=Ll LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW

which will canonically decompose into

    LATIN SMALL LETTER A
    COMBINING DOT BELOW
    COMBINING CIRCUMFLEX ACCENT

because the dot below, with its lower combining class, sorts before the
circumflex. So for all those situations, you now need something rather
fancier:

    NFD($string) =~ m{
        (?= a
            \p{Grapheme_Extend} *
            \N{COMBINING CIRCUMFLEX ACCENT}
        )
        \X
    }x

Yet even that isn't good enough. Consider that all of these have the same
UCA1 as "ae" has:

    Æ 00C6 GC=Lu LATIN CAPITAL LETTER AE
    æ 00E6 GC=Ll LATIN SMALL LETTER AE
    Ǣ 01E2 GC=Lu LATIN CAPITAL LETTER AE WITH MACRON
    ǣ 01E3 GC=Ll LATIN SMALL LETTER AE WITH MACRON
    Ǽ 01FC GC=Lu LATIN CAPITAL LETTER AE WITH ACUTE
    ǽ 01FD GC=Ll LATIN SMALL LETTER AE WITH ACUTE
    ᴭ 1D2D GC=Lm MODIFIER LETTER CAPITAL AE
    ◌ᷔ 1DD4 GC=Mn COMBINING LATIN SMALL LETTER AE

But that's just in non-locale UCA. With the German phonebook locale, all
these have the same UCA1 as "ae" has:

    Ä 00C4 GC=Lu LATIN CAPITAL LETTER A WITH DIAERESIS
    Æ 00C6 GC=Lu LATIN CAPITAL LETTER AE
    ä 00E4 GC=Ll LATIN SMALL LETTER A WITH DIAERESIS
    æ 00E6 GC=Ll LATIN SMALL LETTER AE
    Ǟ 01DE GC=Lu LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON
    ǟ 01DF GC=Ll LATIN SMALL LETTER A WITH DIAERESIS AND MACRON
    Ǣ 01E2 GC=Lu LATIN CAPITAL LETTER AE WITH MACRON
    ǣ 01E3 GC=Ll LATIN SMALL LETTER AE WITH MACRON
    Ǽ 01FC GC=Lu LATIN CAPITAL LETTER AE WITH ACUTE
    ǽ 01FD GC=Ll LATIN SMALL LETTER AE WITH ACUTE
    ᴭ 1D2D GC=Lm MODIFIER LETTER CAPITAL AE
    ◌ᷔ 1DD4 GC=Mn COMBINING LATIN SMALL LETTER AE

So, now how do you match an "a" and a circumflex? You might have this
grapheme:

    LATIN SMALL LETTER AE WITH MACRON
    COMBINING CIRCUMFLEX ACCENT

Unlike

    ǳ 01F3 GC=Ll LATIN SMALL LETTER DZ

there is no NFKD that gives you access to the contraction. In this way,
LATIN SMALL LETTER AE is like

    ʣ 02A3 GC=Ll LATIN SMALL LETTER DZ DIGRAPH

which also has no decomposition (despite being UCA1/UCA2 equivalent to
"dz"). So you have to go to the UCA.
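Both facts, the canonical reordering that defeats the simple lookahead and
the contractions that no decomposition opens up, can be confirmed with a
few lines of Python (illustrative only; the patterns above are Perl):

```python
import unicodedata as ud

# U+1EAD: the dot below (ccc 220) sorts before the circumflex (ccc 230)
# under canonical ordering, so NFD leaves the circumflex NON-adjacent to
# the "a" -- which is exactly why /a\N{COMBINING CIRCUMFLEX ACCENT}/ fails.
assert ud.normalize("NFD", "\u1EAD") == "a\u0323\u0302"
assert ud.combining("\u0323") == 220   # COMBINING DOT BELOW
assert ud.combining("\u0302") == 230   # COMBINING CIRCUMFLEX ACCENT

# Neither U+00E6 (ae) nor U+02A3 (dz digraph) has any decomposition at all,
# so no normalization form exposes the letters "inside" them:
assert ud.normalize("NFKD", "\u00E6") == "\u00E6"
assert ud.normalize("NFKD", "\u02A3") == "\u02A3"
```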
And you have to modify your pattern to do something like this, provided you
want the "a" first:

    m{
        (?= a
            \p{Grapheme_Base} *
            \p{Grapheme_Extend} *
            \N{COMBINING CIRCUMFLEX ACCENT}
        )
        \X
    }x

or like this if you don't care where the "a" is:

    m{
        (?= \p{Grapheme_Base} *
            a
            \p{Grapheme_Base} *
            \p{Grapheme_Extend} *
            \N{COMBINING CIRCUMFLEX ACCENT}
        )
        \X
    }x

That presupposes the \X will be able to keep as a whole grapheme cluster,
even under NFKD and/or the UCA, anything that began as a single grapheme
cluster before you started.

And all this is because we're trying to go at things in ways that don't
surprise the user. If we're doing UCA cleverness with letters, then we
should probably consider doing it with more than that. The UCA doesn't
really consider all these the same (no surprise there), but I'm sure the
user might consider at least some of them so:

    ^ 005E GC=Sk CIRCUMFLEX ACCENT
    ˆ 02C6 GC=Lm MODIFIER LETTER CIRCUMFLEX ACCENT
    ◌̂ 0302 GC=Mn COMBINING CIRCUMFLEX ACCENT
    ◌̭ 032D GC=Mn COMBINING CIRCUMFLEX ACCENT BELOW
    ◌᷍ 1DCD GC=Mn COMBINING DOUBLE CIRCUMFLEX ABOVE
    ＾ FF3E GC=Sk FULLWIDTH CIRCUMFLEX ACCENT

I hope this shows what can be done already, what cannot, and what is really
rather difficult. I think it shows some of the challenges to meeting users'
(perfectly reasonable) expectations about perceived characters.

We really are going to have to do more in the regex engine, although I'm not
completely certain what yet. I really hate the idea of having to recalculate
decompositions, let alone take UCA keys, again and again. I'm afraid one may
have no choice, though.

Imagine you have these three strings:

    crème brûlée
    boîte
    château

And you want to be able to support not just a search for

    COMBINING CIRCUMFLEX ACCENT

which would get all three, but also

    LATIN SMALL LETTER A WITH CIRCUMFLEX

which would get only the last one. As things currently stand, you have to
run every match twice, once on NFD and once on NFC. I find this troublesome.

Hope this helps!

--tom
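For simple substring searches (as opposed to full regex matching with
capture indexes into the original text), one workaround is to pre-normalize
both needle and haystack to NFD before comparing. A deliberately naive
Python sketch (ceq_find is a hypothetical helper, not an existing API, and
it ignores the index-mapping and grapheme-boundary issues discussed above):

```python
import unicodedata as ud

def ceq_find(haystack: str, needle: str) -> bool:
    """Substring test under canonical equivalence: NFD both sides first,
    so it works the same on NFC, NFD, or unnormalized input."""
    return ud.normalize("NFD", needle) in ud.normalize("NFD", haystack)

words = ["cr\u00E8me br\u00FBl\u00E9e", "bo\u00EEte", "ch\u00E2teau"]

# A search for a bare COMBINING CIRCUMFLEX ACCENT finds all three ...
assert all(ceq_find(w, "\u0302") for w in words)

# ... while LATIN SMALL LETTER A WITH CIRCUMFLEX finds only "château",
# even though none of the test strings literally contain "a\u0302".
assert [ceq_find(w, "\u00E2") for w in words] == [False, False, True]
```

This only sidesteps the run-it-twice problem for boolean containment; the
moment you need match positions in the caller's original string, you are
back to the index-mapping difficulties Mark mentioned.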