Re: Proposed update to UTS#18

2011-04-26 Thread Tom Christiansen
Andy Heninger wrote: >>> I actually had to do this because I have a dataset that has things like >>> "undeaðlich" and "smørrebrød", and I wanted to allow the user to >>> head-match with "undead" and "smor", respectively. There is no >>> decomposition of "ð" that includes "d", nor any of "ø" that in
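The head-matching problem quoted above can be illustrated with a minimal Python sketch. It is not taken from the thread: the names `EXTRA_FOLDS`, `strip_marks`, `loose_key`, and `head_match` are hypothetical, and the fold table is a hand-written assumption covering only the two letters mentioned. The point is that normalization alone cannot help, because "ð" and "ø" have no decomposition in the Unicode Character Database, so some explicit folding table is needed.

```python
import unicodedata

# Hypothetical hand-written folds (an assumption for illustration only):
# neither letter has a Unicode decomposition, so NFD/NFKD leave them alone.
EXTRA_FOLDS = str.maketrans({"ð": "d", "ø": "o"})


def strip_marks(text):
    """Decompose to NFD and drop combining marks (e.g. 'é' becomes 'e')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def loose_key(text):
    """Build a loose comparison key: no marks, extra folds applied, case-folded."""
    return strip_marks(text).translate(EXTRA_FOLDS).casefold()


def head_match(word, prefix):
    """Return True if word starts with prefix under the loose key."""
    return loose_key(word).startswith(loose_key(prefix))


if __name__ == "__main__":
    # Both letters report an empty decomposition mapping.
    print(repr(unicodedata.decomposition("ð")), repr(unicodedata.decomposition("ø")))
    print(head_match("undeaðlich", "undead"), head_match("smørrebrød", "smor"))
```

A real application would more likely derive such folds from collation data than from a two-entry table; the table here only makes the gap in decomposition-based matching visible.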

Re: Proposed update to UTS#18

2011-04-15 Thread Tom Christiansen
I hope you all know there is a lot of handwaving at the end of my last posting. :) That's because it isn't actually implementable as things stand. There's no current way to track what was a single grapheme before the regex gets its hands on it if that regex engine is doing some sort of decompositi
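To make the tracking problem concrete, here is a rough Python sketch of one way to remember which original character each decomposed code point came from. It is an illustration under stated assumptions, not something proposed in the thread: `decompose_with_offsets` and `original_span` are hypothetical names, the sketch decomposes one code point at a time (so it skips the canonical reordering that whole-string NFD performs across adjacent combining marks), and it tracks code points rather than full grapheme clusters.

```python
import unicodedata


def decompose_with_offsets(text):
    """Return (decomposed, origin), where origin[i] is the index in `text`
    of the code point that produced decomposed[i].

    Assumption: each code point is decomposed on its own, so marks are
    never reordered across original characters.
    """
    pieces = []
    origin = []
    for index, ch in enumerate(text):
        expanded = unicodedata.normalize("NFD", ch)
        pieces.append(expanded)
        origin.extend([index] * len(expanded))
    return "".join(pieces), origin


def original_span(start, end, origin):
    """Map a non-empty half-open span [start, end) in the decomposed text
    back to a half-open span in the original text."""
    return origin[start], origin[end - 1] + 1


if __name__ == "__main__":
    decomposed, origin = decompose_with_offsets("sm\u00f6rg\u00e5s")
    print(len(decomposed), origin)
```

Even this simplified map shows the difficulty: a match that covers only the base letter of a decomposed character maps back to the whole original character, so the engine still has to decide whether such a partial match should count.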

Re: Proposed update to UTS#18

2011-04-15 Thread Andy Heninger
On Fri, Apr 15, 2011 at 8:01 AM, Mark Davis ☕ wrote: > The biggest issue is that any transformation that changes the number of > characters, or rearranges them, is problematic, for the reasons outlined in > the PRI. > > An example might be /(a|b|c*(?=...)|...)(d|...|a)/, which for Danish (unde

Re: Proposed update to UTS#18

2011-04-15 Thread Mark Davis ☕
The biggest issue is that any transformation that changes the number of characters, or rearranges them, is problematic, for the reasons outlined in the PRI. An example might be /(a|b|c*(?=...)|...)(d|...|a)/, which for Danish (under a collation transform, strength 2) should match any of {aa, aA,.

Re: Proposed update to UTS#18

2011-04-14 Thread Tom Christiansen
Thanks, Mark. I've been trying to think about what to say to it. I'd like to know more about what is planned in the "canonical matching" area. I do understand why reordering makes exact matching impossible. However, I should think one of several sorts of loose matching might still be done. Maybe that

Proposed update to UTS#18

2011-04-14 Thread Mark Davis ☕
BTW, feedback is welcome on Proposed Update UTS #18: Unicode Regular Expressions Mark