Re: Matching subpatterns in any order, conjunctions, negated matches

William Michels via perl6-users Sat, 16 May 2020 13:44:25 -0700

On Fri, May 15, 2020 at 7:33 PM Joseph Brenner <doom...@gmail.com> wrote:
>
> Regex engines by their nature care a lot about order, but I
> occasionally want to relax that to match for multiple
> multicharacter subpatterns where the order of them doesn't
> matter.
>
> Frequently the simplest thing to do is just to do multiple
> matches.   Let's say you're looking for words that have a "qu" a
> "th" and also, say an "ea".  This works:
>
>   my $DICT  = "/usr/share/dict/american-english";
>   my @hits = $DICT.IO.open( :r ).lines.grep({/qu/}).grep({/th/}).grep({/ea/});
>   say @hits;
>   # [bequeath bequeathed bequeathing bequeaths earthquake earthquake's earthquakes]
>
>
> It could be useful to be able to do it as one match though, for
> example, you might be using someone else's routine which takes a
> single regex as argument.  I've been known to write things like
> this:
>
>   my regex qu_th_ea   {  [ qu .*? th .*? ea ] |
>                          [ qu .*? ea .*? th ] |
>                          [ th .*? qu .*? ea ] |
>                          [ th .*? ea .*? qu ] |
>                          [ ea .*? th .*? qu ] |
>                          [ ea .*? qu .*? th ]  };
>   my @hits = $DICT.IO.open( :r ).lines.grep({/<qu_th_ea>/});
>
> That works, but it gets unwieldy quickly if you need to scale up
> the number of subpatterns.
>
> Recently though, I noticed the "conjunctions" feature, and it
> occurred to me that this could be a very neat way of handling
> these things:
>
>   my regex qu_th_ea { ^ [ .* qu .* & .* th .* & .* ea .* ] $ };
>
> That's certainly much better, though unfortunately each element
> of the conjunction needs to match a substring of the same length,
> so pretty frequently you're stuck with the visual noise of
> bracketing subpatterns with pairs of .*
>
> Where things get interesting is when you want a negated match of
> one of the subpatterns.  One of the things I like about the first
> approach using multiple chained greps is that it's easy to do a
> reverse match.  What if you want words with "qu" and "th" but
> want to *skip* ones with an "ea"?
>
>   my @hits = $DICT.IO.open( :r ).lines.grep({/qu/}).grep({/th/}).grep({!/ea/});
>   # [Asquith discotheque discotheque's discotheques quoth]
>
> To do that in one regex, it would be nice if there were some sort
> of adverb to do a reverse match, like say :not, then it
> would be straight-forward (NOTE: NON-WORKING CODE):
>
>   my regex qu_th_ea { ^ [ .* qu .* & .* th .* &  [ :not .* ea .* ] ] $ };
>
> But since there isn't an adverb like this, what else might we do?
> The best idea I can come up with is this:
>
>   my regex qu_th_ea { ^ [ .* qu .* & .* th .* &  [ <!after ea> . ]*  ] $ };
>
> Where the third element of the conjunction should match only if
> none of the characters follow "ea".  There's an oddity here
> though in that I think this can get confused by things like an
> "ea" that *precedes* the conjunction.
>
> So, the question then is: is there a neater way to embed a
> subpattern in a regex that does a negated match?


My two cents: Here's that Github issue Yary opened on creating custom
character classes. Some ideas expressed may be useful for figuring out
how to do a "negative regex":

https://github.com/Raku/problem-solving/issues/97
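
For the narrower question of negating one subpattern inside a single
regex, lookahead assertions anchored at the start of the string might
be one way to go. This is only a sketch, not something taken from the
GitHub discussion: the regex name qu_th_not_ea is made up for
illustration, and the short word list stands in for the dictionary
file.

```raku
# Hypothetical name. Each <?before ...> must succeed from the start of
# the word, and <!before .* ea> fails the match if "ea" occurs anywhere.
my regex qu_th_not_ea { ^ <?before .* qu> <?before .* th> <!before .* ea> };

# Stand-in for the dictionary file used earlier in the thread.
my @words = <bequeath earthquake Asquith quoth discotheque queen>;
my @hits  = @words.grep({ /<qu_th_not_ea>/ });
say @hits;
# [Asquith quoth discotheque]
```

Since the lookaheads consume nothing, the order of the subpatterns
doesn't matter, and each extra subpattern costs one more assertion
rather than another set of permutations.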

One take-home message from the GitHub discussion above is that
Raku/Perl6 regexes run "wide-open" across all of Unicode (every
script and block), i.e. for a 'positive' regex, matches are limited
only by the source document you're scanning. So the match below
against Bengali digits works right out of the box (REPL below):

> say $/ if '০১২৩৪৫৬৭৮৯' ~~ / \d+ /;
｢০১২৩৪৫৬৭৮৯｣
>

Negative regexes are more problematic: with the example above, do you
really want to return all "non-digit" characters? Or only alphanumeric
characters without the numeric component? Or only English ('Latin')
alphanumeric characters without the numeric component?

> say $/ if '০১২৩৪৫৬৭৮৯' ~~ / <-[\d]>+ /;
()
> say $/ if 'অআই ০১২৩৪৫৬৭৮৯' ~~ / <-[\d]>+ /;
｢অআই ｣
>

Above the Bengali letters "a", "aa", and "i" are (correctly) returned
by the second regex test. However in the contrived example below,
where I intersperse Latin letters with Bengali numbers ('a০b১c২') and
wrap the construct in angle brackets, I see a match of
"Latin-non-digits" adjacent to "non-Latin-digits". This says to me
that the "<:Script<Latin>>" filter does NOT distribute over the two
different digit requirements, "-[\d]" and "+[\d]".
Moreover, I don't know HOW to get it to distribute over the two
different digit requirements, even if I "expand" the regex:

> say $/ if 'a০b১c২' ~~ / <:Script<Latin>-[\d]+[\d]>+ /;
｢a০b১c২｣
> say $/ if 'a০b১c২' ~~ / <:Script<Latin>-[\d]> <:Script<Latin>+[\d]> /;
｢a০｣
>

Anyway Joseph, I like your point about using a conjunction and/or
adverb to do a reverse match. I just wanted to: 1) expand the
conversation to Unicode scripts/blocks, and 2) post the contrived
("interspersed") examples above in the hope that someone will reply on
the list to explain what I'm doing wrong (because I don't know how to
subset the second half of the regex to only return **Latin digits**).
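
For what it's worth, here is one possible reading of the interspersed
example, offered as a guess rather than a confirmed answer. The set
operations inside <...> appear to be applied left to right, so
<:Script<Latin>-[\d]+[\d]> would mean "Latin minus digits, plus all
digits", and the trailing +[\d] lets the Bengali digits back in. Also,
in Unicode the ASCII digits 0-9 carry Script=Common rather than Latin,
so :Script<Latin> can't select them. Assuming "Latin digits" means the
ASCII digits, spelling out the range [0..9] might be the simplest
workaround:

```raku
# Assumption: "Latin digits" here means the ASCII digits 0 through 9.
say so '০১২৩' ~~ / ^ \d+ $ /;         # True: \d accepts any Unicode decimal digit
say so '০১২৩' ~~ / ^ <[0..9]>+ $ /;   # False: the explicit range is ASCII only
say so '0' ~~ / <:Script<Latin>> /;    # False: ASCII digits are Script=Common

# Latin letters plus ASCII digits; the first Bengali digit ends the match.
say ~$/ if 'a০b১c২' ~~ / <:Script<Latin>+[0..9]>+ /;   # a
say ~$/ if 'a0b1c2' ~~ / <:Script<Latin>+[0..9]>+ /;   # a0b1c2
```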

Best Regards, Bill.
