Re: Matching subpatterns in any order, conjunctions, negated matches

Joseph Brenner Sat, 16 May 2020 17:49:24 -0700

This is pretty interesting, though I think you're talking about a
different subject... I was talking about cases where the
sub-patterns are more than one character long.  It's true that if
you were interested in single-character sub-patterns, then you
could get close with character classes and negated character
classes.  For example, suppose you were not only interested in
matching the word "the" but you were also interested in finding
it even if it had typos like "teh" or "hte" or something.  Then a
pattern like /<[the]>**3/ would definitely find every permutation
of "t", "h", and "e", and that may be good enough for you,
but it would also match cases with duplications like "ttt", "hhh",
"thh" and so on, so it's a broader match than the techniques I
was talking about (it gets the combinations, not just the
permutations).
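
To make the difference concrete, here is a small sketch. The word list
and the is-permutation helper are made up for illustration; they are
not from anyone's actual code. It compares the character-class pattern
with a check that only accepts true rearrangements of the letters:

```raku
# Hypothetical word list, chosen only for illustration.
my @words = <the teh hte ttt thh tea>;

# Hypothetical helper: true only when $word uses exactly the letters of $target.
sub is-permutation(Str $word, Str $target --> Bool) {
    $word.comb.sort.join eq $target.comb.sort.join
}

for @words -> $w {
    # Anchored so that the whole word must be three letters from the class.
    my $class-match = so $w ~~ / ^ <[the]> ** 3 $ /;
    say "$w  class: $class-match  permutation: { is-permutation($w, 'the') }";
}
```

Here "ttt" and "thh" pass the character-class test but fail the
permutation test, which is the broader-match problem described above.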



On 5/16/20, William Michels <w...@caa.columbia.edu> wrote:
> On Fri, May 15, 2020 at 7:33 PM Joseph Brenner <doom...@gmail.com> wrote:
>>
>> Regex engines by their nature care a lot about order, but I
>> occasionally want to relax that to match for multiple
>> multicharacter subpatterns where the order of them doesn't
>> matter.
>>
>> Frequently the simplest thing to do is just to do multiple
>> matches.  Let's say you're looking for words that have a "qu", a
>> "th" and also, say, an "ea".  This works:
>>
>>   my $DICT  = "/usr/share/dict/american-english";
>>   my @hits = $DICT.IO.open( :r ).lines.grep({/qu/}).grep({/th/}).grep({/ea/});
>>   say @hits;
>>   # [bequeath bequeathed bequeathing bequeaths earthquake earthquake's
>>   #  earthquakes]
>>
>>
>> It could be useful to be able to do it as one match though, for
>> example, you might be using someone else's routine which takes a
>> single regex as argument.  I've been known to write things like
>> this:
>>
>>   my regex qu_th_ea   {  [ qu .*? th .*? ea ] |
>>                          [ qu .*? ea .*? th ] |
>>                          [ th .*? qu .*? ea ] |
>>                          [ th .*? ea .*? qu ] |
>>                          [ ea .*? th .*? qu ] |
>>                          [ ea .*? qu .*? th ]  };
>>   my @hits = $DICT.IO.open( :r ).lines.grep({/<qu_th_ea>/});
>>
>> That works, but it gets unwieldy quickly if you need to scale up
>> the number of subpatterns.
>>
>> Recently though, I noticed the "conjunctions" feature, and it
>> occurred to me that this could be a very neat way of handling
>> these things:
>>
>>   my regex qu_th_ea { ^ [ .* qu .* & .* th .* & .* ea .* ] $ };
>>
>> That's certainly much better, though unfortunately each element
>> of the conjunction needs to match a substring of the same length,
>> so pretty frequently you're stuck with the visual noise of
>> bracketing subpatterns with pairs of .*
>>
>> Where things get interesting is when you want a negated match of
>> one of the subpatterns.  One of the things I like about the first
>> approach using multiple chained greps is that it's easy to do a
>> reverse match.  What if you want words with "qu" and "th" but
>> want to *skip* ones with an "ea"?
>>
>>   my @hits = $DICT.IO.open( :r ).lines.grep({/qu/}).grep({/th/}).grep({!/ea/});
>>   # [Asquith discotheque discotheque's discotheques quoth]
>>
>> To do that in one regex, it would be nice if there were some sort
>> of adverb to do a reverse match, like say :not, then it
>> would be straight-forward (NOTE: NON-WORKING CODE):
>>
>>   my regex qu_th_ea { ^ [ .* qu .* & .* th .* &  [ :not .* ea .* ] ] $ };
>>
>> But since there isn't an adverb like this, what else might we do?
>> The best idea I can come up with is this:
>>
>>   my regex qu_th_ea { ^ [ .* qu .* & .* th .* &  [ <!after ea> . ]*  ] $ };
>>
>> Where the third element of the conjunction should match only if
>> none of the characters follow "ea".  There's an oddity here
>> though in that I think this can get confused by things like an
>> "ea" that *precedes* the conjunction.
>>
>> So, the question then is: is there a neater way to embed a
>> subpattern in a regex that does a negated match?
>
> My two cents: Here's that Github issue Yary opened on creating custom
> character classes. Some ideas expressed may be useful for figuring out
> how to do a "negative regex":
>
> https://github.com/Raku/problem-solving/issues/97
>
> One take-home message that I've ascertained from the github discussion
> above is that Raku/Perl6 regexes are run "wide-open" across all
> umpteen Unicode hyperplanes (blocks/scripts), i.e. for a 'positive'
> regex, matches will only be limited by the source document you're
> scanning against. So the match below against Bengali digits works
> right out of the box (REPL below):
>
>> say $/ if '০১২৩৪৫৬৭৮৯' ~~ / \d+ /;
> ｢০১২৩৪৫৬৭৮৯｣
>>
>
> Negative regexes are more problematic: with the example above, do you
> really want to return all "non-digit" characters? Or only alphanumeric
> characters without the numeric component? Or only English ('Latin')
> alphanumeric characters without the numeric component?
>
>> say $/ if '০১২৩৪৫৬৭৮৯' ~~ / <-[\d]>+ /;
> ()
>> say $/ if 'অআই ০১২৩৪৫৬৭৮৯' ~~ / <-[\d]>+ /;
> ｢অআই ｣
>>
>
> Above the Bengali letters "a", "aa", and "i" are (correctly) returned
> by the second regex test. However in the contrived example below,
> where I intersperse Latin letters with Bengali numbers ('a০b১c২') and
> wrap the construct in angle brackets, I see a match of
> "Latin-non-digits" adjacent to "non-Latin-digits". This says to me
> that the call to (filter of) "<:Script<Latin>" does NOT distribute
> over the two different digit requirements, "-[\d]" and "+[\d]".
> Moreover, I don't know HOW to get it to distribute over the two
> different digit requirements, even if I "expand" the regex:
>
>> say $/ if 'a০b১c২' ~~ / <:Script<Latin>-[\d]+[\d]>+ /;
> ｢a০b১c২｣
>> say $/ if 'a০b১c২' ~~ / <:Script<Latin>-[\d]> <:Script<Latin>+[\d]> /;
> ｢a০｣
>>
>
> Anyway Joseph, I like your point about using a conjunction and/or
> adverb to do a reverse match. I just: 1) wanted to expand the
> conversation to Unicode scripts/blocks, and 2) I posted the contrived
> ("interspersed") examples above in the hope that someone will reply on
> the list to explain what I'm doing wrong (because I don't know how to
> subset the second half of the regex to only return **Latin digits**).
>
> Best Regards, Bill.
>
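
On the negated-subpattern question quoted above, one possible approach,
offered as a sketch rather than a definitive answer, is to stack
lookahead assertions at the start of the string. Note that this is a
different technique from the "&" conjunction: it uses <?before ...> and
<!before ...> instead. The regex name qu_th_no_ea is made up for
illustration, and the sample words are taken from the results quoted
above:

```raku
# Sketch: require "qu" and "th" in any order, but reject any word containing "ea".
# Each lookahead is zero-width, so all three are tested from the start of the string.
my regex qu_th_no_ea { ^ <?before .* qu> <?before .* th> <!before .* ea> };

# Sample words taken from the results quoted above.
my @words = <bequeath earthquake Asquith quoth discotheque>;
say @words.grep({ /<qu_th_no_ea>/ });
```

With these sample words, the grep should keep Asquith, quoth and
discotheque and drop the two words that contain "ea". Because the
assertions are anchored at the start of the string, this only works
when the regex is applied to a whole word or line at a time.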
