Re: Matching subpatterns in any order, conjunctions, negated matches

Joseph Brenner Sun, 17 May 2020 11:41:24 -0700

Peter Pentchev wrote:

> Actually, there is, and I conveniently did not mention it :) It's the
> case when the patterns may overlap: if you do the '<?before' thing with
> 'the' and 'entrance', you might match 'thentrance', which, depending on
> your use case, might not be ideal.


That's a good point, but it's true that it's a matter of
your actual use case.   Using your technique to look for words
with a "qu", a "ue" and an "en":

  my regex qu_th_not_ea_3b_pos
     { ^ <?before .* "qu" > <?before .* "ue" > <?before .* "en" > };

That matches an overlapping case like "queen", but that strikes me as
okay-- it might surprise someone, but it'd be a pretty minor surprise
(unless maybe if you were playing scrabble and trying to get rid
of all your letters...).

Also, things like my multipattern triple-grep approach would show
the same behavior.
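
For what it's worth, here is a minimal sketch of that comparison. The
regex name qu_ue_en and the two sample words are made up for
illustration (they are not from the dictionary file), and the chained
greps use the plain .grep(/.../) form:

```raku
# The same three lookaheads as above, under a shorter, hypothetical name.
my regex qu_ue_en { ^ <?before .* "qu" > <?before .* "ue" > <?before .* "en" > };

# Hypothetical sample words: "queen" has all three subpatterns, but they
# overlap; "question" has "qu" and "ue" but no "en".
my @words = <queen question>;

# Lookahead version: each <?before> starts again from ^, so overlaps pass.
my @lookahead-hits = @words.grep(/<qu_ue_en>/);

# Chained greps: each grep tests independently, so overlaps pass here too.
my @grep-hits = @words.grep(/qu/).grep(/ue/).grep(/en/);

say @lookahead-hits;   # [queen]
say @grep-hits;        # [queen]
```

Both approaches check each subpattern against the whole string on its
own, which is why neither one notices that the pieces share letters.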

The kind of edge case I was talking about was with things like my
conjunction approach, using a negated after to get a negative match:

  my regex qu_th_not_ea
    { ^ [ .* qu .* & .* th .* &  [ <!after ea>. ]* ] $ };

That works pretty well, but it would pass a string like "quothea"
in error, because that after is making sure that none of the
characters in the string follow a "ea", and when the "ea" is at
the end then there is nothing that follows.   Your idiom working
off of ^ is better, because every string has a beginning...
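
A small sketch of the difference, assuming three hand-picked test
strings ("quothea" is not a real word, and the name
qu_th_not_ea_before is just something I made up for your lookahead
version):

```raku
# Conjunction version: <!after ea> is only checked in front of a character,
# so an "ea" at the very end of the string is never looked at.
my regex qu_th_not_ea
    { ^ [ .* qu .* & .* th .* &  [ <!after ea>. ]* ] $ };

# Lookahead version: <!before .* "ea"> scans forward from ^, so it sees
# an "ea" anywhere in the string, including at the end.
my regex qu_th_not_ea_before
    { ^ <?before .* "qu" > <?before .* "th" > <!before .* "ea" > };

# Hypothetical test strings: one good hit, one trailing "ea", one inner "ea".
my @words = <quoth quothea bequeath>;

my @conj-hits   = @words.grep(/<qu_th_not_ea>/);
my @before-hits = @words.grep(/<qu_th_not_ea_before>/);

say @conj-hits;     # per the reasoning above, "quothea" slips through here
say @before-hits;   # [quoth]
```

The "bequeath" case is there to show that the conjunction version
does catch an "ea" as long as some character follows it.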

And of course, if you were matching for words in multiword text
then you'd just use that word boundary, so that's no problem.

(I was also worrying vaguely about some other things that don't
pan out, like what if there was an additional "qu" in front of
what you were trying to match and there was no way to subdivide
by lines or words or something-- but what could that possibly
mean?  In that case the "qu" is just part of the string and fair
game to match against, so...)



On 5/16/20, Peter Pentchev <r...@ringlet.net> wrote:
> On Sat, May 16, 2020 at 05:53:04PM -0700, Joseph Brenner wrote:
>>  Peter Pentchev <r...@ringlet.net> wrote:
>> > On Fri, May 15, 2020 at 07:32:50PM -0700, Joseph Brenner wrote:
>> >> Regex engines by their nature care a lot about order, but I
>> >> occasionally want to relax that to match for multiple
>> >> multicharacter subpatterns where the order of them doesn't
>> >> matter.
>> >>
>> >> Frequently the simplest thing to do is just to do multiple
>> >> matches.   Let's say you're looking for words that have a "qu", a
>> >> "th" and also, say, an "ea".  This works:
>> >>
>> >>   my $DICT  = "/usr/share/dict/american-english";
>> >>   my @hits = $DICT.IO.open( :r ).lines.grep({/qu/}).grep({/th/}).grep({/ea/});
>> >>   say @hits;
>> >>   # [bequeath bequeathed bequeathing bequeaths earthquake earthquake's earthquakes]
>> >
>> > Would something like this work for you?
>> >
>> >   /^ <?before .* "qu" > <?before .* "th" > <?before .* "ea" > /
>> >
>> >> Where things get interesting is when you want a negated match of
>> >> one of the subpatterns.  One of the things I like about the first
>> >> approach using multiple chained greps is that it's easy to do a
>> >> reverse match.  What if you want words with "qu" and "th" but
>> >> want to *skip* ones with an "ea"?
>> >>
>> >>   my @hits = $DICT.IO.open( :r ).lines.grep({/qu/}).grep({/th/}).grep({!/ea/});
>> >>   # [Asquith discotheque discotheque's discotheques quoth]
>> >
>> > Maybe something like this? (note the "!" instead of "?")
>> >
>> >   /^ <?before .* "qu" > <?before .* "th" > <!before .* "ea" > /
>> >
>>
>> Yes, both of those work, and arguably they're a little cleaner
>> looking than my conjunction approach-- though it's not necessarily any
>> easier to think about.  It looks like a pattern that's matching
>> for three things in order, but the zero-widthness of the "before"
>> lets them all work on top of each other.
>>
>> I keep thinking there's an edge case in these before/after tricks that
>> might matter if we weren't matching the one-word-per-line format of
>> the unix dictionaries, but I need to think about that a little more...
>
> Actually, there is, and I conveniently did not mention it :) It's the
> case when the patterns may overlap: if you do the '<?before' thing with
> 'the' and 'entrance', you might match 'thentrance', which, depending on
> your use case, might not be ideal.
>
> I've thought a little about another method: splitting the string using
> one of the patterns as a separator, then splitting each of the resulting
> substrings using the next one and so on until you get to the last one,
> where you check whether any of the ministrings contains it, but it would
> have to be done carefully, it would have to somehow be done with
> a special split-like function that would find all of the occurrences of
> the pattern and return tuples "before" and "after" to avoid another kind
> of problems with overlaps: if you split "the father" on all of
> the occurrences of "the" at the same time, you *will* miss "father" :)
> So you need a special sort of split function that will split
> "the father" first as ("", " father"), then as ("the fa", "r"), and return
> all of the non-empty results (" father", "the fa", "r")... I'm not sure
> this will be very efficient. OK, so as a microoptimization it may return
> all of the results that are at least as long as the shortest pattern
> remaining, but it still sounds weird.
>
> G'luck,
> Peter
>
> --
> Peter Pentchev  r...@ringlet.net r...@debian.org p...@storpool.com
> PGP key:        http://people.FreeBSD.org/~roam/roam.key.asc
> Key fingerprint 2EE7 A7A5 17FC 124C F115  C354 651E EFB0 2527 DF13
>
