On Fri, May 15, 2020 at 7:33 PM Joseph Brenner <doom...@gmail.com> wrote:
>
> Regex engines by their nature care a lot about order, but I
> occasionally want to relax that to match for multiple
> multicharacter subpatterns where the order of them doesn't
> matter.
>
> Frequently the simplest thing to do is just to just do multiple
> matches. Let's say you're looking for words that have a "qu" a
> "th" and also, say an "ea". This works:
>
>   my $DICT = "/usr/share/dict/american-english";
>   my @hits = $DICT.IO.open( :r ).lines.grep({/qu/}).grep({/th/}).grep({/ea/});
>   say @hits;
>   # [bequeath bequeathed bequeathing bequeaths earthquake earthquake's
>   #  earthquakes]
>
> It could be useful to be able to do it as one match though, for
> example, you might be using someone else's routine which takes a
> single regex as argument. I've been known to write things like
> this:
>
>   my regex qu_th_ea { [ qu .*? th .*? ea ] |
>                       [ qu .*? ea .*? th ] |
>                       [ th .*? qu .*? ea ] |
>                       [ th .*? ea .*? qu ] |
>                       [ ea .*? th .*? qu ] |
>                       [ ea .*? qu .*? th ] };
>   my @hits = $DICT.IO.open( :r ).lines.grep({/<qu_th_ea>/});
>
> That works, but it gets unwieldy quickly if you need to scale up
> the number of subpatterns.
>
> Recently though, I noticed the "conjunctions" feature, and it
> occured to me that this could be a very neat way of handling
> these things:
>
>   my regex qu_th_ea { ^ [ .* qu .* & .* th .* & .* ea .* ] $ };
>
> That's certainly much better, though unfortunately each element
> of the conjunction needs to match a substring of the same length,
> so pretty frequently you're stuck with the visual noise of
> bracketing subpatterns with pairs of .*
>
> Where things get interesting is when you want a negated match of
> one of the subpatterns. One of the things I like about the first
> approach using multiple chained greps is that it's easy to do a
> reverse match. What if you want words with "qu" and "th" but
> want to *skip* ones with an "ea"?
>
>   my @hits = $DICT.IO.open( :r ).lines.grep({/qu/}).grep({/th/}).grep({!/ea/});
>   # [Asquith discotheque discotheque's discotheques quoth]
>
> To do that in one regex, it would be nice if there were some sort
> of adverb to do a reverse match, like say :not, then it
> would be straight-forward (NOTE: NON-WORKING CODE):
>
>   my regex qu_th_ea { ^ [ .* qu .* & .* th .* & [ :not .* ea .* ] ] $ };
>
> But since there isn't an adverb like this, what else might we do?
> The best idea I can come up with is this:
>
>   my regex qu_th_ea { ^ [ .* qu .* & .* th .* & [ <!after ea> . ]* ] $ };
>
> Where the third element of the conjunction should match only if
> none of the characters follow "ea". There's an oddity here
> though in that I think this can get confused by things like an
> "ea" that *precedes* the conjunction.
>
> So, the question then is: is there a neater way to embed a
> subpattern in a regex that does a negated match?
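
On the single-regex negation asked about above, one option that might be worth trying is a negative lookahead anchored at the start of the string, so that an "ea" anywhere in the word rejects it. This is only a minimal sketch: the regex name qu_th_no_ea and the short word list are made up for illustration (the list stands in for the dictionary file), and it assumes that <!before ...> behaves as a zero-width negative lookahead when placed right after ^.

```raku
# Hypothetical name: keeps the conjunction from the quoted post and adds
# a negative lookahead that fails if "ea" occurs anywhere in the string.
my regex qu_th_no_ea { ^ <!before .* ea> [ .* qu .* & .* th .* ] $ };

# Made-up sample words standing in for the dictionary file.
my @words = <bequeath earthquake Asquith quoth discotheque>;

# Keep only the words containing "qu" and "th" but no "ea".
my @hits = @words.grep({ /<qu_th_no_ea>/ });
say @hits;
```

Because the lookahead sits right after ^, it scans the whole string, which should sidestep the worry about an "ea" that precedes the conjunction.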

My two cents:

Here's the GitHub issue Yary opened on creating custom character
classes. Some of the ideas expressed there may be useful for figuring
out how to do a "negative regex":

https://github.com/Raku/problem-solving/issues/97

One take-home message I got from that GitHub discussion is that
Raku/Perl6 regexes run "wide-open" across all of Unicode (every
block/script), i.e. for a 'positive' regex, matches are limited only by
the source document you're scanning against. So the match below
against Bengali digits works right out of the box (REPL below):

    > say $/ if '০১২৩৪৫৬৭৮৯' ~~ / \d+ /;
    「০১২৩৪৫৬৭৮৯」

Negative regexes are more problematic: with the example above, do you
really want to return all "non-digit" characters? Or only alphanumeric
characters minus the digits? Or only English ('Latin') alphanumeric
characters minus the digits?

    > say $/ if '০১২৩৪৫৬৭৮৯' ~~ / <-[\d]>+ /;
    ()
    > say $/ if 'অআই ০১২৩৪৫৬৭৮৯' ~~ / <-[\d]>+ /;
    「অআই 」

Above, the Bengali letters "a", "aa", and "i" (plus the trailing space)
are correctly returned by the second regex test. However, in the
contrived example below, where I intersperse Latin letters with Bengali
digits ('a০b১c২') and combine the terms inside one pair of angle
brackets, I see a match of "Latin non-digits" adjacent to "non-Latin
digits". This says to me that the :Script<Latin> filter does NOT
distribute over the two different digit requirements, "-[\d]" and
"+[\d]". Moreover, I don't know HOW to get it to distribute over the two
digit requirements, even if I "expand" the regex into two character
classes:

    > say $/ if 'a০b১c২' ~~ / <:Script<Latin>-[\d]+[\d]>+ /;
    「a০b১c২」
    > say $/ if 'a০b১c২' ~~ / <:Script<Latin>-[\d]> <:Script<Latin>+[\d]> /;
    「a০」

Anyway Joseph, I like your point about using a conjunction and/or
adverb to do a reverse match. I just:

1) wanted to expand the conversation to Unicode scripts/blocks, and
2) posted the contrived ("interspersed") examples above in the hope
   that someone will reply on the list to explain what I'm doing wrong
   (because I don't know how to subset the second half of the regex to
   return only **Latin digits**).

Best Regards, Bill.
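
P.S. A possible lead on the last question, offered as a guess rather than an answer. As far as I know, the ASCII digits 0-9 carry the Unicode property Script=Common rather than Script=Latin, so :Script<Latin> on its own would never select a digit, and "+[\d]" then adds back every digit in every script. The sketch below spells "Latin digits" as the explicit range 0..9 instead; the variable names are made up for illustration.

```raku
# Assumption: ASCII digits are Script=Common, not Script=Latin, so
# "Latin digits" are written here as the explicit range 0..9.
my $class = rx/ <:Script<Latin> + [0..9]>+ /;

my $mixed = 'a০b১c২';   # Latin letters interleaved with Bengali digits
my $ascii = 'a0b1c2';   # Latin letters interleaved with ASCII digits

# Print the first match found in each string.
say ~($mixed ~~ $class);
say ~($ascii ~~ $class);
```

If the assumption holds, the first match should stop at the first Bengali digit, while the second should cover the whole string.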