Re: [Assp-test] RegEx Backreferences - the basics

K Post Fri, 05 Nov 2021 12:05:33 -0700

Now you've taken up the entirety of any free time I would have had this
coming weekend so I can fully dissect your (I'm sure vastly) improved regex!
Can't wait to learn from it, especially how you're using the negative
lookahead.


I'm already learning from it. I didn't know you could do defines at all,
let alone in an ASSP config file.  That makes it SOOOO much more readable
and easier to write.

My questions at this point are more specific to email header matching and
ASSP.

1) You never have \r as an option as far as I can tell.  In my testing of
my original regex, I found a sample email that seemingly wouldn't match my
expression unless the regex looked for \r?\n as the line ending.  If I just
had \n, no match.  It was only one email, but still.  Does that sound
possible? Might your sample need to have the optional \r added?
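
A minimal sketch of the effect described in 1), written with Python's re module rather than Perl, so it only approximates what ASSP does. The header lines and the address are invented for illustration. It assumes a pattern that pins the TLD to the end of the line, as my original regex did; a pattern that lets . run up to the \n would not be affected in the same way.

```python
import re

# Hypothetical header lines; the address is invented for illustration.
lf_header = "to: <acme@zack.example.org>\n"
crlf_header = "to: <acme@zack.example.org>\r\n"

# TLD, optional ">", then a bare \n.
lf_only = re.compile(r"to:[^\n]*?@(?:[a-z\d\-]+\.)+[a-z]{2,6}>?\n", re.I)
# Same pattern, but allowing an optional \r before the \n.
crlf_ok = re.compile(r"to:[^\n]*?@(?:[a-z\d\-]+\.)+[a-z]{2,6}>?\r?\n", re.I)

results = {
    "lf_only on LF": bool(lf_only.search(lf_header)),
    "lf_only on CRLF": bool(lf_only.search(crlf_header)),
    "crlf_ok on LF": bool(crlf_ok.search(lf_header)),
    "crlf_ok on CRLF": bool(crlf_ok.search(crlf_header)),
}
print(results)
```

Only the "lf_only on CRLF" case fails here, which would be consistent with a single message that arrived with \r\n line endings.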

2) Wouldn't what you have after the TLD, .+?\n, make your expression match
"fake" instead of "real" in
FROM: "fakesen...@some.fake.com" <realsender@the.*real*.com>
My regex required that the TLD be immediately followed by an optional >,
an optional \r, and a \n so that we can be sure that it's at the end of the
line.  I feel like editing

(?(DEFINE)(?<TLD>[a-z]{2,6}))

to

(?(DEFINE)(?<TLD>[a-z]{2,6}\>?\r?\n))

would give more accuracy.  Is it necessary? Is there a scenario where the to
or from line wouldn't end with that?
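
A rough check of the concern in 2), again in Python's re rather than Perl, with named groups spelled (?P<name>...). The from: line and both addresses are hypothetical. The group name sld and the two pattern variables are my own labels, not anything from ASSP.

```python
import re

# Hypothetical from: line with a display address and a real address.
line = 'from: "display@some.fake.com" <sender@the.real.com>\n'

# TLD not tied to the end of the line.
unanchored = re.compile(
    r"from:[^\n]*?@(?:[a-z\d\-]+\.)*?(?P<sld>[a-z\d\-]+)\.[a-z]{2,6}",
    re.I,
)
# TLD must be followed by an optional ">", optional \r, and \n.
anchored = re.compile(
    r"from:[^\n]*?@(?:[a-z\d\-]+\.)*?(?P<sld>[a-z\d\-]+)\.[a-z]{2,6}>?\r?\n",
    re.I,
)

unanchored_sld = unanchored.search(line).group("sld")
anchored_sld = anchored.search(line).group("sld")
print(unanchored_sld, anchored_sld)  # some real
```

In this sketch the unanchored form settles on the quoted display address (it captures "some", treating "fake" as the TLD), while the anchored form is forced on to the address that actually ends the line.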

3) Are you saying that the next version supporting "line continuation" is
the equivalent of the /s (DOTALL) functionality, or do you just mean that
we can continue a regex on the next line for readability?

3a) If by line continuation you mean DOTALL, that means that your .+?
would also match newlines, right?  It would also match *blank* lines, which
would be invalid, but still... We wouldn't want to match
to: whate...@domain.com
another: line
         <--- blank line, just a \n
from: somet...@whatever.com

3b) If line continuation means allowing the config file to continue onto
the next line, I don't think your regex accounts for other header lines
between the to and from (or from and to)?   Should a (.+?\n)*? part be added
just before the negative lookahead to match:

to: whate...@domain.com
someother: header
more: headers
from: e...@else.com

or is there something else going on that accounts for those lines in
between?
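
To make 3b) concrete, here is a small sketch in Python's re (not Perl, so backreferences are written (?P=name)). The headers, addresses, and the helper name build are all invented for illustration. It compares a to-then-from pattern with and without a lazy "skip whole non-blank lines" group in between.

```python
import re

# Hypothetical headers; every name and address here is invented.
headers = (
    "to: <acme@zack.example.org>\n"
    "someother: header\n"
    "more: headers\n"
    "from: <pat@mail.acme.org>\n"
)
# Same headers, but with a blank line (just \n) before the from: line.
with_blank = headers.replace("from:", "\nfrom:")

def build(between):
    """Compile a to-then-from pattern with `between` placed between the lines."""
    return re.compile(
        r"(?:^|\n)to:[^\n]*?(?P<user>[a-z\d\-]+)@[^\n]*\n"
        + between
        + r"from:[^\n]*?@(?:[a-z\d\-]+\.)*?(?P=user)\.[a-z]{2,6}>?\r?\n",
        re.I,
    )

adjacent_only = build("")
bridged = build(r"(?:.+\r?\n)*?")  # lazily skip whole non-blank lines

adjacent_hit = adjacent_only.search(headers)
bridged_hit = bridged.search(headers)
blank_hit = bridged.search(with_blank)
print(adjacent_hit, bridged_hit.group("user"), blank_hit)
```

Without DOTALL, the bridging group cannot step over a line that is only \n, so the blank-line case from 3a) stays unmatched in this sketch. Whether ASSP applies DOTALL to the whole expression is a separate question that this example does not answer.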



I clearly have a lot to learn about the negative lookahead, and I'm eager
to work through it on my own, but with what I understand so far, am I
correct to say that

\n(?!\k<tag>)(?:to:(?&ANYCHR)\k<FromMatch>\@(?:(?&HOSTorNAME)\.)+(?&TLD)|from:(?&ANYCHR)\@(?:(?&HOSTorNAME)\.)*?\k<ToMatch>\.(?&TLD))

starts by matching \n (the newline ending the previous line) and then acts
like if/then/else logic?  Simplified, with comments:

\n [last character is newline]
if [TAG] is not next  (?!\k<tag>)
     then

look for a line starting with To that has the 2nd level domain name in the
user part

(?:to:(?&ANYCHR)\k<FromMatch>\@(?:(?&HOSTorNAME)\.)+(?&TLD)

| else

look for a line starting with from that has the 2nd level domain name in
the hostname part
from:(?&ANYCHR)\@(?:(?&HOSTorNAME)\.)*?\k<ToMatch>\.(?&TLD)

)  // end if


Tag will either be To or From depending on which it was set to earlier.
Yes?  If so, this is such powerful logic, so simple, yet clearly only comes
with quite a lot of regex experience!!
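
A stripped-down sketch of how I understand the tag trick, in Python's re (so \k<tag> becomes (?P=tag)). The header text, the group name second, and the helper order are mine, purely for illustration. One thing it suggests: the lookahead works as a guard that rules out a second line of the same kind, and the alternation after it is an ordinary "or", so it may not be a true if/else.

```python
import re

guard = re.compile(
    r"(?:^|\n)(?P<tag>to|from):[^\n]*\n"  # first header line; remember which
    r"(?:[^\n]+\n)*?"                      # lazily skip other non-blank lines
    r"(?!(?P=tag))"                        # next line must not repeat the tag
    r"(?P<second>to|from):",               # so this is the other header
    re.I,
)

def order(headers):
    """Return (first, second) header names, or None when there is no match."""
    hit = guard.search(headers)
    return (hit.group("tag"), hit.group("second")) if hit else None

to_first = order("to: a\nsubject: x\nfrom: b\n")
from_first = order("from: b\nsubject: x\nto: a\n")
same_twice = order("to: a\nto: b\n")
print(to_first, from_first, same_twice)
```

Either header order matches with a single expression, while two lines of the same kind do not.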



And thanks for the upcoming bug fix too.

Have a wonderful weekend.  Thanks for again burning brainpower this week
for me and for sharing all of your thoughts.
Ken


On Fri, Nov 5, 2021 at 9:09 AM Thomas Eckardt <thomas.ecka...@thockar.com>
wrote:

> >*It seems like using <<< >>> to turn off regex optimization might break
> the 2nd parameter from being recognized.*
>
> That's true. It is fixed in the next release.
>
>
> regex:
>
> something like this is easier to read after some time; it is much less
> greedy, faster and self-explaining:
>
>
> (?(DEFINE)(?<TLD>[a-z]{2,6}))(?(DEFINE)(?<HOSTorNAME>[a-z\d\-]+))(?(DEFINE)(?<ANYCHR>[^\n]*?))
>
> (?:^|\n)(?:(?<tag>to):(?&ANYCHR)(?<ToMatch>(?&HOSTorNAME))\@(?:(?&HOSTorNAME)\.)+(?&TLD)|(?<tag>from):(?&ANYCHR)\@(?:(?&HOSTorNAME)\.)*?(?<FromMatch>(?&HOSTorNAME))\.(?&TLD))
>
> .+?\n(?!\k<tag>)(?:to:(?&ANYCHR)\k<FromMatch>\@(?:(?&HOSTorNAME)\.)+(?&TLD)|from:(?&ANYCHR)\@(?:(?&HOSTorNAME)\.)*?\k<ToMatch>\.(?&TLD))
>
>
> (all in one line)
> the next release supports line continuation in files
>
> The regex is as simple as it can be, except one small trick - the negative
> lookahead (?!\k<tag>). So, yes - looking around the string without moving
> the position around makes some things easier.
>
>
> This thread should be stopped here. This is a test list for development
> versions - it is not a blog and it is not a place to learn perl regular
> expressions.
>
> Thomas
>
>
>
>
>
> From:        "K Post" <nntp.p...@gmail.com>
> To:        "ASSP development mailing list" <
> assp-test@lists.sourceforge.net>
> Date:        05.11.2021 00:20
> Subject:        Re: [Assp-test] RegEx Backreferences - the basics
> ------------------------------
>
>
>
> First of all, to say that the problem is sitting in front of the monitor
> is *insulting my keyboard*.  He sits between me and the monitor and has
> done nothing wrong.  The problem is actually my favorite phrase: PEBKAC.
> "Problem exists between keyboard and chair."
>
> I believe that I have this working, using <<<  >>> to turn off the
> optimizer for the single line.  But it can surely use some review. More on
> that later.
>
>
> I've re-read the PCRE documentation at
> https://www.pcre.org/original/doc/html/index.html, focusing for now on
> the pcrepattern info.  I see at least some of the errors in my test script.
> I still don't know why the \1 isn't working, but I've moved on to using
> named references (an ability previously unknown to me).
>
> You're going to laugh hysterically at me when you realize what I'm really
> trying to accomplish and how much more complicated it is than the silly
> test. But I think I'm getting close.  This has been a very good lesson, and
> it will hopefully be beneficial to others by example.
>
> Ultimately, I need to match (and give a negative score for the match to
> help let that message get through) any email where the username portion of
> the TO field matches the second level domain name (1 to the left of the
> TLD) of the FROM address.  That's clearly a far more complicated query.
>
>
> *Why in the world would I want to match such a thing?*
> They've had me set up a handful of wildcard subdomains.  Essentially any
> message sent to <anything>@zackary.ourcharity.org, for example, will
> actually go to the zack...@ourcharity.org mailbox.  That works
> fine.  There are about 20 people with their own subdomain.
>
> For tracking purposes for one of our projects, Zackary and other staff are
> having people who are part of organizations that we help email him by using
> theirsecondleveldomainn...@zackary.ourcharity.org.  Those messages
> usually would come from outside.per...@theirsecondleveldomainname.org or
> outside.per...@subdomain.theirsecondleveldomainname.org, but could also
> come from a person's personal email like Gmail/Outlook.
>
> The program people can then search/sort by TO and gather all of the
> messages related to that outside org.  Fine, it's a weird way of working,
> but it's what the powers that be decided on.  This helps with reporting and
> helps us to get the funding to continue helping these people.  (Every once
> in a while I'm reminded that the IT work I do actually is for a good cause
> and really does help people, despite my frustrations with a silly low IT
> budget.)
>
> The problem is that I'll never know all of the organizations that they're
> giving out these addresses to, and it can be hundreds of different inbound
> addresses, so there's nothing I can do in advance.   They've been doing
> this for a while, and we're seeing outside compromised email accounts
> causing these organization-unique addresses to get on spam lists.
>
> SO, ultimately what I want to do is negatively score any email where the
> userpart of the to address matches the second level of the from address's
> domain name.  That won't help the legitimate messages from people's
> personal Gmail get through, but it should help us ensure that messages from
> each org that are sent to a matching to: address get through, even if what
> they're talking about might fail Bayesian tests.
>
>
> Before I show what I came up with, I've discovered an oddity.  In
> BombHeaderRe, testing the following:
> (MatchThisWord)=>-19
> will show a negative 19 score when a match is made in the analyze GUI, as
> expected
>
> However, doing any of these
> <<<(MatchThisWord)>>>=>-19
> <<<MatchThisWord>>>=>-19
> ~<<<MatchThisWord>>>~=>-19   <<-- which I tried because I originally had
> an or in there
> will show me a score of positive 25, the bomb valence I have set.
>
> *It seems like using <<< >>> to turn off regex optimization might break the
> 2nd parameter from being recognized.   Am I doing something very wrong or
> is this a bug?*
>
>
>
> And now for the regex I've come up with as a new starting point:
>
>
> *DISCLAIMER TO ANYONE READING THIS IN THE FUTURE* - while this seems to
> work for me, it's surely at least imperfect if not horribly inefficient or
> even wrong or broken!!!
>
> Here's the regex I've built.  It seems to work in ASSP and to test properly
> at https://regextr.com with PCRE selected as the
> engine and the case insensitive flag ticked
>
>
> (?:^|\r?\n)(?:to:(?:.*?[\s\<])*?(?<TOFirstMatch>[a-z\d\-]+)\@(?:[a-z\d\-]+\.)+[a-z]{2,6}\>?\r?\n(?:.+\r?\n)*?from:.*?\@(?:[a-z\d\-]+\.)*?\g{TOFirstMatch}\.[a-z]{2,6}\>?\r?\n|from:.*?\@(?:[a-z\d\-]+\.)*?(?<FROMFirstMatch>[a-z\d\-]+)\.[a-z]{2,6}\>?\r?\n(?:.+\r?\n)*?to:(?:.*?[\s\<])*?\g{FROMFirstMatch}\@(?:[a-z\d\-]+\.)+[a-z]{2,6}\>?\r?\n)
>
> This appears to match:
>
> x-whatever: bla bla
> to: "my name" <*ThirdParty2Level-Domain*@OurCharity.org>
> subject: testing
> from: "them" <whatever.e...@bla.bla.*ThirdParty2Level-Domain*.them>
> asdf
> and with the from appearing before the to.
>
> I do not know of a way to make the order of to and from insignificant, so
> I've done an "or" in between the first part of the regex which looks for to
> then from and the second part which looks for from then to.   *Would it
> be more efficient for ASSP to have 2 separate lines, one for to first the
> other for from first?*
>
> Here's my thinking and explanation of my understanding of the regex that I
> wrote. I am VERY interested in corrections and suggestions for improvement,
> especially relating to efficiency (and obviously flawed logic and/or cases
> where what I've done would or wouldn't match as I'm thinking).  Guidance
> here won't only help me perfect this specific regex for ASSP use, but will
> hopefully help others looking for other more complex than typical regex
> help with ASSP.  I'll definitely be limiting the to domains to those that
> we use here to speed this up a bit, but I kept it more generic here.
>
> I also tried to see a way where lookaheads might help, but I'm not quite
> there yet....  Would they be helpful here?
>
> Starting from the beginning:
>
> (?:^|\r?\n)
> start with either the start of the string or a \r?\n   - sometimes there's
> a \r but always a \n     Is \r?\n recommended?  Is there a better way?
>
> Then we're going to do 2 big OR's,  first looking for to then from, then
> from then to.
> (?: starts this big or, with the ?: indicating that it's a non-capturing
> group
>
> The TO then From part is this:
>
> to:(?:.*?[\s\<])*?(?<TOFirstMatch>[a-z\d\-]+)\@(?:[a-z\d\-]+\.)+[a-z]{2,6}\>?\r?\n(?:.+\r?\n)*?from:.*?\@(?:[a-z\d\-]+\.)*?\g{TOFirstMatch}\.[a-z]{2,6}\>?\r?\n
>
> broken out
>
> to:  Find to: immediately after the previously found newline (or start of
> string)
>
> (?:.*?[\s\<])*?
> non-capturing match for any characters repeated as long as they end with a
> space or <
>
> now we should be at the point where the username starts
>
> (?<TOFirstMatch>[a-z\d\-]+)\@
> get a named match called TOFirstMatch for any combination of a-z, digits
> and dashes that ends in the now escaped @
>
> (?:[a-z\d\-]+\.)+[a-z]{2,6}\>?\r?\n
> then just make sure that what follows the @ is a-z, digits and dashes, each
> part ending in a . with a 2-6 letter TLD ending the hostname, followed by an
> optional >, an optional \r, and then \n to end the line
>
>
>
> (?:.+\r?\n)*?
> then skip the following lines which aren't blank until we reach a line
> starting with from:
>
> from:.*?
> line starts with from: followed by any characters
>
> \@(?:[a-z\d\-]+\.)*?
> find @valid.sub. part of from address
>
> \g{TOFirstMatch}
> use the \g{} syntax to match the named backreference
>
> \.[a-z]{2,6}\>?\r?\n
> immediately followed by .tld 2-6 characters in length, an optional >, an
> optional \r and a \n
>
> |
> then an OR
>
> and we do the whole thing again but with From First
>
> from:.*?\@(?:[a-z\d\-]+\.)*?(?<FROMFirstMatch>[a-z\d\-]+)\.[a-z]{2,6}\>?\r?\n(?:.+\r?\n)*?to:(?:.*?[\s\<])*?\g{FROMFirstMatch}\@(?:[a-z\d\-]+\.)+[a-z]{2,6}\>?\r?\n)
>
> from:.*?
> from: followed by anything until we hit
>
> \@(?:[a-z\d\-]+\.)*?
> an @ sign followed by any number of hostname parts, each followed by a .
>
> (?<FROMFirstMatch>[a-z\d\-]+)
> find the second level domain name and call it FROMFirstMatch
>
> \.[a-z]{2,6}\>?\r?\n
> followed by a .tld of 2 to 6 characters, an optional closing >, an
> optional \r and a \n
>
> (?:.+\r?\n)*?
> move past non blank lines until we hit
>
> to:(?:.*?[\s\<])*?
> to: optionally followed by whatever characters ending in space or <
>
>
> \g{FROMFirstMatch}\@
> now look for the second level domain match from the from: line immediately
> followed by an @ sign
>
> (?:[a-z\d\-]+\.)+
> then hostnames separated by dots, at least 1
>
> [a-z]{2,6}\>?\r?\n
> followed by a 2-6 character tld, an optional >, an optional \r and a \n
>
> )
> closing out the or between the TOFirstMatch and FROMFirstMatch sections.
>
>
> Whew.
> :
>
>
> On Thu, Nov 4, 2021 at 4:53 AM Thomas Eckardt
> <thomas.ecka...@thockar.com> wrote:
> forgot to say:
>
> if assp requires to capture the match for a regex, the code would be for
> example
>
> $string =~ /($testReRE)/
> $match = $1;
>
> so - at runtime the regex is
>
> ((?^u:(?is:(?:^|\n\r).*(searchstring).*@.*\1.*)))
>
> IMHO you need to use named capture groups or \g or (?|
>
> Thomas
>
>
>
> From:        "Thomas Eckardt" <thomas.ecka...@thockar.com>
> To:        "ASSP development mailing list"
> <assp-test@lists.sourceforge.net>
> Date:        04.11.2021 09:22
> Subject:        Re: [Assp-test] RegEx Backreferences - the basics
> ------------------------------
>
>
>
> to make backreferences working, regex optimization must be switched off
> for the complete regex -> tested -> worked
>
> >I've seen posts here indicating that backreferencing matches is possible
> with an unoptimized expression.
>
> so - the problem is sitting in front of the monitor :):)
>
> m/(?is:(?:^|\n\r).*(?:searchstring).*@.*\1 <-- HERE .*)/
>
> optimized - default is : 'no extra group capturing is allowed'
>
> >I've got to be missing something incredibly obvious.
>
> assp-do-not-optimize-regex
>
> >  (?:^|\n\r).*(searchstring).*@.*\1.*
>
> assp makes it:
>
> (?is:(?:^|\n\r).*(searchstring).*@.*\1.*)
>
> think about your regex - read it from left to right as 'perl regex engine'
> - what will happen?
> beside the other mistakes the @ should be escaped  \@ , because an ARRAY
> @. may exist
>
> >Regex101.com seems to confirm that this works.
>
> does not check perl pcre
>
> and if I read the explanation there, I'm sure it will not work like you
> expect
>
>
> Thomas
>
>
>
> From:        "K Post" <nntp.p...@gmail.com>
> To:        "ASSP development mailing list"
> <assp-test@lists.sourceforge.net>
> Date:        04.11.2021 02:29
> Subject:        [Assp-test] RegEx Backreferences - the basics
> ------------------------------
>
>
>
> I've got nothing in my TestRe file except for a single line:
>
> ~<<<(?:^|\n\r).*(searchstring).*@.*\1.*>>>~
>
> The idea is to log any time there's a line that includes "searchstring" on
> the right and left of an @.  This is just a very rudimentary test because
> backreferences seem to error for me.  I would expect this to match
> searchstring@searchstring
> something else searchstring more @ whatever searchstring bla
> If "searchstring" is to the right and left of an @ sign, it should match.
> Regex101.com seems to confirm that this works.  Like I said, super basic.
>
> However, if I enter ~<<<(?:^|\n\r).*(searchstring).*@.*\1.*>>>~ as the
> only line in TestRe file, I get a warning in the log:
>
> - Reference to nonexistent group in regex; marked by <-- HERE in
> m/(?is:(?:^|\n\r).*(?:searchstring).*@.*\1 <-- HERE .*)/
> - try using unoptimized regex
>
> To my understanding, the <<< >>> surround should turn off regex
> optimization for that line, which enables backreferencing (\1) to work, and
> the ~ is required because there's an or in there.   Shouldn't the \1
> reference (searchstring)?  I don't understand why assp thinks that \1 is a
> reference to a non-existent group.
>
> I also tried removing the <<< >>> and adding assp-do-not-optimize to the
> top of the TestRe file.  No difference.    No matter how simple I make the
> regex, even (.*)@\1,  it still complains about the invalid backreference.
>
>
> I've got to be missing something incredibly obvious.  I've read through
> the regex doc in docs, but that doesn't talk about backreferencing in ASSP
> and I can't find anything in the GUI that makes mention. I've seen posts
> here indicating that backreferencing matches is possible with an
> unoptimized expression.
>
> A shove in the right direction would be greatly appreciated.
> _______________________________________________
> Assp-test mailing list
> Assp-test@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/assp-test
>
>
>
>
> DISCLAIMER:
> *******************************************************
> This email and any files transmitted with it may be confidential, legally
> privileged and protected in law and are intended solely for the use of the
> individual to whom it is addressed.
> This email was multiple times scanned for viruses. There should be no
> known virus in this email!
> *******************************************************

_______________________________________________
Assp-test mailing list
Assp-test@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/assp-test
