Re: [Assp-test] RegEx Backreferences - the basics

Thomas Eckardt Fri, 05 Nov 2021 06:09:33 -0700

>It seems like using <<< >>> to turn off regex optimization might keep the 
2nd parameter from being recognized.

That's true. It is fixed in the next release.

regex:

Something like this is easier to read after some time has passed. It is 
much less greedy, faster and self-explanatory:

(?(DEFINE)(?<TLD>[a-z]{2,6}))(?(DEFINE)(?<HOSTorNAME>[a-z\d\-]+))(?(DEFINE)(?<ANYCHR>[^\n]*?))
(?:^|\n)(?:(?<tag>to):(?&ANYCHR)(?<ToMatch>(?&HOSTorNAME))\@(?:(?&HOSTorNAME)\.)+(?&TLD)|(?<tag>from):(?&ANYCHR)\@(?:(?&HOSTorNAME)\.)*?(?<FromMatch>(?&HOSTorNAME))\.(?&TLD))
.+?\n(?!\k<tag>)(?:to:(?&ANYCHR)\k<FromMatch>\@(?:(?&HOSTorNAME)\.)+(?&TLD)|from:(?&ANYCHR)\@(?:(?&HOSTorNAME)\.)*?\k<ToMatch>\.(?&TLD))

(All on one line.)
The next release supports line continuation in files.

The regex is as simple as it can be, except for one small trick: the 
negative lookahead (?!\k<tag>). So yes, looking around the string without 
moving the position makes some things easier.
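A minimal Perl sketch for trying the pattern above outside ASSP. The sample headers and addresses are hypothetical, and the i and s flags are an assumption based on the (?is:...) wrapper that the ASSP log shows further down in this thread. It is only a quick check, not a statement about how ASSP itself applies the pattern:

```perl
use strict;
use warnings;

# The three pattern lines from the message, joined into one line.
# Single quotes keep every backslash (including \@) exactly as written.
my $pattern = join '',
    '(?(DEFINE)(?<TLD>[a-z]{2,6}))(?(DEFINE)(?<HOSTorNAME>[a-z\d\-]+))(?(DEFINE)(?<ANYCHR>[^\n]*?))',
    '(?:^|\n)(?:(?<tag>to):(?&ANYCHR)(?<ToMatch>(?&HOSTorNAME))\@(?:(?&HOSTorNAME)\.)+(?&TLD)|(?<tag>from):(?&ANYCHR)\@(?:(?&HOSTorNAME)\.)*?(?<FromMatch>(?&HOSTorNAME))\.(?&TLD))',
    '.+?\n(?!\k<tag>)(?:to:(?&ANYCHR)\k<FromMatch>\@(?:(?&HOSTorNAME)\.)+(?&TLD)|from:(?&ANYCHR)\@(?:(?&HOSTorNAME)\.)*?\k<ToMatch>\.(?&TLD))';

# Assumption: case-insensitive, and "." also matches newlines.
my $re = qr/$pattern/is;

# Hypothetical headers: to: user part equals the from: second-level domain.
my $to_first = <<'EOT';
x-whatever: bla bla
to: "my name" <acme@zackary.ourcharity.org>
subject: testing
from: "them" <someone@mail.acme.org>
EOT

# Same addresses, from: before to:.
my $from_first = <<'EOT';
from: "them" <someone@mail.acme.org>
subject: testing
to: "my name" <acme@zackary.ourcharity.org>
EOT

# Hypothetical headers where the two parts have nothing in common.
my $unrelated = <<'EOT';
to: <zack@ourcharity.org>
from: <someone@example.org>
EOT

my %matched = (
    to_first   => ($to_first   =~ $re ? 1 : 0),
    from_first => ($from_first =~ $re ? 1 : 0),
    unrelated  => ($unrelated  =~ $re ? 1 : 0),
);

printf "%-10s %s\n", $_, ($matched{$_} ? 'match' : 'no match')
    for sort keys %matched;
```

With these samples the first two header blocks should match and the third should not. The (?!\k<tag>) lookahead only checks that the second header found is not the same kind as the first one; it consumes nothing.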

This thread should stop here. This is a test list for development 
versions; it is not a blog and it is not a place to learn Perl regular 
expressions.

Thomas

From:    "K Post" <nntp.p...@gmail.com>
To:      "ASSP development mailing list" <assp-test@lists.sourceforge.net>
Date:    05.11.2021 00:20
Subject: Re: [Assp-test] RegEx Backreferences - the basics

First of all, to say that the problem is sitting in front of the monitor 
is insulting my keyboard.  He sits between me and the monitor and has done 
nothing wrong.  The problem is actually my favorite phrase: PEBKAC.  The 
"Problem exists between keyboard and chair."

I believe that I have this working, using <<<  >>> to turn off the 
optimizer for the single line.  But it can surely use some review. More on 
that later.

I've re-read the PCRE documentation at 
https://www.pcre.org/original/doc/html/index.html, focusing for now on the 
pcrepattern info.  I see at least some of the errors in my test script. I 
still don't know why the \1 isn't working, but I've moved on to using 
named references (an ability previously unknown to me).

You're going to laugh hysterically at me when you realize what I'm really 
trying to accomplish and how much more complicated it is than the silly 
test. But I think I'm getting close.  This has been a very good lesson, 
and it will hopefully be beneficial to others by example.

Ultimately, I need to match (and give a negative score for the match to 
help let that message get through) any email where the username portion of 
the TO field matches the second level domain name (1 to the left of the 
TLD) of the FROM address.  That's clearly a far more complicated query.

Why in the world would I want to match such a thing?
They've had me set up a handful of wildcard subdomains.  Essentially any 
message sent to <anything>@zackary.ourcharity.org for example will 
actually go to the zack...@ourcharity.org mailbox.  That works fine.  
There's about 20 people with their own subdomain.

For tracking purposes for one of our projects, Zackary and other staff are 
having people who are part of organizations that we help email him by 
using theirsecondleveldomainn...@zackary.ourcharity.org.   Those messages 
usually would come from outside.per...@theirsecondleveldomainname.org or 
outside.per...@subdomain.theirsecondleveldomainname.org, but could also 
come from a person's personal email like gmail/Outlook.

The program people can then search/sort by TO and gather all of the 
messages related to that outside org.  Fine, it's a weird way of working, 
but it's what the powers that be decided on.  This helps with reporting 
and helps us to get the funding to continue helping these people.  (Every 
once in a while I'm reminded that the IT work I do actually is for a good 
cause and really does help people, despite my frustrations with a silly low 
IT budget.)

The problem is that I'll never know all of the organizations that they're 
giving out these addresses to, and it can be hundreds of different inbound 
addresses, so there's nothing I can do in advance.   They've been doing 
this for a while, and we're seeing outside compromised email accounts 
causing these organization-unique addresses to get on spam lists.

So, ultimately what I want to do is negatively score any email where the 
user part of the to address matches the second level of the from domain 
name.  That won't help legitimate messages from people's personal gmail 
get through, but it should help us ensure that messages from each org 
that are sent to a matching to: address get through, even if what they're 
talking about might fail Bayesian tests. 

Before I show what I came up with, I've discovered an oddity.   In 
BombHeaderRe, testing the following:
(MatchThisWord)=>-19
will show a negative 19 score when a match is made in the analyze GUI, as 
expected

However, doing any of these
<<<(MatchThisWord)>>>=>-19
<<<MatchThisWord>>>=>-19
~<<<MatchThisWord>>>~=>-19   <<-- which I tried because I originally had 
an or in there
will show me a score of positive 25, the bomb valence I have set.  

It seems like using <<< >>> to turn off regex optimization might keep the 
2nd parameter from being recognized.   Am I doing something very wrong or 
is this a bug?

And now for the regex I've come up with as a new starting point:

DISCLAIMER TO ANYONE READING THIS IN THE FUTURE - while this seems to work 
for me, it's surely at least imperfect if not horribly inefficient or even 
wrong or broken!!!

Here's the regex I've built.  It seems to work in ASSP and tests properly 
at https://regextr.com with PCRE selected as the engine and the case 
insensitive flag ticked.

(?:^|\r?\n)(?:to:(?:.*?[\s\<])*?(?<TOFirstMatch>[a-z\d\-]+)\@(?:[a-z\d\-]+\.)+[a-z]{2,6}\>?\r?\n(?:.+\r?\n)*?from:.*?\@(?:[a-z\d\-]+\.)*?\g{TOFirstMatch}\.[a-z]{2,6}\>?\r?\n|from:.*?\@(?:[a-z\d\-]+\.)*?(?<FROMFirstMatch>[a-z\d\-]+)\.[a-z]{2,6}\>?\r?\n(?:.+\r?\n)*?to:(?:.*?[\s\<])*?\g{FROMFirstMatch}\@(?:[a-z\d\-]+\.)+[a-z]{2,6}\>?\r?\n)

This appears to match:

x-whatever: bla bla
to: "my name" <thirdparty2level-dom...@ourcharity.org>
subject: testing
from: "them" <whatever.e...@bla.bla.thirdparty2level-domain.them>
asdf
and with the from appearing before the to.
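A small Perl sketch that checks the claim above against the regex as posted. The addresses are hypothetical stand-ins for the shortened ones in the sample, and only the i flag is set, which is an assumption that mirrors the regextr test described earlier (ASSP's own (?is:...) wrapper would also let "." match newlines):

```perl
use strict;
use warnings;

# The posted regex, split at its main parts and joined back unchanged.
my $pattern = join '',
    '(?:^|\r?\n)(?:',
    # to: first, then from:
    'to:(?:.*?[\s\<])*?(?<TOFirstMatch>[a-z\d\-]+)\@(?:[a-z\d\-]+\.)+[a-z]{2,6}\>?\r?\n',
    '(?:.+\r?\n)*?',
    'from:.*?\@(?:[a-z\d\-]+\.)*?\g{TOFirstMatch}\.[a-z]{2,6}\>?\r?\n',
    '|',
    # from: first, then to:
    'from:.*?\@(?:[a-z\d\-]+\.)*?(?<FROMFirstMatch>[a-z\d\-]+)\.[a-z]{2,6}\>?\r?\n',
    '(?:.+\r?\n)*?',
    'to:(?:.*?[\s\<])*?\g{FROMFirstMatch}\@(?:[a-z\d\-]+\.)+[a-z]{2,6}\>?\r?\n',
    ')';

# Assumption: case-insensitive only, "." does not match newlines.
my $re = qr/$pattern/i;

# Hypothetical headers in both orders, plus one unrelated pair.
my $to_then_from = <<'EOT';
x-whatever: bla bla
to: "my name" <acme@zackary.ourcharity.org>
subject: testing
from: "them" <someone@mail.acme.org>
asdf
EOT

my $from_then_to = <<'EOT';
from: "them" <someone@mail.acme.org>
subject: testing
to: "my name" <acme@zackary.ourcharity.org>
EOT

my $no_relation = <<'EOT';
to: <zack@ourcharity.org>
from: <someone@example.org>
EOT

my %hit = (
    to_then_from => ($to_then_from =~ $re ? 1 : 0),
    from_then_to => ($from_then_to =~ $re ? 1 : 0),
    no_relation  => ($no_relation  =~ $re ? 1 : 0),
);

printf "%-13s %s\n", $_, ($hit{$_} ? 'match' : 'no match')
    for sort keys %hit;
```

Both orderings should match and the unrelated pair should not. The check says nothing about speed on long headers, where the nested lazy quantifiers may backtrack heavily.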

I do not know of a way to make the order of to and from insignificant, so 
I've done an "or" in between the first part of the regex which looks for 
to then from and the second part which looks for from then to.   Would it 
be more efficient for ASSP to have 2 separate lines, one for to first the 
other for from first?

Here's my thinking and explanation of my understanding of the regex that I 
wrote. I am VERY interested in corrections and suggestions for 
improvement, especially relating to efficiency (and obviously flawed logic 
and/or cases where what I've done would or wouldn't match as I'm 
thinking).  Guidance here won't only help me perfect this specific regex 
for ASSP use, but will hopefully help others looking for other more 
complex than typical regex help with ASSP.  I'll definitely be limiting 
the to domains to those that we use here to speed this up a bit, but I 
kept it more generic here.

I also tried to see a way where lookaheads might help, but I'm not quite 
there yet....  Would they be helpful here?

Starting from the beginning:

(?:^|\r?\n)
start with either the start of the string or a \r?\n   - sometimes there's 
a \r but always a \n     Is \r?\n recommended?  Is there a better way? 

Then we're going to do 2 big OR's,  first looking for to then from, then 
from then to.
(?: starts this big or, with the ?: indicating that it's a non-capturing 
group

The TO then From part is this:
to:(?:.*?[\s\<])*?(?<TOFirstMatch>[a-z\d\-]+)\@(?:[a-z\d\-]+\.)+[a-z]{2,6}\>?\r?\n(?:.+\r?\n)*?from:.*?\@(?:[a-z\d\-]+\.)*?\g{TOFirstMatch}\.[a-z]{2,6}\>?\r?\n

broken out

to:
find to: immediately after the previously found newline (or start of 
string)

(?:.*?[\s\<])*?
non-capturing match for any characters repeated as long as they end with a 
space or <

now we should be at the point where the username starts

(?<TOFirstMatch>[a-z\d\-]+)\@
get a named match called TOFirstMatch for any combination of a-z, digits 
and hyphens that ends at the now escaped @

(?:[a-z\d\-]+\.)+[a-z]{2,6}\>?\r?\n
then just make sure that what follows the @ is a-z, digits and dashes, each 
part ending in a . with a 2-6 letter TLD ending the hostname, followed by 
an optional > and then \n or \r\n to end the line

(?:.+\r?\n)*?
then skip following lines which aren't blank until we reach a line 
starting with from:

from:.*?
line starts with from: followed by any characters

\@(?:[a-z\d\-]+\.)*?
find @valid.sub. part of from address

\g{TOFirstMatch}
use the \g{} syntax to match the named backreference

\.[a-z]{2,6}\>?\r?\n
immediately followed by .tld 2-6 characters in length, an optional > and a 
\n or \r\n

|
then an OR

and we do the whole thing again but with From First
from:.*?\@(?:[a-z\d\-]+\.)*?(?<FROMFirstMatch>[a-z\d\-]+)\.[a-z]{2,6}\>?\r?\n(?:.+\r?\n)*?to:(?:.*?[\s\<])*?\g{FROMFirstMatch}\@(?:[a-z\d\-]+\.)+[a-z]{2,6}\>?\r?\n)

from:.*?
from: followed by anything until we hit

\@(?:[a-z\d\-]+\.)*?
an @ sign followed by any number of hostname parts, each followed by .

(?<FROMFirstMatch>[a-z\d\-]+)
find the second level domain name and call it FROMFirstMatch

\.[a-z]{2,6}\>?\r?\n
followed by a .tld of 2 to 6 characters, an optional closing > and a \n or 
\r\n

(?:.+\r?\n)*?
move past non blank lines until we hit

to:(?:.*?[\s\<])*?
to: optionally followed by whatever characters ending in space or <

\g{FROMFirstMatch}\@
now look for the second level domain match from the from: line immediately 
followed by an @ sign

(?:[a-z\d\-]+\.)+
then hostnames separated by dots, at least 1

[a-z]{2,6}\>?\r?\n
followed by a 2-6 character tld, an optional > and a \n or \r\n

)
closing out the or between the TOFirstMatch and FROMFirstMatch sections.

Whew.

On Thu, Nov 4, 2021 at 4:53 AM Thomas Eckardt <thomas.ecka...@thockar.com> 
wrote:
forgot to say: 

If ASSP needs to capture the match for a regex, the code would be, for 
example: 

$string =~ /($testReRE)/;
$match = $1;

so - at runtime the regex is 

((?^u:(?is:(?:^|\n\r).*(searchstring).*@.*\1.*))) 

IMHO you need to use named capture groups or \g or (?| 
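A short Perl sketch of the effect described above: once a pattern is wrapped in one more pair of capturing parentheses, \1 refers to the new outer group, while a relative (\g{-1}) or named (\k<word>) reference still points at the intended group. The test string and the group name word are illustrative only, and the /($p)/ wrapping merely imitates the /($testReRE)/ example; it is not ASSP's actual code:

```perl
use strict;
use warnings;

my $text = 'searchstring@searchstring';

# Three ways to say "the same word again after the @".
my %pattern = (
    numbered => '(searchstring).*\@.*\1',
    relative => '(searchstring).*\@.*\g{-1}',
    named    => '(?<word>searchstring).*\@.*\k<word>',
);

my %result;
for my $style (sort keys %pattern) {
    my $p = $pattern{$style};
    $result{$style} = {
        alone   => ($text =~ /$p/   ? 1 : 0),   # pattern used as written
        wrapped => ($text =~ /($p)/ ? 1 : 0),   # imitates /($testReRE)/
    };
    printf "%-8s alone=%d wrapped=%d\n",
        $style, $result{$style}{alone}, $result{$style}{wrapped};
}
```

All three variants match when used alone. Once wrapped, the numbered variant is expected to stop matching, because \1 now names the still-open outer group, while the relative and named variants keep working.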

Thomas 

From:        "Thomas Eckardt" <thomas.ecka...@thockar.com> 
To:          "ASSP development mailing list" <assp-test@lists.sourceforge.net> 
Date:        04.11.2021 09:22 
Subject:     Re: [Assp-test] RegEx Backreferences - the basics 

To make backreferences work, regex optimization must be switched off for 
the complete regex. I tested this and it worked. 

>I've seen posts here indicating that backreferencing matches is possible 
with an unoptimized expression. 

so - the problem is sitting in front of the monitor :):) 

m/(?is:(?:^|\n\r).*(?:searchstring).*@.*\1 <-- HERE .*)/  

Optimized (the default) means: 'no extra group capturing is allowed' 
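A minimal Perl sketch of why the log shows that error. It compares the pattern as written with the form shown in the log, where the capturing group has become (?:...). The helper compile_error is hypothetical and only compiles a pattern string at run time; how ASSP's optimizer actually rewrites groups is not shown here:

```perl
use strict;
use warnings;

# The pattern as written in the TestRe file (with the @ escaped).
my $written = '(?:^|\n\r).*(searchstring).*\@.*\1.*';

# The form from the log: the only capturing group is now non-capturing.
my $optimized = '(?:^|\n\r).*(?:searchstring).*\@.*\1.*';

# Hypothetical helper: compile a pattern string inside (?is:...) and return
# Perl's error message, or an empty string if the pattern compiled.
sub compile_error {
    my ($pattern) = @_;
    my $ok = eval { my $re = qr/(?is:$pattern)/; 1 };
    return $ok ? '' : $@;
}

my $err_written   = compile_error($written);
my $err_optimized = compile_error($optimized);

print "written:   ", ($err_written   ? $err_written   : "compiles\n");
print "optimized: ", ($err_optimized ? $err_optimized : "compiles\n");
```

The written form compiles, while the optimized form fails with "Reference to nonexistent group in regex", because \1 no longer has a group to refer to.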

>I've got to be missing something incredibly obvious. 

assp-do-not-optimize-regex

>  (?:^|\n\r).*(searchstring).*@.*\1.* 

assp makes it: 

(?is:(?:^|\n\r).*(searchstring).*@.*\1.*) 

Think about your regex: read it from left to right as the Perl regex engine 
would. What will happen? 
Besides the other mistakes, the @ should be escaped as \@, because an array 
@. may exist. 

>Regex101.com seems to confirm that this works. 

It does not test Perl's regex engine. 

And if I read the explanation there, I am sure it will not work as you 
expect. 

Thomas 

From:        "K Post" <nntp.p...@gmail.com> 
To:          "ASSP development mailing list" <assp-test@lists.sourceforge.net> 
Date:        04.11.2021 02:29 
Subject:     [Assp-test] RegEx Backreferences - the basics 

I've got nothing in my TestRe file except for a single line: 

~<<<(?:^|\n\r).*(searchstring).*@.*\1.*>>>~ 

The idea is to log any time there's a line that includes "searchstring" on 
the right and left of an @.  This is just a very rudimentary test because 
backreferences seem to error for me.  I would expect this to match 
searchstring@searchstring 
something else searchstring more @ whatever searchstring bla 
If "searchstring" is to the right and left of an @ sign, it should match.  
Regex101.com seems to confirm that this works.  Like I said, super basic. 

However, if I enter ~<<<(?:^|\n\r).*(searchstring).*@.*\1.*>>>~ as the 
only line in TestRe file, I get a warning in the log: 

- Reference to nonexistent group in regex; marked by <-- HERE in 
m/(?is:(?:^|\n\r).*(?:searchstring).*@.*\1 <-- HERE .*)/ 
- try using unoptimized regex 

To my understanding, the <<< >>> surround should turn off regex 
optimization for that line, which enables backreferencing (\1) to work and 
the ~ is required because there's an or in there.   Shouldn't the \1 
reference (searchstring) ?  I don't understand why assp thinks that \1 is 
a reference to a non-existent group. 

I also tried removing the <<< >>> and adding assp-do-not-optimize to the 
top of the TestRe file.  No difference.    No matter how simple I make the 
regex, even (.*)@\1,  it still complains about the invalid backreference. 

I've got to be missing something incredibly obvious.  I've read through 
the regex doc in docs, but that doesn't talk about backreferencing in ASSP 
and I can't find anything in the GUI that makes mention. I've seen posts 
here indicating that backreferencing matches is possible with an 
unoptimized expression. 

A shove in the right direction would be greatly appreciated. 
_______________________________________________
Assp-test mailing list
Assp-test@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/assp-test

DISCLAIMER:
*******************************************************
This email and any files transmitted with it may be confidential, legally 
privileged and protected in law and are intended solely for the use of the 

individual to whom it is addressed.
This email was multiple times scanned for viruses. There should be no 
known virus in this email!
*******************************************************