On Sun, 2 Feb 2025, Jimmy wrote:
dbg: uri: Not match:
text:\x{E0}\x{B8}\x{95}\x{E0}\x{B9}\x{88}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{B2}\x{E0}\x{B8}\x{A2}\x{E0}\x{B8}\x{B8}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}
not matches the
pattern:(?^aa:\\\\x\\{E0\\}\\\\x\\{B8\\}\\\\x\\{97\\}\\\\x\\{E0\\}\\\\x\\{B8\\}\\\\x\\{B1\\}\\\\x\\{E0\\}\\\\x\\{B8\\}\\\\x\\{99\\}\\\\x\\{E0\\}\\\\x\\{B8\\}\\\\x\\{97\\}\\\\x\\{E0\\}\\\\x\\{B8\\}\\\\x\\{B5\\})
with operator:=~
Okay, I finally had some time to sit down and poke at this rather than
just give a quick off-the-cuff shot in the dark.
I think that the uri_detail plugin is broken w/r/t matching explicit bytes
in the anchor text using the \xNN hex-escape notation, but fully analyzing
it is a little beyond my Perl skills and familiarity with the code base.
The problem may affect more than just uri_detail anchor text rules.
The doubled backslashes are just logging behavior; they are not indicators
of a problem. The "\x{E0}" is just how Perl formats the raw E0 byte for
logging. The regex should not attempt to match *that* exactly.
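To illustrate the logging point, here is a small sketch. It is in Python purely as a stand-in: SpamAssassin's logging is Perl, and its actual formatting code is not shown in this thread. The format_bytes helper is hypothetical; it only mimics the \x{HH} notation seen in the debug output.

```python
# Illustration only: repr() stands in for the escaping a debug logger applies.
pattern = r"\xe0"        # four characters as typed in a rule: backslash, x, e, 0
logged = repr(pattern)   # an escaping logger shows the backslash doubled

# Three raw bytes taken from the anchor text in the debug log.
raw = bytes([0xE0, 0xB8, 0x97])


def format_bytes(data: bytes) -> str:
    """Render each byte as \\x{HH}, mimicking the notation in the debug log."""
    return "".join(f"\\x{{{b:02X}}}" for b in data)


print(logged)             # the pattern with its backslash doubled
print(format_bytes(raw))  # the bytes in \x{HH} form
```

The pattern itself still holds a single backslash; only its printed form changes.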
Non-hex escapes work properly; the existing rule __MXG_UNSUB_LINK01, which
contains "\s", successfully matches:
dbg: uri: text matched: 'visit here to opt out or write a letter to the
address below' =~ /(?^aa:(?i)unsubscribe|opt[\\s-]out)/
...and ASCII hex escapes work too; here's "opt out" as explicit hex bytes:
dbg: uri: text matched: 'visit here to opt out or write a letter to the
address below' =~ /(?^aa:\\x6f\\x70\\x74 \\x6f\\x75\\x74)/
...but \xNN notation for non-ASCII data does not:
uri_detail __URIDETAIL_TEXT_UNICODE text =~ /\xe0/
uri: uri_detail __URIDETAIL_TEXT_UNICODE text: running '\x{E0}\x{B8}\x{95}\x{E0}\x{B9}\x{88}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{B2}\x{E0}\x{B8}\x{A2}\x{E0}\x{B8}\x{B8}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}'
=~ /(?^aa:\\xe0)/
(does not match)
Explicit Unicode hex in regexes does work *outside* the uri_detail anchor
text context:
body __UNICODE_BODY
/\xE0\xB8\x97\xE0\xB8\xB1\xE0\xB8\x99\xE0\xB8\x97\xE0\xB8\xB5/
dbg: rules: ran body rule __UNICODE_BODY ======> got hit:
"\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}"
...so you might want to just do a regular body rule for the anchor text
until this gets fixed.
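For what it's worth, one general way a hex-byte pattern can hit in one context and miss in another is when one side is handled as raw bytes and the other as decoded characters. The sketch below shows that effect in Python. It is not a diagnosis of uri_detail, and I have not confirmed that this is what happens inside the plugin. The anchor bytes are copied from the debug log above.

```python
import re

# Anchor-text bytes copied from the debug log (UTF-8 encoded Thai text).
anchor_bytes = bytes.fromhex(
    "E0B895E0B988E0B8ADE0B8ADE0B8B2E0B8A2E0B8B8"
    "E0B897E0B8B1E0B899E0B897E0B8B5"
)
# The same text as Unicode characters rather than bytes.
anchor_chars = anchor_bytes.decode("utf-8")

# Bytes pattern: each \xHH is one byte, so it lines up with the UTF-8 data.
byte_pattern = re.compile(
    rb"\xE0\xB8\x97\xE0\xB8\xB1\xE0\xB8\x99\xE0\xB8\x97\xE0\xB8\xB5"
)
# Character pattern: here \xE0 means code point U+00E0, not the byte E0.
char_pattern = re.compile(r"\xE0\xB8\x97")

matches_bytes = byte_pattern.search(anchor_bytes) is not None
matches_chars = char_pattern.search(anchor_chars) is not None

print(matches_bytes, matches_chars)
```

If something similar is going on, the same rule text would behave differently depending on whether the target was decoded before matching.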
Can anyone explain what the "(?^aa:" in the uri_detail regex means? I
suspect that's extremely relevant but I couldn't find anything online that
explains it. Maybe that specifies something related to character encoding
that's breaking the interpretation of the regex as a raw Unicode hex
string.
That appears to be added by Mail::SpamAssassin::Util compile_regexp(),
apparently due to https://bz.apache.org/SpamAssassin/show_bug.cgi?id=6802
per
https://svn.apache.org/viewvc/spamassassin/trunk/lib/Mail/SpamAssassin/Util.pm?r1=1864964&r2=1864963&pathrev=1864964&diff_format=h
Maybe we need to be a little less eager to broadly apply /aa ?
No, I took that out and it's still not hitting on the Unicode hex pattern:
dbg: uri: uri_detail __URIDETAIL_TEXT_UNICODE text: running
'\x{E0}\x{B8}\x{95}\x{E0}\x{B9}\x{88}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{B2}\x{E0}\x{B8}\x{A2}\x{E0}\x{B8}\x{B8}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}'
=~ /(?^:\\xE0\\xB8\\x97)/
(does not match)
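That removing the modifier makes no difference is at least consistent with what ASCII-restriction flags generally do: they narrow classes such as \w, \d and \s, not explicit hex escapes. In Perl, the "(?^aa:" prefix is how a compiled qr// pattern stringifies: "^" resets to the default modifiers and "aa" is the strict ASCII-restrict modifier. The sketch below uses Python's re.ASCII as a rough analogue only; treating it as equivalent to Perl's /aa is an assumption made for illustration.

```python
import re

text = "caf\u00e9"  # the last character is U+00E9, a non-ASCII word character

# \w is narrowed by the ASCII flag...
w_unicode = re.findall(r"\w+", text)
w_ascii = re.findall(r"\w+", text, flags=re.ASCII)

# ...but an explicit hex escape matches the same code point either way.
hex_unicode = re.search(r"\xe9", text) is not None
hex_ascii = re.search(r"\xe9", text, flags=re.ASCII) is not None

print(w_unicode, w_ascii, hex_unicode, hex_ascii)
```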
FWIW, pasting the raw Unicode character into the regex also does not work:
dbg: uri: uri_detail __URIDETAIL_TEXT_UNICODE text: running
'\x{E0}\x{B8}\x{95}\x{E0}\x{B9}\x{88}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{B2}\x{E0}\x{B8}\x{A2}\x{E0}\x{B8}\x{B8}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}'
=~ /(?^aa:\x{E0}\x{B8}\x{95})/
(does not match)
You should probably open a bug with your rule and attach the spample.
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhar...@impsec.org pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79