On Sun, 2 Feb 2025, Jimmy wrote:

> dbg: uri: Not match:
> text:\x{E0}\x{B8}\x{95}\x{E0}\x{B9}\x{88}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{B2}\x{E0}\x{B8}\x{A2}\x{E0}\x{B8}\x{B8}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}
> not matches the
> pattern:(?^aa:\\\\x\\{E0\\}\\\\x\\{B8\\}\\\\x\\{97\\}\\\\x\\{E0\\}\\\\x\\{B8\\}\\\\x\\{B1\\}\\\\x\\{E0\\}\\\\x\\{B8\\}\\\\x\\{99\\}\\\\x\\{E0\\}\\\\x\\{B8\\}\\\\x\\{97\\}\\\\x\\{E0\\}\\\\x\\{B8\\}\\\\x\\{B5\\})
> with operator:=~

Okay, I finally had some time to sit down and poke at this rather than just give a quick off-the-cuff shot in the dark.

I think the uri_detail plugin is broken with respect to matching explicit non-ASCII bytes in the anchor text using \xNN notation, but fully analyzing it is a little beyond my Perl skills and familiarity with the code base. The problem may affect more than just uri_detail anchor text rules.

The doubled backslashes are just logging behavior; they do not indicate a problem. Likewise, "\x{E0}" is just how Perl formats the raw E0 byte for logging. The rule's regex should not try to match *that* literal text.

Non-hex escapes work properly; the existing rule __MXG_UNSUB_LINK01, which contains "\s", matches successfully:

dbg: uri: text matched: 'visit here to opt out or write a letter to the address below' =~ /(?^aa:(?i)unsubscribe|opt[\\s-]out)/

...and ASCII hex escapes work too; here's "opt out" as explicit hex bytes:

dbg: uri: text matched: 'visit here to opt out or write a letter to the address below' =~ /(?^aa:\\x6f\\x70\\x74 \\x6f\\x75\\x74)/

...but \xNN notation for non-ASCII data does not:

uri_detail __URIDETAIL_TEXT_UNICODE text =~ /\xe0/

uri: uri_detail __URIDETAIL_TEXT_UNICODE text: running '\x{E0}\x{B8}\x{95}\x{E0}\x{B9}\x{88}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{B2}\x{E0}\x{B8}\x{A2}\x{E0}\x{B8}\x{B8}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}' =~ /(?^aa:\\xe0)/

(does not match)


Explicit hex escapes for the UTF-8 bytes do work in regexes *outside* the uri_detail anchor text context, for example in a body rule:

body  __UNICODE_BODY /\xE0\xB8\x97\xE0\xB8\xB1\xE0\xB8\x99\xE0\xB8\x97\xE0\xB8\xB5/

dbg: rules: ran body rule __UNICODE_BODY ======> got hit: "\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}"

...so you might want to just do a regular body rule for the anchor text until this gets fixed.


Can anyone explain what the "(?^aa:" in the uri_detail regex means? I suspect that's extremely relevant but I couldn't find anything online that explains it. Maybe that specifies something related to character encoding that's breaking the interpretation of the regex as a raw Unicode hex string.

That appears to be added by compile_regexp() in Mail::SpamAssassin::Util, apparently due to
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=6802
per
https://svn.apache.org/viewvc/spamassassin/trunk/lib/Mail/SpamAssassin/Util.pm?r1=1864964&r2=1864963&pathrev=1864964&diff_format=h

Maybe we need to be a little less eager to broadly apply /aa?
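
For reference, perlre describes the "(?^aa:...)" wrapper as simply how Perl prints a compiled regex: "^" means "start from the default flags" and "aa" is the /aa modifier, which restricts \d, \s, \w, the POSIX classes and case-insensitive matching to ASCII. It does not change which code point a \xNN escape stands for. A minimal standalone sketch (plain Perl 5.14 or later, nothing from SpamAssassin; the variable names are made up for illustration):

```perl
use strict;
use warnings;

# Stringifying a compiled regex shows the flags Perl recorded for it.
my $plain = qr/opt[\s-]out/;     # no modifiers
my $ascii = qr/opt[\s-]out/aa;   # /aa: ASCII-restricted matching

print "$plain\n";   # (?^:opt[\s-]out)
print "$ascii\n";   # (?^aa:opt[\s-]out)
```

So the wrapper in the debug output only reflects that the rule was compiled with /aa; by itself it says nothing about the encoding of the text being matched.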

No, I took that out and it's still not hitting on the Unicode hex pattern:

dbg: uri: uri_detail __URIDETAIL_TEXT_UNICODE text: running '\x{E0}\x{B8}\x{95}\x{E0}\x{B9}\x{88}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{B2}\x{E0}\x{B8}\x{A2}\x{E0}\x{B8}\x{B8}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}' =~ /(?^:\\xE0\\xB8\\x97)/

(does not match)

FWIW, pasting the raw Unicode character into the regex also does not work:

dbg: uri: uri_detail __URIDETAIL_TEXT_UNICODE text: running '\x{E0}\x{B8}\x{95}\x{E0}\x{B9}\x{88}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{B2}\x{E0}\x{B8}\x{A2}\x{E0}\x{B8}\x{B8}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}' =~ /(?^aa:\x{E0}\x{B8}\x{95})/

(does not match)
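
One way to narrow this down outside SpamAssassin is a small standalone Perl test. The sketch below rests on a guess that has NOT been confirmed against the plugin code: that the anchor text reaches the regex as decoded characters rather than raw UTF-8 bytes. It only shows that a byte-wise \xNN pattern behaves differently against those two forms of the same Thai text; the variable names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes for the Thai text used in the body rule.
my $bytes = "\xE0\xB8\x97\xE0\xB8\xB1\xE0\xB8\x99\xE0\xB8\x97\xE0\xB8\xB5";

# The same text decoded to Perl characters (U+0E17 U+0E31 U+0E19 ...).
# ASSUMPTION: this stands in for what uri_detail might be handing the regex.
my $chars = decode('UTF-8', "$bytes");

# Byte-wise pattern, compiled with /aa as compile_regexp() appears to do.
my $re = qr/\xE0\xB8\x97/aa;

my $bytes_hit = $bytes =~ $re ? 1 : 0;   # pattern vs. byte string
my $chars_hit = $chars =~ $re ? 1 : 0;   # pattern vs. decoded string

printf "bytes: %d  chars: %d\n", $bytes_hit, $chars_hit;   # bytes: 1  chars: 0
```

If the plugin does turn out to be matching against decoded text, a character-form pattern such as /\x{0E17}\x{0E31}\x{0E19}/ would be the thing to test in the uri_detail rule, but that is speculation until someone checks the code.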



You should probably open a bug with your rule and attach the spample.




--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org                         pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  Rights can only ever be individual, which means that you cannot
  gain a right by joining a mob, no matter how shiny the issued
  badges are, or how many of your neighbors are part of it.  -- Marko
-----------------------------------------------------------------------
 10 days until Abraham Lincoln's and Charles Darwin's 216th Birthdays
