To address the problem of matching anchor text that contains Unicode characters, I've implemented a new uri_detail rule option called unicode_text. With this option, the anchor text is encoded to UTF-8 bytes before being compared against the rule's regular expression. As a result, the following rule now correctly matches the specified Thai characters:

uri_detail __TZ_PHISH_LINK_TH1 unicode_text =~ /\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}/ type =~ /^a$/

This change allows for accurate matching of Unicode characters within the uri_detail context.

--- Wed Jan 22 16:01:38 2025 UTC
@@ -151,7 +151,7 @@
       my $pattern = $3;
       my $neg = 0;
 
-      if ($target !~ /^!?(?:raw|type|cleaned|text|domain|host)$/) {
+      if ($target !~ /^!?(?:raw|type|cleaned|text|unicode_text|domain|host)$/) {
         return $Mail::SpamAssassin::Conf::INVALID_VALUE;
       }
 
@@ -271,6 +271,30 @@
         }
       }
 
+      if (exists $rule->{unicode_text}) {
+        next unless $info->{anchor_text};
+        my($op,$patt,$neg) = @{$rule->{unicode_text}};
+        my $match;
+        for my $text (@{ $info->{anchor_text} }) {
+          use Encode qw(decode encode);
+          $text = encode("UTF-8", $text);
+          if ( ($op eq '=~' && $text =~ $patt) ||
+               ($op eq '!~' && $text !~ $patt) ) {
+            dbg("uri: Match found: text:%s matches the pattern:%s with operator:%s", $text, $patt, $op);
+            $match = $text; last ;
+          } else {
+            dbg("uri: Not match: text:%s not matches the pattern:%s with operator:%s", $text, $patt, $op);
+          }
+        }
+        if ( $neg ) {
+          next if defined $match;
+          dbg("uri: text negative matched: %s /%s/", $op,$patt);
+        } else {
+          next unless defined $match;
+          dbg("uri: text matched: '%s' %s /%s/", $match,$op,$patt);
+        }
+      }
+
       if (exists $rule->{domain}) {
         my($op,$patt,$neg) = @{$rule->{domain}};
         my $match;

On Mon, Feb 3, 2025 at 10:01 AM Jimmy <thana...@gmail.com> wrote:

> Thanks for the detailed analysis of the uri_detail plugin bug. I
> appreciate you taking the time to investigate this so thoroughly.
>
> I'll open a bug report with the SpamAssassin project, including the
> details from your analysis and a sample spam email that demonstrates the
> problem.
>
> Thanks again for your help!
>
>
> On Mon, Feb 3, 2025 at 6:15 AM John Hardin <jhar...@impsec.org> wrote:
>
>> On Sun, 2 Feb 2025, Jimmy wrote:
>>
>> > dbg: uri: Not match:
>> > text:\x{E0}\x{B8}\x{95}\x{E0}\x{B9}\x{88}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{B2}\x{E0}\x{B8}\x{A2}\x{E0}\x{B8}\x{B8}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}
>> > not matches the
>> > pattern:(?^aa:\\\\x\\{E0\\}\\\\x\\{B8\\}\\\\x\\{97\\}\\\\x\\{E0\\}\\\\x\\{B8\\}\\\\x\\{B1\\}\\\\x\\{E0\\}\\\\x\\{B8\\}\\\\x\\{99\\}\\\\x\\{E0\\}\\\\x\\{B8\\}\\\\x\\{97\\}\\\\x\\{E0\\}\\\\x\\{B8\\}\\\\x\\{B5\\})
>> > with operator:=~
>>
>> Okay, I finally had some time to sit down and poke at this rather than
>> just give a quick off-the-cuff shot in the dark.
>>
>> I think that the uri_detail plugin is broken w/r/t matching explicit
>> bytes in the anchor text using the \x00 notation, but it's a little
>> beyond my Perl skills and familiarity with the code base to completely
>> analyze. The problem may affect more than just uri_detail anchor text
>> rules.
>>
>> The doubled backslashes are just logging behavior, they are not
>> indicators of a problem. The "\x{E0}" is just how Perl is formatting the
>> raw E0 byte for logging. The regex should not attempt to match *that*
>> exactly.
>>
>> Non-hex escapes work properly; the existing rule __MXG_UNSUB_LINK01 which
>> contains "\s" successfully matches:
>>
>>   dbg: uri: text matched: 'visit here to opt out or write a letter to the address below' =~ /(?^aa:(?i)unsubscribe|opt[\\s-]out)/
>>
>> ...and ASCII hex escapes work too; here's "opt out" as explicit hex bytes:
>>
>>   dbg: uri: text matched: 'visit here to opt out or write a letter to the address below' =~ /(?^aa:\\x6f\\x70\\x74 \\x6f\\x75\\x74)/
>>
>> ...but \x00 notation for non-ASCII data does not:
>>
>>   uri_detail __URIDETAIL_TEXT_UNICODE text =~ /\xe0/
>>
>>   uri: uri_detail __URIDETAIL_TEXT_UNICODE text: running '\x{E0}\x{B8}\x{95}\x{E0}\x{B9}\x{88}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{B2}\x{E0}\x{B8}\x{A2}\x{E0}\x{B8}\x{B8}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}' =~ /(?^aa:\\xe0)/
>>
>>   (does not match)
>>
>>
>> Explicit Unicode hex in regexes does work *outside* the uri_detail anchor
>> text context:
>>
>>   body __UNICODE_BODY /\xE0\xB8\x97\xE0\xB8\xB1\xE0\xB8\x99\xE0\xB8\x97\xE0\xB8\xB5/
>>
>>   dbg: rules: ran body rule __UNICODE_BODY ======> got hit: "\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}"
>>
>> ...so you might want to just do a regular body rule for the anchor text
>> until this gets fixed.
>>
>>
>> Can anyone explain what the "(?^aa:" in the uri_detail regex means? I
>> suspect that's extremely relevant but I couldn't find anything online
>> that explains it. Maybe that specifies something related to character
>> encoding that's breaking the interpretation of the regex as a raw
>> Unicode hex string.
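
On the "(?^aa:" question: as far as I can tell, that wrapper is just how Perl stringifies a compiled regex. The "^" stands for the default flag set and "aa" is the /aa modifier. Here is a minimal sketch that reproduces the prefix with plain qr//; the pattern and variable names are throwaway examples of mine, not taken from the SpamAssassin source:

```perl
use strict;
use warnings;

# Stringifying a compiled regex shows its flags in a (?^flags:...) wrapper.
my $plain = qr/opt[\s-]out/;      # default flags only
my $ascii = qr/opt[\s-]out/aa;    # /aa: ASCII-restricted matching rules

my $plain_str = "$plain";
my $ascii_str = "$ascii";

print "$plain_str\n";
print "$ascii_str\n";
```

Per perlre, /aa restricts \d, \s, \w and the POSIX classes to ASCII and tightens case-insensitive matching across the ASCII boundary. It does not change how \xNN escapes are interpreted, which is consistent with the observation that removing it made no difference.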
>>
>> That appears to be added by Mail::SpamAssassin::Util compile_regexp(),
>> apparently due to https://bz.apache.org/SpamAssassin/show_bug.cgi?id=6802
>> per
>> https://svn.apache.org/viewvc/spamassassin/trunk/lib/Mail/SpamAssassin/Util.pm?r1=1864964&r2=1864963&pathrev=1864964&diff_format=h
>> Maybe we need to be a little less eager to broadly apply /aa ?
>>
>> No, I took that out and it's still not hitting on the Unicode hex pattern:
>>
>>   dbg: uri: uri_detail __URIDETAIL_TEXT_UNICODE text: running '\x{E0}\x{B8}\x{95}\x{E0}\x{B9}\x{88}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{B2}\x{E0}\x{B8}\x{A2}\x{E0}\x{B8}\x{B8}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}' =~ /(?^:\\xE0\\xB8\\x97)/
>>
>>   (does not match)
>>
>> FWIW pasting the raw unicode character into the regex also does not work:
>>
>>   dbg: uri: uri_detail __URIDETAIL_TEXT_UNICODE text: running '\x{E0}\x{B8}\x{95}\x{E0}\x{B9}\x{88}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{B2}\x{E0}\x{B8}\x{A2}\x{E0}\x{B8}\x{B8}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}' =~ /(?^aa:\x{E0}\x{B8}\x{95})/
>>
>>   (does not match)
>>
>>
>> You should probably open a bug with your rule and attach the spample.
>>
>>
>> --
>>   John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
>>   jhar...@impsec.org                         pgpk -a jhar...@impsec.org
>>   key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
>> -----------------------------------------------------------------------
>>   Rights can only ever be individual, which means that you cannot
>>   gain a right by joining a mob, no matter how shiny the issued
>>   badges are, or how many of your neighbors are part of it. -- Marko
>> -----------------------------------------------------------------------
>>   10 days until Abraham Lincoln's and Charles Darwin's 216th Birthdays
>>
>
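
For anyone who wants to see the effect of the encode step in the unicode_text patch without a full SpamAssassin setup, here is a minimal standalone sketch. It assumes the anchor text reaches the plugin as a decoded character string (I have not confirmed that for every code path), and matches_as_utf8_bytes is a hypothetical helper of mine, not a SpamAssassin function. Unlike the patch, it encodes into a copy, so the original string is left unchanged:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not part of SpamAssassin): returns 1 when the
# UTF-8 byte form of $text matches $pattern, 0 otherwise.
sub matches_as_utf8_bytes {
    my ($text, $pattern) = @_;
    # Encode into a copy so the caller's string is not modified.
    my $bytes = encode('UTF-8', $text);
    return $bytes =~ $pattern ? 1 : 0;
}

# The five Thai characters targeted by the rule, written as code points
# (assumption: this is how the decoded anchor text looks to the plugin).
my $anchor = "\x{0E17}\x{0E31}\x{0E19}\x{0E17}\x{0E35}";

# The same byte-oriented pattern as in __TZ_PHISH_LINK_TH1.
my $pattern = qr/\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}/;

# Matching the character string directly compares code points such as
# U+0E17 against U+00E0, U+00B8, U+0097, so it cannot hit.
my $char_hit = $anchor =~ $pattern ? 1 : 0;

# After encoding, the string is the byte sequence E0 B8 97 ..., which
# is what the pattern spells out.
my $byte_hit = matches_as_utf8_bytes($anchor, $pattern);

print "character string: ", ($char_hit ? "match" : "no match"), "\n";
print "UTF-8 bytes:      ", ($byte_hit ? "match" : "no match"), "\n";
```

Run on its own, this should report "no match" for the character string and "match" for the UTF-8 bytes, which is the behavior the unicode_text option relies on.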