To fix uri_detail matching of anchor text that contains non-ASCII
characters, I've implemented a new rule option called unicode_text.  This
option encodes the anchor text as UTF-8 before comparing it against the
rule's regular expression, so the pattern can be written as the explicit
UTF-8 byte sequence.  With the patch below, the following rule now matches
the intended Thai text:

uri_detail  __TZ_PHISH_LINK_TH1  unicode_text =~ /\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}/  type =~ /^a$/

This change lets uri_detail rules match non-ASCII anchor text by its UTF-8
byte sequence.
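
To illustrate why the encoding step matters, here is a minimal standalone
sketch, separate from the plugin itself.  It assumes the anchor text reaches
the plugin as a decoded character string (my reading of the behavior, not
something I have confirmed in the code), and the variable names are only for
illustration:

```perl
use strict;
use warnings;
use Encode qw(encode);

# The Thai text from the rule, as five code points
# (assumption: anchor text is held as decoded characters).
my $chars = "\x{0E17}\x{0E31}\x{0E19}\x{0E17}\x{0E35}";

# The rule's pattern: the same text written as its 15 UTF-8 bytes.
my $byte_patt = qr/\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}/;

# Encode a copy of the text to UTF-8 bytes, as unicode_text does.
my $bytes = encode("UTF-8", $chars);

# The character string contains no U+00E0, so the byte pattern misses;
# the encoded string contains exactly those bytes, so it hits.
my $char_hit = ($chars =~ $byte_patt) ? 1 : 0;
my $byte_hit = ($bytes =~ $byte_patt) ? 1 : 0;

printf "length: chars=%d bytes=%d\n", length($chars), length($bytes);
printf "match:  chars=%d bytes=%d\n", $char_hit, $byte_hit;
```

The character string is 5 code points long and does not match the byte
pattern, while the encoded string is 15 bytes long and does.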


--- Wed Jan 22 16:01:38 2025 UTC
@@ -151,7 +151,7 @@
  my $pattern = $3;
         my $neg = 0;

- if ($target !~ /^!?(?:raw|type|cleaned|text|domain|host)$/) {
+ if ($target !~ /^!?(?:raw|type|cleaned|text|unicode_text|domain|host)$/) {
     return $Mail::SpamAssassin::Conf::INVALID_VALUE;
  }

@@ -271,6 +271,30 @@
       }
     }

+    if (exists $rule->{unicode_text}) {
+      next unless $info->{anchor_text};
+      my($op,$patt,$neg) = @{$rule->{unicode_text}};
+      my $match;
+      for my $chars (@{ $info->{anchor_text} }) {
+        use Encode qw(encode);
+        my $text = encode("UTF-8", $chars);  # encode a copy; $chars aliases the stored text
+        if ( ($op eq '=~' && $text =~ $patt) ||
+             ($op eq '!~' && $text !~ $patt) ) {
+          dbg("uri: Match found: text:%s matches the pattern:%s with operator:%s", $text, $patt, $op);
+          $match = $text; last;
+        } else {
+          dbg("uri: Not match: text:%s not matches the pattern:%s with operator:%s", $text, $patt, $op);
+        }
+      }
+      if ( $neg ) {
+        next if defined $match;
+        dbg("uri: text negative matched: %s /%s/", $op,$patt);
+      } else {
+        next unless defined $match;
+        dbg("uri: text matched: '%s' %s /%s/", $match,$op,$patt);
+      }
+    }
+
     if (exists $rule->{domain}) {
       my($op,$patt,$neg) = @{$rule->{domain}};
       my $match;

On Mon, Feb 3, 2025 at 10:01 AM Jimmy <thana...@gmail.com> wrote:

> Thanks for the detailed analysis of the uri_detail plugin bug.  I
> appreciate you taking the time to investigate this so thoroughly.
>
> I'll open a bug report with the SpamAssassin project, including the
> details from your analysis and a sample spam email that demonstrates the
> problem.
>
> Thanks again for your help!
>
>
> On Mon, Feb 3, 2025 at 6:15 AM John Hardin <jhar...@impsec.org> wrote:
>
>> On Sun, 2 Feb 2025, Jimmy wrote:
>>
>> > dbg: uri: Not match:
>> > text:\x{E0}\x{B8}\x{95}\x{E0}\x{B9}\x{88}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{B2}\x{E0}\x{B8}\x{A2}\x{E0}\x{B8}\x{B8}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}
>> > not matches the
>> > pattern:(?^aa:\\\\x\\{E0\\}\\\\x\\{B8\\}\\\\x\\{97\\}\\\\x\\{E0\\}\\\\x\\{B8\\}\\\\x\\{B1\\}\\\\x\\{E0\\}\\\\x\\{B8\\}\\\\x\\{99\\}\\\\x\\{E0\\}\\\\x\\{B8\\}\\\\x\\{97\\}\\\\x\\{E0\\}\\\\x\\{B8\\}\\\\x\\{B5\\})
>> > with operator:=~
>>
>> Okay, I finally had some time to sit down and poke at this rather than
>> just give a quick off-the-cuff shot in the dark.
>>
>> I think that the uri_detail plugin is broken w/r/t matching explicit
>> bytes in the anchor text using the \x00 notation, but it's a little
>> beyond my Perl skills and familiarity with the code base to completely
>> analyze. The problem may affect more than just uri_detail anchor text
>> rules.
>>
>> The doubled backslashes are just logging behavior, they are not
>> indicators of a problem. The "\x{E0}" is just how Perl is formatting
>> the raw E0 byte for logging. The regex should not attempt to match
>> *that* exactly.
>>
>> Non-hex escapes work properly; the existing rule __MXG_UNSUB_LINK01 which
>> contains "\s" successfully matches:
>>
>> dbg: uri: text matched: 'visit here to opt out or write a letter to the
>> address below' =~ /(?^aa:(?i)unsubscribe|opt[\\s-]out)/
>>
>> ...and ASCII hex escapes work too; here's "opt out" as explicit hex bytes:
>>
>> dbg: uri: text matched: 'visit here to opt out or write a letter to the
>> address below' =~ /(?^aa:\\x6f\\x70\\x74 \\x6f\\x75\\x74)/
>>
>> ...but \x00 notation for non-ASCII data does not:
>>
>> uri_detail __URIDETAIL_TEXT_UNICODE text =~ /\xe0/
>>
>> uri: uri_detail __URIDETAIL_TEXT_UNICODE text: running
>> '\x{E0}\x{B8}\x{95}\x{E0}\x{B9}\x{88}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{B2}\x{E0}\x{B8}\x{A2}\x{E0}\x{B8}\x{B8}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}'
>>
>> =~ /(?^aa:\\xe0)/
>>
>> (does not match)
>>
>>
>> Explicit Unicode hex in regexes does work *outside* the uri_detail anchor
>> text context:
>>
>> body  __UNICODE_BODY
>> /\xE0\xB8\x97\xE0\xB8\xB1\xE0\xB8\x99\xE0\xB8\x97\xE0\xB8\xB5/
>>
>> dbg: rules: ran body rule __UNICODE_BODY ======> got hit:
>> "\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}"
>>
>> ...so you might want to just do a regular body rule for the anchor text
>> until this gets fixed.
>>
>>
>> Can anyone explain what the "(?^aa:" in the uri_detail regex means? I
>> suspect that's extremely relevant but I couldn't find anything online
>> that explains it. Maybe that specifies something related to character
>> encoding that's breaking the interpretation of the regex as a raw
>> Unicode hex string.
>>
>> That appears to be added by Mail::SpamAssassin::Util compile_regexp(),
>> apparently due to https://bz.apache.org/SpamAssassin/show_bug.cgi?id=6802
>> per
>> https://svn.apache.org/viewvc/spamassassin/trunk/lib/Mail/SpamAssassin/Util.pm?r1=1864964&r2=1864963&pathrev=1864964&diff_format=h
>>
>> Maybe we need to be a little less eager to broadly apply /aa ?
>>
>> No, I took that out and it's still not hitting on the Unicode hex pattern:
>>
>> dbg: uri: uri_detail __URIDETAIL_TEXT_UNICODE text: running
>> '\x{E0}\x{B8}\x{95}\x{E0}\x{B9}\x{88}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{B2}\x{E0}\x{B8}\x{A2}\x{E0}\x{B8}\x{B8}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}'
>>
>> =~ /(?^:\\xE0\\xB8\\x97)/
>>
>> (does not match)
>>
>> FWIW pasting the raw Unicode character into the regex also does not work:
>>
>> dbg: uri: uri_detail __URIDETAIL_TEXT_UNICODE text: running
>> '\x{E0}\x{B8}\x{95}\x{E0}\x{B9}\x{88}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{B2}\x{E0}\x{B8}\x{A2}\x{E0}\x{B8}\x{B8}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}'
>>
>> =~ /(?^aa:\x{E0}\x{B8}\x{95})/
>>
>> (does not match)
>>
>>
>>
>> You should probably open a bug with your rule and attach the spample.
>>
>>
>>
>>
>> --
>>   John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
>>   jhar...@impsec.org                         pgpk -a jhar...@impsec.org
>>   key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
>> -----------------------------------------------------------------------
>>    Rights can only ever be individual, which means that you cannot
>>    gain a right by joining a mob, no matter how shiny the issued
>>    badges are, or how many of your neighbors are part of it.  -- Marko
>> -----------------------------------------------------------------------
>>   10 days until Abraham Lincoln's and Charles Darwin's 216th Birthdays
>>
>
