Re: Issue with Matching UTF-8 Anchor Text in URIDetail plugin

2025-02-03 Thread Jimmy
To address the issue of matching anchor text containing Unicode characters, I've implemented a new rule option called unicode_text. This option ensures that the anchor text is converted to Unicode before being compared against the rule's regular expression. As a result, the following rule now cor
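
The distinction behind that conversion can be illustrated outside SpamAssassin. What follows is only a minimal sketch, not the plugin's code and not the unicode_text patch itself. It assumes the anchor text reaches the comparison as raw UTF-8 octets, as the \x{E0}\x{B8}... debug output elsewhere in this thread suggests. The variable names ($octets, $chars, $octet_patt, $char_patt) and the yesno helper are made up for illustration. Encode::decode is the standard core-module call.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 octets for the Thai characters U+0E17 U+0E31 U+0E19 U+0E17 U+0E35,
# the same byte sequence that ends the debug line quoted in this thread.
my $octets = "\xE0\xB8\x97\xE0\xB8\xB1\xE0\xB8\x99\xE0\xB8\x97\xE0\xB8\xB5";

# Decoding turns the 15 octets into 5 characters.
my $chars = decode('UTF-8', $octets);

# The same two leading characters, written once as bytes and once as code points.
my $octet_patt = qr/\xE0\xB8\x97\xE0\xB8\xB1/;
my $char_patt  = qr/\x{0E17}\x{0E31}/;

# Hypothetical helper, only for readable output.
sub yesno { return $_[0] ? 'match' : 'no match' }

printf "lengths: %d octets, %d characters\n", length($octets), length($chars);
printf "octets  vs octet pattern:     %s\n", yesno($octets =~ $octet_patt);
printf "octets  vs character pattern: %s\n", yesno($octets =~ $char_patt);
printf "decoded vs octet pattern:     %s\n", yesno($chars  =~ $octet_patt);
printf "decoded vs character pattern: %s\n", yesno($chars  =~ $char_patt);
```

Only the two like-for-like comparisons match: octets against the octet pattern, and decoded text against the character pattern. A rule and its input therefore have to agree on one representation, which is what converting the anchor text before the comparison is meant to guarantee.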

Re: Issue with Matching UTF-8 Anchor Text in URIDetail plugin

2025-02-02 Thread Jimmy
Thanks for the detailed analysis of the uri_detail plugin bug. I appreciate you taking the time to investigate this so thoroughly. I'll open a bug report with the SpamAssassin project, including the details from your analysis and a sample spam email that demonstrates the problem. Thanks again fo

Re: Issue with Matching UTF-8 Anchor Text in URIDetail plugin

2025-02-02 Thread John Hardin
On Sun, 2 Feb 2025, Jimmy wrote: dbg: uri: Not match: text:\x{E0}\x{B8}\x{95}\x{E0}\x{B9}\x{88}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{B2}\x{E0}\x{B8}\x{A2}\x{E0}\x{B8}\x{B8}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5} not matches the patt

Re: Issue with Matching UTF-8 Anchor Text in URIDetail plugin

2025-02-01 Thread Jimmy
When adding debug output to the source like this: if (exists $rule->{text}) { next unless $info->{anchor_text}; my($op,$patt,$neg) = @{$rule->{text}}; my $match; for my $text (@{ $info->{anchor_text} }) {if ( ($op eq '=~' && $text =~ $patt) || ($op

Re: Issue with Matching UTF-8 Anchor Text in URIDetail plugin

2025-02-01 Thread John Hardin
On Sun, 2 Feb 2025, Jimmy wrote: Hello, I am experiencing difficulties creating a rule to match UTF-8 anchor text using the plugin, and I suspect there might be a bug related to UTF-8 matching. For example, I attempted to use the following rule: uri_detail UNICODE_LINK_TEXT text =~ /\\x{E0}\\

Issue with Matching UTF-8 Anchor Text in URIDetail plugin

2025-02-01 Thread Jimmy
Hello, I am experiencing difficulties creating a rule to match UTF-8 anchor text using the plugin, and I suspect there might be a bug related to UTF-8 matching. For example, I attempted to use the following rule: uri_detail UNICODE_LINK_TEXT text =~ /\\x{E0}\\x{B8}\\x{97}\\x{E0}\\x{B8}\\x{B1}\\x