Re: URI with spaces are not recognized

Franz Schwartau Sat, 14 Feb 2009 02:52:18 -0800

Hi John!

John Hardin wrote:
> On Fri, 13 Feb 2009, Benny Pedersen wrote:
> 
>> On Fri, February 13, 2009 18:12, John Hardin wrote:
>>> If a URI rule works, what's wrong with a body rule?
>>
>> nothing wrong with making bad rules either, point is that if bad rules
>> are needed one also has a badly behaving browser problem
> 
> Why should the fact that a mail client won't render that URI as a
> clickable link mean there shouldn't be a rule for it? Spammers have been
> obfuscating URIs in this manner for a long time. There's nothing wrong
> with rules for obfuscated URIs.

Thanks for pointing that out! :-) Our primary goal is to identify spam, not
to prevent people from typing these obfuscated URLs into their browsers...

> OT: Benny, could you refrain from setting your Reply-To to the email
> address of the original poster? Setting it to the mailing list address
> is fine, but setting it to the original poster is just
> passive-aggressive rudeness.
> 
> On Fri, 13 Feb 2009, Franz Schwartau wrote:
> 
>> So, does anyone know a more general solution for this kind of spam
>> instead of individual body rules?
> 
> You might try a rule like:
> 
>  body URI_SPC_OBFU_SPC /\bwww\s{1,20}\.\s{1,20}\w{5,20}\s{1,20}\.\s{1,20}net\b/i
> 
> I think it would be risky to make the URI parser attempt too much
> deobfuscation; however, accepting \s+\.\s+ as \. might be justified.
> Perhaps \s+dot\s+ as well.
> 
> If the spammer uses something more complex they're reducing the
> likelihood the recipient will bother to deobfuscate the URI, and it's
> more likely to be caught by bayes, so I'd suggest the ROI to SA for
> making it more aggressive isn't large enough.

I thought about this generic body rule, too. Unfortunately, this rule
also catches legitimate mistyped URLs containing spaces. Think of users
typing URLs fast and hitting the space bar accidentally while typing. ;-)
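
To illustrate, here is a quick Perl check of the pattern against a few
lines I made up. The sample strings and the helper name rule_hits are
only assumptions for this sketch, not anything from SpamAssassin itself.
The pattern cannot tell who typed the spaces, so a URL written with
whitespace around both dots hits no matter where it came from:

```perl
use strict;
use warnings;

# Pattern copied from the suggested URI_SPC_OBFU_SPC body rule.
my $rule = qr/\bwww\s{1,20}\.\s{1,20}\w{5,20}\s{1,20}\.\s{1,20}net\b/i;

# Hypothetical helper: returns 1 if the rule pattern matches, else 0.
sub rule_hits {
    my ($text) = @_;
    return $text =~ $rule ? 1 : 0;
}

# Made-up sample lines, not taken from real mail.
my @samples = (
    'buy at www . cheappills . net now',   # spammer-style obfuscation
    'my page is www . example . net',      # a user typing sloppily
    'see www.example.net',                 # ordinary URL without spaces
);

for my $text (@samples) {
    printf "%-40s %s\n", $text, rule_hits($text) ? 'hit' : 'no hit';
}
```

Note that the pattern wants at least one whitespace character on both
sides of both dots, so the third line is not matched.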

After reading PerMsgStatus.pm again, another idea came up. Instead of
modifying $schemelessRE (which wouldn't help anyway), URLs containing
spaces are replaced by URLs without spaces before SpamAssassin gathers
URIs. Thus all URI-specific rules can be applied (e.g. the uri directive
and URI blacklists).

The regexp is intentionally kept simple. It also matches legitimate URLs
(without spaces), but this doesn't hurt much, since the replacement
leaves them unchanged.
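
For anyone who wants to try the substitution outside of SpamAssassin,
here is a minimal standalone sketch. The function name
collapse_spaced_url and the sample lines are made up for this example.
Only the s/// expression itself is taken from the patch:

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution used in the patch.
# Removes up to two whitespace characters on each side of the dots
# in www.<name>.net / www.<name>.org and returns the rewritten line.
sub collapse_spaced_url {
    my ($line) = @_;
    $line =~ s/(www)\s{0,2}\.\s{0,2}([a-z\d._-]{10,32})\s{0,2}\.\s{0,2}((net|org))/$1.$2.$3/i;
    return $line;
}

# Made-up sample lines, not taken from real mail.
my @samples = (
    'Visit www . cheap-pills-4u . net today',      # obfuscated with spaces
    'See www.spamassassin-test.org for details',   # already a normal URL
    'no link here',                                # nothing to rewrite
);

print collapse_spaced_url($_), "\n" for @samples;
```

The first line should come out as "Visit www.cheap-pills-4u.net today",
while the other two pass through unchanged. Since there is no /g
modifier, only the first matching URL in each line is rewritten.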

This patch works for me and perhaps someone else finds it useful.
Comments are welcome, too. :-)

        Best regards
                Franz

--- PerMsgStatus.pm.new.orig    2009-02-14 11:21:20.000000000 +0100
+++ PerMsgStatus.pm.new 2009-02-14 11:20:54.000000000 +0100
@@ -1417,7 +1417,13 @@
 =cut
 
 sub get_decoded_stripped_body_text_array {
-  return $_[0]->{msg}->get_rendered_body_text_array();
+  my $textary = $_[0]->{msg}->get_rendered_body_text_array();
+
+  for (@$textary) {
+    s/(www)\s{0,2}\.\s{0,2}([a-z\d._-]{10,32})\s{0,2}\.\s{0,2}((net|org))/$1.$2.$3/i;
+  }
+
+  return $textary;
 }
 
 ###########################################################################
