Chris Owen wrote:
> On Jul 13, 2009, at 2:55 PM, Charles Gregory wrote:
> 
>>>> To answer your next post, I don't use '\b' because the next 'trick'
>>>> coming will likely be something looking like Xwww herenn comX...  :)
>>> At that point it can be dealt with.
> 
>> Well, they're getting close. I'm seeing non-alpha non-blank crud
>> cozied up to the front of the 'www' now.... :)

Don't forget that an underscore counts as a word character, so \b does
not match between '_' and 'www'.  My alternative rules are badly
written, but they are still hitting with the \b:

rawbody NONLINK_SHORT           /^.{0,500}\b(?:H\s*T\s*T\s*P\s*[:;](?<!http:)\W{0,10}|W\s{0,10}W\s{0,10}W\s{0,10}(?:[.,\'`_+\-]\s{0,10})?(?<!www\.))[a-z0-9\-]{3,13}\s{0,10}(?:[.,\'`_+\-]\s{0,10})?(?<![a-z0-9]\.)(?:net|c\s{0,10}o\s{0,10}m|org|info|biz)\b/si
describe NONLINK_SHORT          Obfuscated link near top of text
score NONLINK_SHORT             2.5

#quite strict:
rawbody NONLINK_VSHORT          /^.{0,100}\bwww{0,2}(?:\. | \.| ?[,*_\-\+] ?)[a-z]{2,5}[0-9\-]{1,5}(?:\. | \.| ?[,*_\-\+] ?)(?:net|c\s{0,10}o\s{0,10}m|org|info|biz)(?:\. \S|\s*$)/s
describe NONLINK_VSHORT         Specific obfuscated link form near top of text
score NONLINK_VSHORT            2.5

(These use rawbody with a caret to limit the area of matching to the
first few lines.)
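
To make the two points above concrete (underscore versus \b, and the
caret-plus-.{0,N} window), here is a rough Python sketch.  It is not
SpamAssassin code: Python's re module is only standing in for Perl's
engine, the patterns are cut-down stand-ins for the real rules, and the
function names are made up for illustration.

```python
import re

# \b sits between a word character ([A-Za-z0-9_]) and a non-word character.
# Underscore is a word character, so there is no boundary inside "_www".
boundary_www = re.compile(r"\bwww", re.I)

def hits_boundary(text):
    """True if 'www' appears with a word boundary directly in front of it."""
    return boundary_www.search(text) is not None

# Cut-down stand-in for the "^.{0,100}" idea.  With re.S (Perl's /s) the dot
# also matches newlines, and without re.M the caret only matches at the very
# start, so 'www' has to begin within the first 100 characters.
near_top = re.compile(r"^.{0,100}\bwww\b", re.I | re.S)

def hits_near_top(text):
    """True if a standalone 'www' starts within the first 100 characters."""
    return near_top.search(text) is not None
```

With this, "*www" still hits the \b version, while "_www" and "Xwww" do
not, which is the weakness being discussed.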

So how about dropping the \b and using something looser like:

'w ?w(?!\.[a-z0-9\-]{2,12}\.(?:com|info|net|org|biz))[[:punct:]X ]{1,4}[a-z0-9\-]{2,12}[[:punct:]X ]{1,4}(?:c ?o ?m|info|n ?e ?t|o ?r ?g|biz)([[:punct:]X ]|$)'   ...?
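
For anyone who wants to poke at that looser pattern outside
SpamAssassin, here is a quick Python sketch.  Assumptions, loudly:
Python's re has no [[:punct:]], so I approximate it with
string.punctuation; no flags are set because none were given above; and
the function name is hypothetical.

```python
import re
import string

# Approximation of Perl's [[:punct:]] for ASCII input (assumption).
PUNCT = re.escape(string.punctuation)
SEP = "[" + PUNCT + "X ]{1,4}"          # 1-4 separators: punctuation, 'X' or space

LOOSE = re.compile(
    r"w ?w"
    # Skip ordinary, un-obfuscated links such as www.example.com
    r"(?!\.[a-z0-9\-]{2,12}\.(?:com|info|net|org|biz))"
    + SEP + r"[a-z0-9\-]{2,12}" + SEP +
    r"(?:c ?o ?m|info|n ?e ?t|o ?r ?g|biz)"
    r"([" + PUNCT + r"X ]|$)"
)

def looks_obfuscated(text):
    """True if the text contains something shaped like a disguised www link."""
    return LOOSE.search(text) is not None
```

In this sketch "Xwww herenn comX" and "_www_herenn_com_" both match
without any \b, while a plain "www.example.com" is excluded by the
negative lookahead.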

> 
> 
> Which of course means we've long since passed the point where any of
> these are going to do the spammers any good.  That's the frustrating part.

You're making the common assumption that spammers send UCE because it
makes them money.  In fact they do it because they are obnoxious
imbeciles who want to annoy people and waste as much time (human and
CPU) as possible.  I don't think it really matters to them that what
they are sending is incomprehensible noise, because noise is their message.

Cheers

CK
