Chris Owen wrote:
> On Jul 13, 2009, at 2:55 PM, Charles Gregory wrote:
>
>>>> To answer your next post, I don't use '\b' because the next 'trick'
>>>> coming will likely be something looking like Xwww herenn comX... :)
>>> At that point it can be dealt with.
>
>> Well, they're getting close. I'm seeing non-alpha non-blank crud
>> cozied up to the front of the 'www' now.... :)

Not forgetting that underscores count as word characters, so there is no
\b between a '_' and the 'www'. My alternative rules are badly written
but are still hitting with the \b:

rawbody NONLINK_SHORT /^.{0,500}\b(?:H\s*T\s*T\s*P\s*[:;](?<!http:)\W{0,10}|W\s{0,10}W\s{0,10}W\s{0,10}(?:[.,\'`_+\-]\s{0,10})?(?<!www\.))[a-z0-9\-]{3,13}\s{0,10}(?:[.,\'`_+\-]\s{0,10})?(?<![a-z0-9]\.)(?:net|c\s{0,10}o\s{0,10}m|org|info|biz)\b/si
describe NONLINK_SHORT Obfuscated link near top of text
score NONLINK_SHORT 2.5

#quite strict:
rawbody NONLINK_VSHORT /^.{0,100}\bwww{0,2}(?:\. | \.| ?[,*_\-\+] ?)[a-z]{2,5}[0-9\-]{1,5}(?:\. | \.| ?[,*_\-\+] ?)(?:net|c\s{0,10}o\s{0,10}m|org|info|biz)(?:\. \S|\s*$)/s
describe NONLINK_VSHORT Specific obfuscated link form near top of text
score NONLINK_VSHORT 2.5

(These use rawbody with a caret to limit the area of matching to the
first few lines.)

So how about dropping the \b and using something looser like:

'w ?w(?!\.[a-z0-9\-]{2,12}\.(?:com|info|net|org|biz))[[:punct:]X ]{1,4}[a-z0-9\-]{2,12}[[:punct:]X ]{1,4}(?:c ?o ?m|info|n ?e ?t|o ?r ?g|biz)([[:punct:]X ]|$)'

...?

> Which of course means we've long since passed the point where any of
> these are going to do the spammers any good. That's the frustrating part.

You're making the common assumption that spammers send UCE because it
makes them money. In fact they do it because they are obnoxious
imbeciles who want to annoy people and waste as much time (human and
CPU) as possible. I don't think it really matters to them that what they
are sending is incomprehensible noise, because noise is their message.

Cheers
CK
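
P.S. For anyone who wants to check the word-boundary point, here is a
minimal sketch. It is in Python rather than SpamAssassin's Perl (\b
treats '_' as a word character in both), and it uses a deliberately
simplified, hypothetical pattern, not either of the rules above. The
sample strings are made up for illustration.

```python
import re

# Hypothetical, simplified stand-in for the leading part of a rule:
# a word boundary immediately followed by "www".
BOUNDARY_WWW = re.compile(r"\bwww", re.IGNORECASE)


def hits(text: str) -> bool:
    """Return True if the simplified boundary-plus-www pattern matches."""
    return BOUNDARY_WWW.search(text) is not None


# Made-up samples: plain, punctuation prefix, underscore prefix, letter prefix.
samples = [
    "www example com",
    "*www example com",
    "_www example com",
    "Xwww example com",
]

for sample in samples:
    print(f"{sample!r}: {hits(sample)}")
```

Punctuation such as '*' in front of the 'www' still leaves a boundary,
so the \b version keeps hitting. A leading '_' or a letter such as 'X'
does not, which is the case the looser pattern is meant to cover.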