On Fri, 23 Jun 2006, Ramprasad wrote:

> > Yes, as SA collapses multiple spaces down to a single space (in 'body'
> > tests), you only need to look for a single instance of the space,
> > not an unlimited number. Also you can omit that final ' *' as it's
> > an optional "tail" match, thus the rule will work without it.
> >
> > IE:
> >   /1 ?- ?2 ?2 ?- ?3/
>
> Wow SA is doing a lot of work already. Can I also have a collapsed body
> string with all whitespaces removed
> so I could do
>
> collapsedbody BADNUMBER /1-22-33/
> score BADNUMBER 10
>
> I think this will also help get rid of the
>     "genu ine   uni versity  degre es"

You do -NOT- want this. As others have already pointed out, you
could no longer determine word boundaries, which would increase FP rates.
But the real reason is that you would be throwing away the one
gift that the spammers have handed you: a good indication for
separating spam from ham.
The bane of spam-fighters is FPs. Any good clues should be treasured,
not discarded.
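
To make both objections concrete, here is a rough sketch in Python (not SA
rule syntax). The collapse_all() helper is hypothetical; it only stands in
for what the proposed 'collapsedbody' view would do, and the sample strings
are made up for illustration.

```python
import re

def collapse_all(text: str) -> str:
    """Hypothetical 'collapsedbody' view: strip every whitespace character."""
    return re.sub(r"\s+", "", text)

spam = "genuine university degr ees"
ham = "genuine university degrees"

# Once all whitespace is gone, the obfuscated and the legit text are
# identical, so the spammer's telltale spacing can no longer be tested for.
print(collapse_all(spam) == collapse_all(ham))

# Word boundaries vanish too: 'meds' never appears as a word in the
# original text, but it does appear in the collapsed text.
legit = "programmed systems"
print(re.search(r"\bmeds\b", legit) is not None)
print("meds" in collapse_all(legit))
```

The first and third checks come out true and the second false, which is
the FP risk and the lost clue in one small example.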

In our environment, we have discussions of 'degrees' often.
However I've never seen a legit discussion of 'degr ees' nor 'deg rees'
etc. So a simple negative-lookahead rule makes it easy to
whack the borked versions without FPing on the correct one, e.g.:

  body FAKE_DEGREE1     /\b(?!degrees)d ?e ?g ?r ?e ?e ?s/i

will match any variant of 'degrees' that has a space inserted between
letters, but won't hit the intact word 'degrees' itself.
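
If you want to try the pattern outside of SA, a quick sketch in Python
follows. It assumes Python's re module treats this particular pattern the
same way Perl does (it uses only \b, a negative lookahead and optional
spaces), and that the input has already had runs of spaces collapsed the
way SA does for 'body' tests. The hits() helper and the sample lines are
made up for illustration.

```python
import re

# Same pattern as the FAKE_DEGREE1 body rule, with /i mapped to IGNORECASE.
FAKE_DEGREE1 = re.compile(r"\b(?!degrees)d ?e ?g ?r ?e ?e ?s", re.IGNORECASE)

def hits(text: str) -> bool:
    """Return True if the pattern matches anywhere in the text."""
    return FAKE_DEGREE1.search(text) is not None

samples = [
    "genu ine uni versity degre es",  # spaced-out variant
    "get your deg rees from home",    # another spaced-out variant
    "we discussed degrees today",     # intact word
    "Degrees are awarded in May",     # intact word, different case
]

for line in samples:
    print(hits(line), "|", line)
```

The first two lines should hit and the last two should not, since the
lookahead rejects the intact word regardless of case.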

A well-fed Bayes plus a few of these types of rules made this particular
spam a non-issue here. ;)

Bottom line, when spammers obfuscate words they usually make it -easier-
to catch, not harder. ;)

Dave

-- 
Dave Funk                                  University of Iowa
<dbfunk (at) engineering.uiowa.edu>        College of Engineering
319/335-5751   FAX: 319/384-0549           1256 Seamans Center
Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{
