On Fri, 23 Jun 2006, Ramprasad wrote:

> > Yes, as SA collapses multiple spaces down to a single space (in 'body'
> > tests), you only need to look for a single instance of the space,
> > not an unlimited number. Also you can omit that final ' *' as it's
> > an optional "tail" match, thus the rule will work without it.
> >
> > IE:
> > /1 ?- ?2 ?2 ?- ?3/
>
> Wow SA is doing a lot of work already. Can I also have a collapsed body
> string with all whitespaces removed
> so I could do
>
> collapsedbody BADNUMBER /1-22-33/
> score BADNUMBER 10
>
> I think this will also help get rid of the
> "genu ine uni versity degre es"

You do -NOT- want this. As others have already pointed out, you could no
longer determine word boundaries, which would increase FP rates.

But the real reason is that you would be throwing away the one gift that
the spammers have handed you: a good indication for separating spam from
ham. The bane of spam-fighters is FPs. Any good clues should be treasured,
not discarded.

In our environment, we have discussions of 'degrees' often. However, I've
never seen a legit discussion of 'degr ees' nor 'deg rees', etc. So a
simple negative-lookahead rule makes it easy to whack the borked version
but not FP on the correct one, EG:

  body FAKE_DEGREE1 /\b(?!degrees)d ?e ?g ?r ?e ?e ?s/i

will match any variant of 'degrees' with spaces inserted between its
letters but won't hit the word 'degrees' itself.

A well fed Bayes plus a few of these types of rules made this particular
spam a non-issue here. ;)

Bottom line: when spammers obfuscate words they usually make it -easier-
to catch, not harder. ;)

Dave

-- 
Dave Funk                                  University of Iowa
<dbfunk (at) engineering.uiowa.edu>        College of Engineering
319/335-5751   FAX: 319/384-0549           1256 Seamans Center
Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{
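
For anyone who wants to poke at the negative-lookahead idea outside of SA, here is a minimal Python sketch of the FAKE_DEGREE1 pattern. It is not SpamAssassin itself: it assumes Python's `re` module handles `\b`, `(?!...)` and case-insensitive matching the same way Perl does for this pattern, and `collapse_spaces` and `is_obfuscated_degrees` are hypothetical helper names that only roughly imitate the whitespace collapsing SA does in 'body' tests.

```python
import re

# Same pattern as the FAKE_DEGREE1 rule, written for Python's re module.
# (?!degrees) rejects the plain word; the " ?" pieces allow one optional
# space between letters.
FAKE_DEGREE = re.compile(r"\b(?!degrees)d ?e ?g ?r ?e ?e ?s", re.IGNORECASE)


def collapse_spaces(text: str) -> str:
    """Hypothetical stand-in for SA's body handling: squeeze whitespace runs
    down to a single space."""
    return re.sub(r"\s+", " ", text)


def is_obfuscated_degrees(text: str) -> bool:
    """Return True if the text contains a spaced-out form of 'degrees'."""
    return FAKE_DEGREE.search(collapse_spaces(text)) is not None


if __name__ == "__main__":
    samples = [
        "genu ine uni versity degre es",
        "deg rees",
        "We discussed degrees today",
    ]
    for sample in samples:
        print(repr(sample), is_obfuscated_degrees(sample))
```

The point of keeping the spaces in the input is the same as in the rule above: the plain word is rejected by the lookahead, so only the broken-up spellings are flagged.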