Just food for thought for the next release... I have been seeing more and more spam that uses different phrases in place of "remove me".
Some use the word "cease":

Cease offer(s)
Cease update(s)
Cease email
Cease mailing(s)

John

-----Original Message-----
From: Scott A Crosby [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 08, 2003 12:37 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: [SAtalk] Re: holy cow, FN city

On Wed, 8 Oct 2003 08:34:46 -0700 (PDT), [EMAIL PROTECTED] writes:

> Wow... 10 false negatives this morning. =/
>
> Is 2.60's bayes really a lot better than 2.55's?
> Here's an example of a FN that came through this morning:
> Notice the gobbledygook text at the end -

Sure. The goal of that is to add in new tokens that are unique and have never been seen before. Those can bias an email toward neutral.

> <DIV>gmifewdxnavfo xlmdhwdeqb tftwgocpmkxh mfhfnpdaatb</DIV>
> <DIV>phjtdedsnnxdz ciwqencxdspt dztzeabyeumkc jmldxrchpoyvt
> lgnzxrcjncoyv</DIV>
> <DIV>wstcrjdwjshjsc esumvrbqll</DIV>
> <DIV>hccwdohenxnn nptaihbczsbeir tjicwvdyewxii dcekolccikrej qmgblgcgowf
> fhncedbistifx

I can see several ways of dealing with them.

First, the character probabilities of the preceding lines are very unlike English: too many consonants. So this particular case can be detected if any portion of an email contains written text that is statistically very different from ordinary English.

The spamware reaction to this is to bias the character probabilities to resemble English. So repeat the test, except use bigram (character pair) probabilities. Text that has a 'q' not followed by a 'u' would then look alien.

These statistical tests mean that spamware must use real English words, or text that at least resembles real English words. To detect the second case (text that only resembles English), have SA look up each new token in a dictionary, and note if it isn't found. Again, if one portion of a message has too many non-English words, that is a spam sign.

These could be useful tests in general to detect email in a foreign language, not just to avoid bayes poisoning.
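As a rough illustration of the bigram and dictionary checks described above, here is a minimal Python sketch. It is not SpamAssassin code: the function names, the tiny `SAMPLE_ENGLISH` corpus, and the add-one smoothing are all assumptions made for the example, and a real test would train on a large body of ham and use a proper word list.

```python
import math
import re
from collections import Counter

# Tiny stand-in corpus (an assumption for this sketch); a real filter
# would train on a large body of known-good English text.
SAMPLE_ENGLISH = (
    "the quick brown fox jumps over the lazy dog and then the dog sleeps "
    "please send me the quarterly report before the meeting this afternoon "
    "thank you for your question about the mailing list and the next release "
    "we have been seeing more and more messages that get through the filter"
)

ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # 26 letters plus space


def normalize(text):
    """Lowercase and turn every run of non-letters into a single space."""
    return re.sub(r"[^a-z]+", " ", text.lower()).strip()


def train_bigrams(corpus):
    """Count character pairs, and how often each character starts a pair."""
    text = normalize(corpus)
    pair_counts = Counter(zip(text, text[1:]))
    first_counts = Counter(text[:-1])
    return pair_counts, first_counts


def avg_bigram_logprob(text, model):
    """Mean log P(next char | current char), with add-one smoothing.

    Lower (more negative) values mean the text looks less like the corpus.
    """
    pair_counts, first_counts = model
    text = normalize(text)
    pairs = list(zip(text, text[1:]))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        p = (pair_counts[(a, b)] + 1) / (first_counts[a] + len(ALPHABET))
        total += math.log(p)
    return total / len(pairs)


def unknown_word_fraction(text, dictionary):
    """Fraction of words in the text that are not in the dictionary."""
    words = normalize(text).split()
    if not words:
        return 0.0
    return sum(w not in dictionary for w in words) / len(words)


model = train_bigrams(SAMPLE_ENGLISH)
dictionary = set(SAMPLE_ENGLISH.split())
english = "thank you for the report about the next meeting"
gibberish = "gmifewdxnavfo xlmdhwdeqb tftwgocpmkxh mfhfnpdaatb"
print(avg_bigram_logprob(english, model), avg_bigram_logprob(gibberish, model))
print(unknown_word_fraction(english, dictionary),
      unknown_word_fraction(gibberish, dictionary))
```

The gibberish from the example message should score a clearly lower average log-probability than the English sentence, and every one of its words misses the dictionary. Where to put the cutoff for "statistically very different" is left open here, as it is in the message.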
A second and perhaps stronger sign: this group of text contains a large number of tokens that have never been seen before. This can be detected with an adaptive threshold: as more ham is learned, the threshold for 'too many new tokens' can decrease.

Scott

-------------------------------------------------------
This SF.net email is sponsored by: SF.net Giveback Program.
SourceForge.net hosts over 70,000 Open Source Projects.
See the people who have HELPED US provide better services:
Click here: http://sourceforge.net/supporters.php
_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
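Scott's second suggestion, flagging a block of text with too many never-seen tokens against a threshold that tightens as more ham is learned, could be sketched roughly as follows. The class name, the whitespace tokenizer, and the exponential decay with its `start`, `floor`, and `half_life` parameters are all hypothetical choices for illustration; the message does not say how the threshold should shrink.

```python
class NewTokenDetector:
    """Flags text whose share of never-seen tokens exceeds a threshold.

    The threshold starts high and decays toward a floor as ham is learned.
    All parameter values are illustrative assumptions, not tuned numbers.
    """

    def __init__(self, start=0.9, floor=0.3, half_life=100):
        self.seen = set()          # tokens observed in learned ham
        self.ham_learned = 0       # number of ham messages learned
        self.start = start         # threshold before any training
        self.floor = floor         # threshold never drops below this
        self.half_life = half_life # ham count that halves the excess

    def learn_ham(self, text):
        """Record every token of a known-good message."""
        self.seen.update(text.lower().split())
        self.ham_learned += 1

    def threshold(self):
        """Current cutoff for 'too many new tokens' (a fraction in 0..1)."""
        decay = 0.5 ** (self.ham_learned / self.half_life)
        return self.floor + (self.start - self.floor) * decay

    def new_token_fraction(self, text):
        """Fraction of tokens in the text that were never seen in ham."""
        tokens = text.lower().split()
        if not tokens:
            return 0.0
        return sum(t not in self.seen for t in tokens) / len(tokens)

    def looks_poisoned(self, text):
        return self.new_token_fraction(text) > self.threshold()


detector = NewTokenDetector()
detector.learn_ham("thank you for the report about the next release")
print(detector.looks_poisoned("thank you for the report"))
print(detector.looks_poisoned("gmifewdxnavfo xlmdhwdeqb tftwgocpmkxh"))
```

With little training the threshold stays lenient, since ordinary mail still contains many unfamiliar tokens. As the token set grows, legitimate mail should contain fewer new tokens, so a lower cutoff becomes safe.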