[SAtalk] Phrases I have modified....

VonEssen, John Wed, 08 Oct 2003 11:33:37 -0700

Just food for thought for the next release...

I have been seeing more and more spam that uses different wording for
its "remove me" phrases.

Some use the word "cease":

Cease offer(s)
Cease update(s)
Cease email
Cease mailing(s)
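
A minimal sketch of how the four phrases above could be matched, written in Python for illustration only. The pattern and the name has_cease_phrase are assumptions; this is not an existing SpamAssassin rule.

```python
import re

# Hypothetical pattern covering the four "cease ..." phrases listed above.
# Singular and plural forms are both accepted, in any letter case.
CEASE_RE = re.compile(
    r"\bcease\s+(?:offers?|updates?|email|mailings?)\b",
    re.IGNORECASE,
)


def has_cease_phrase(text: str) -> bool:
    """Return True if the text contains one of the 'cease ...' phrases."""
    return CEASE_RE.search(text) is not None
```

An ordinary sentence such as "The rain will cease soon" does not match, because "cease" must be followed by one of the listed nouns.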

John

-----Original Message-----
From: Scott A Crosby [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, October 08, 2003 12:37 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: [SAtalk] Re: holy cow, FN city

On Wed, 8 Oct 2003 08:34:46 -0700 (PDT), [EMAIL PROTECTED] writes:

> Wow... 10 false negatives this morning. =/ 
> 
> Is 2.60's bayes really a lot better than 2.55's?
> Here's an example of a FN that came through this morning:

> Notice the gobbledygook text at the end -

Sure. The goal of that is to add in new tokens that are unique and
have never been seen before. Those can bias an email toward neutral.

> <DIV>gmifewdxnavfo xlmdhwdeqb tftwgocpmkxh mfhfnpdaatb</DIV>
> <DIV>phjtdedsnnxdz ciwqencxdspt dztzeabyeumkc jmldxrchpoyvt 
> lgnzxrcjncoyv</DIV>
> <DIV>wstcrjdwjshjsc esumvrbqll</DIV>
> <DIV>hccwdohenxnn nptaihbczsbeir tjicwvdyewxii dcekolccikrej
> qmgblgcgowf
> fhncedbistifx

I can see several ways of dealing with them.

First, the character probabilities of the preceding lines are very
unlike English: too many consonants. So this particular case can be
detected if any portion of an email contains text that is
statistically very different from ordinary English. The likely
spamware reaction is to bias the character probabilities to resemble
English. The countermeasure is to repeat the test with bigram
(character pair) probabilities, so that text with a 'q' not followed
by a 'u' would look alien.
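
The bigram test described above can be sketched roughly as follows. This is an illustration under stated assumptions, not SpamAssassin code: the tiny SAMPLE_ENGLISH corpus, the add-one smoothing, and all function names are hypothetical, and a real model would be trained on far more text.

```python
import math
from collections import Counter

# Tiny stand-in corpus (assumption); a real model needs much more English text.
SAMPLE_ENGLISH = (
    "the quick brown fox jumps over the lazy dog and then "
    "questions whether this message is quite what it seems to be"
)

ALPHABET = "abcdefghijklmnopqrstuvwxyz "


def bigrams(text):
    """Return overlapping character pairs, keeping only letters and spaces."""
    cleaned = "".join(c for c in text.lower() if c in ALPHABET)
    return [cleaned[i:i + 2] for i in range(len(cleaned) - 1)]


def train(text):
    """Count each bigram and how often each character starts a bigram."""
    pairs = Counter(bigrams(text))
    firsts = Counter(p[0] for p in pairs.elements())
    return pairs, firsts


def avg_log_prob(text, model):
    """Average log P(second char | first char), with add-one smoothing."""
    pairs, firsts = model
    grams = bigrams(text)
    if not grams:
        return 0.0
    total = 0.0
    for g in grams:
        total += math.log((pairs[g] + 1) / (firsts[g[0]] + len(ALPHABET)))
    return total / len(grams)


model = train(SAMPLE_ENGLISH)
english_score = avg_log_prob("this is the question that the dog is asking", model)
gibberish_score = avg_log_prob("gmifewdxnavfo xlmdhwdeqb tftwgocpmkxh", model)
```

A lower average log-probability means the text looks less like the training sample. Here the gibberish line from the quoted message scores lower than the English sentence; where to put the cutoff is left open.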

These statistical tests mean that spamware must use real English
words, or text that at least resembles real English words. To detect
the second case, have SA look up each new token in a dictionary and
note when it isn't found. Again, if one portion of a message has too
many non-English words, that is a spam sign.
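
The dictionary lookup could look something like the sketch below. The miniature DICTIONARY set and the function name unknown_fraction are assumptions made for illustration; a real check would load a full word list and pick its own cutoff for "too many".

```python
import re

# Hypothetical miniature word list; a real check would load a full dictionary.
DICTIONARY = {
    "the", "goal", "of", "that", "is", "to", "add", "in", "new",
    "tokens", "are", "unique", "and", "have", "never", "been", "seen",
}


def unknown_fraction(text, dictionary=DICTIONARY):
    """Return the share of alphabetic words in the text not found in the dictionary."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    unknown = sum(1 for w in words if w not in dictionary)
    return unknown / len(words)
```

Applied to each portion of a message separately (for example, line by line), a portion whose fraction is close to 1.0 would stand out from the rest of the text.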

These could also be useful as general tests for detecting email in a
foreign language, not just for countering Bayes poisoning.

A second and perhaps stronger sign: this group of text contains a
large number of tokens that have never been seen before. This can be
detected with an adaptive threshold: as more ham is learned, the
threshold for 'too many new tokens' can decrease.
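
One way the adaptive threshold might work is sketched below. The class name NewTokenDetector, the decay formula, and the parameters start, floor, and half_life are all assumptions chosen for illustration; the message above does not specify how the threshold should shrink.

```python
class NewTokenDetector:
    """Flags text whose share of never-seen tokens exceeds an adaptive threshold."""

    def __init__(self, start=0.9, floor=0.3, half_life=100):
        self.seen = set()           # tokens learned from ham so far
        self.ham_count = 0          # number of ham messages learned
        self.start = start          # threshold before any ham is learned
        self.floor = floor          # threshold approached after much ham
        self.half_life = half_life  # ham count at which the gap is halved

    def learn_ham(self, text):
        """Record the tokens of one ham message."""
        self.seen.update(text.lower().split())
        self.ham_count += 1

    def threshold(self):
        """Decay from start toward floor as more ham is learned."""
        decay = self.half_life / (self.half_life + self.ham_count)
        return self.floor + (self.start - self.floor) * decay

    def new_fraction(self, text):
        """Return the share of tokens in the text that were never seen in ham."""
        tokens = text.lower().split()
        if not tokens:
            return 0.0
        return sum(t not in self.seen for t in tokens) / len(tokens)

    def is_suspicious(self, text):
        return self.new_fraction(text) > self.threshold()
```

With these hypothetical defaults, a block made up entirely of unseen tokens is flagged from the start, while a block that is only half new tokens is flagged only once enough ham has been learned to bring the threshold below 0.5.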

Scott

-------------------------------------------------------
This SF.net email is sponsored by: SF.net Giveback Program.
SourceForge.net hosts over 70,000 Open Source Projects.
See the people who have HELPED US provide better services:
Click here: http://sourceforge.net/supporters.php
_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
