Re: [SAtalk] Suggestion: Automated Word/Phase Discovery

Craig Hughes Fri, 01 Feb 2002 13:04:17 -0800

Way ahead of you, Don.  SA already identifies "spam phrases", which are
actually word pairs that are common in spam but uncommon in regular mail.


C

on 2/1/02 1:08 PM, Donald Greer at [EMAIL PROTECTED] wrote:

> Folks,
> I don't know if it's possible (I sure don't know how to do it myself
> ;^) but perhaps one could take a known spam database and a known
> non-spam database and use these to automatically build a list of
> possible "spammish" words (sorta like the GA, but actually finding the
> words and phrases, not the scores)?
> What I'm thinking is something like this:
> For each unique message in the database: count all unique words
> excluding _common_word_list_ ("the","a","I",etc.); find the average
> count for each word in spam and non-spam; subtract the average non-spam
> count from the average spam count; and look REALLY HARD at the top
> 5-10%.  Possibly look at the bottom 5-10% for possible negative weights
> (things that indicate the message is legit).
> One could do the same for 2-4 word phrases ("enlarge penis", "bigger
> breasts", etc.).  Once you've got this list and decided where the
> "cut-off" is, then add them to the collection and run it through the GA
> on a _SEPARATE_ spam collection and see how they score.  This is
> something that could be done periodically to keep the list of keywords
> up-to-date with modern spam.
> Just an idea.
> Don
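
The counting scheme Don describes can be sketched roughly as below. This is
only an illustration of the idea in the quoted message, not how SpamAssassin
actually builds its phrase list. The function names, the tiny stop list, and
the choice to drop common words before forming phrases are all assumptions
made for the example.

```python
import re
from collections import Counter

# Hypothetical stop list; the post only names "the", "a", "I" as examples.
COMMON_WORDS = {"the", "a", "i", "and", "to", "of", "you", "your"}

def tokens(message):
    """Lowercase the text and split it into words, dropping common words."""
    words = re.findall(r"[a-z']+", message.lower())
    return [w for w in words if w not in COMMON_WORDS]

def ngrams(words, n):
    """Return every run of n consecutive words joined into one string."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def average_counts(messages, n=1):
    """Average number of occurrences per message of each n-word phrase."""
    if not messages:
        return {}
    totals = Counter()
    for message in messages:
        totals.update(ngrams(tokens(message), n))
    return {term: count / len(messages) for term, count in totals.items()}

def rank_terms(spam, ham, n=1):
    """Sort terms by (average spam count - average non-spam count).

    The most "spammish" terms come first; the most "legit" terms come last.
    Ties are broken alphabetically so the order is stable.
    """
    spam_avg = average_counts(spam, n)
    ham_avg = average_counts(ham, n)
    terms = set(spam_avg) | set(ham_avg)
    diffs = {t: spam_avg.get(t, 0.0) - ham_avg.get(t, 0.0) for t in terms}
    return sorted(diffs.items(), key=lambda item: (-item[1], item[0]))

def top_fraction(ranked, fraction=0.10):
    """Keep the top slice of a ranked list (at least one entry if any exist)."""
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]
```

Calling rank_terms(spam, ham, n=2) on two lists of message bodies would give
the two-word phrases, and top_fraction(...) on the result gives the candidates
to look at by hand. Reversing the ranked list before slicing gives the bottom
5-10%, the possible negative-weight terms. The candidates would still need to
be scored by the GA on a separate collection, as the post suggests.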


_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
