Way ahead of you, Don. SA already identifies "spam phrases", which are actually word pairs that are common in spam but uncommon in regular mail.
C

on 2/1/02 1:08 PM, Donald Greer at [EMAIL PROTECTED] wrote:

> Folks,
> I don't know if it's possible (I sure don't know how to do it myself
> ;^) but perhaps one could take a known spam database and a known
> non-spam database and use these to automatically build a list of
> possible "spammish" words (sorta like the GA, but actually finding the
> words and phrases, not the scores)?
>
> What I'm thinking is something like this:
> For each unique message in the database: count all unique words
> excluding _common_word_list_ ("the","a","I",etc.); find the average
> count for each word in spam and non-spam; subtract the average non-spam
> count from the average spam count; and look REALLY HARD at the top
> 5-10%. Possibly look at the bottom 5-10% for possible negative weights
> (things that indicate the message is legit).
>
> One could do the same for 2-4 word phrases ("enlarge penis", "bigger
> breasts", etc.). Once you've got this list and decided where the
> "cut-off" is, then add them to the collection and run it through the GA
> on a _SEPARATE_ spam collection and see how they score. This is
> something that could be done periodically to keep the list of keywords
> up-to-date with modern spam.
>
> Just an idea.
>
> Don

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
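
For anyone who wants to try the counting scheme Don describes, here is a rough Python sketch. It is not how SA does it; it is only one reading of the steps in the quoted message. The stop list, the tokenizer regex, and all function names are assumptions made up for illustration. "Count all unique words" is read here as "count occurrences of each distinct word", and phrases are built after the common words are removed, which is a simplification.

```python
import re
from collections import Counter

# Assumption: a tiny stand-in for Don's _common_word_list_, which is not given.
COMMON_WORDS = {"the", "a", "i", "and", "to", "of"}


def tokens(message):
    """Lowercase the message and split it into words, dropping common words."""
    words = re.findall(r"[a-z']+", message.lower())
    return [w for w in words if w not in COMMON_WORDS]


def ngrams(words, n):
    """Return all runs of n consecutive words, joined by single spaces."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def average_counts(messages, n=1):
    """Average number of occurrences of each n-word term per unique message."""
    unique = set(messages)  # "for each unique message in the database"
    if not unique:
        return {}
    totals = Counter()
    for msg in unique:
        totals.update(ngrams(tokens(msg), n))
    return {term: count / len(unique) for term, count in totals.items()}


def spammish_scores(spam, ham, n=1):
    """Rank terms by (average spam count - average non-spam count), highest first."""
    spam_avg = average_counts(spam, n)
    ham_avg = average_counts(ham, n)
    terms = set(spam_avg) | set(ham_avg)
    scores = {t: spam_avg.get(t, 0.0) - ham_avg.get(t, 0.0) for t in terms}
    # Ties are broken alphabetically so the ranking is reproducible.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def top_and_bottom(ranked, fraction=0.1):
    """Return the top and bottom slices (e.g. 10%) of a ranked term list."""
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k], ranked[-k:]


if __name__ == "__main__":
    # Toy corpora, invented purely for illustration.
    spam = ["enlarge your penis now", "bigger breasts now"]
    ham = ["meeting at noon", "lunch now"]
    for n in (1, 2):
        top, bottom = top_and_bottom(spammish_scores(spam, ham, n))
        print(n, "top:", top, "bottom:", bottom)
```

The top slice is the candidate list of "spammish" terms and the bottom slice is the candidate list for negative weights. Choosing the cut-off and scoring the candidates with the GA on a separate collection, as Don suggests, is left out of the sketch.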