[SAtalk] Suggestion: Automated Word/Phrase Discovery

Donald Greer Fri, 01 Feb 2002 12:53:03 -0800

   Folks,
   I don't know if it's possible (I sure don't know how to do it myself 
;^) but perhaps one could take a known spam database and a known 
non-spam database and use these to automatically build a list of 
possible "spammish" words (sorta like the GA, but actually finding the 
words and phrases, not the scores)?
   What I'm thinking is something like this:
   For each unique message in the database: count the occurrences of 
each word, excluding a _common_word_list_ ("the", "a", "I", etc.); find 
the average per-message count for each word in spam and in non-spam; 
subtract the average non-spam count from the average spam count; and 
look REALLY HARD at the top 5-10%.  Possibly look at the bottom 5-10% 
too, for negative weights (things that indicate the message is legit).
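
   A rough sketch of the counting steps above, in Python.  The tiny 
stop list, the tokenizing regex, and all function names are my own 
placeholders, not anything that exists in SpamAssassin; a real run 
would need a proper common-word list and real mail parsing.

```python
import re
from collections import Counter

# Placeholder stop list (assumption): only the three words named in the post.
COMMON_WORDS = {"the", "a", "i"}


def word_counts(message):
    """Count each word in one message, skipping the common words."""
    words = re.findall(r"[a-z']+", message.lower())
    return Counter(w for w in words if w not in COMMON_WORDS)


def average_counts(messages):
    """Average per-message count of each word over the unique messages."""
    unique = set(messages)
    if not unique:
        return {}
    total = Counter()
    for message in unique:
        total.update(word_counts(message))
    return {word: count / len(unique) for word, count in total.items()}


def spam_scores(spam, nonspam):
    """Average spam count minus average non-spam count, per word."""
    spam_avg = average_counts(spam)
    nonspam_avg = average_counts(nonspam)
    vocab = set(spam_avg) | set(nonspam_avg)
    return {w: spam_avg.get(w, 0.0) - nonspam_avg.get(w, 0.0) for w in vocab}


def top_and_bottom(scores, fraction=0.10):
    """Return the highest- and lowest-scoring slices of the vocabulary."""
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k], ranked[-k:]


spam = ["buy viagra now", "viagra viagra cheap"]
nonspam = ["meeting now at noon", "lunch at noon"]
scores = spam_scores(spam, nonspam)
top, bottom = top_and_bottom(scores, fraction=0.25)
```

   The "top" list holds the candidates for spammish words and "bottom" 
the candidates for negative weights; both would still need a human to 
look them over before going anywhere near the GA.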
   One could do the same for 2-4 word phrases ("enlarge penis", "bigger 
breasts", etc.).  Once you've got this list and decided where the 
"cut-off" is, then add them to the collection and run it through the GA 
on a _SEPARATE_ spam collection and see how they score.  This is 
something that could be done periodically to keep the list of keywords 
up-to-date with modern spam.
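
   For the phrase case, one possible way to pull out every 2-4 word 
run from a message is sketched below in Python.  The function names 
and the tokenizing regex are placeholders of mine; the resulting 
counts could then be averaged and subtracted the same way as the 
single-word counts described earlier in this message.

```python
import re
from collections import Counter


def phrases(message, min_len=2, max_len=4):
    """Yield every run of min_len..max_len consecutive words."""
    words = re.findall(r"[a-z']+", message.lower())
    for n in range(min_len, max_len + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def phrase_counts(message):
    """Count each 2-4 word phrase in one message."""
    return Counter(phrases(message))


example = list(phrases("enlarge your wallet today"))
counts = phrase_counts("act now act now")
```

   Common words are deliberately left in here, since dropping them 
would glue together words that were never adjacent in the message.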
   Just an idea.
   Don


-- 
--------------------------------------------------------
Donald L. Greer, Jr                  [EMAIL PROTECTED]
System Administrator                 Voice: 512-300-0176
AustinTX                        http://www.AustinTX.COM/
   All opinions are my own.  Flame me directly.

"I don't necessarily believe software should be free...
but if you pay for it, it should work!" -- Me


_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
