Re: [SAtalk] Spam Collecting

Rich Puhek Fri, 16 Jan 2004 14:37:37 -0800

Gary Funck wrote:


It is a pain, esp. on a big mailbox, and you need a large sample of, say,
2000 or so each of ham and spam to train the Bayes engine.

What I did was fire up 'mutt' and use its 'tag' capabilities to
tag the spam that I wanted to extract and deposit into my spam sample. It is
important to remember that this low-scoring spam
is exactly the stuff that will help Bayes do a better job.
Anyway, I'd first sort by sender's address, and then find the
obvious outliers who were spammers, tag those and then write/append
them to a spam mbox. I'd also sort by subject and rescan manually
for spam. It still took some time, but doing things this way
eased the pain.

I use a slightly different approach.

I filter my emails into 4 different IMAP folders: slightly-spammy, somewhat-spammy, pretty-spammy, and very-spammy. The filtering is based on an increasing number of SA hits (actually the X-Spam-Level: header, and the number of "*" characters in it).
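
As a rough illustration of that filtering rule, here is a minimal Python sketch that counts the "*" characters in X-Spam-Level and picks a folder. The star thresholds and the `folder_for` and `spam_level` helper names are assumptions made up for the example; the post does not say where the cut-offs are, and the real filtering is presumably done by the mail delivery setup rather than a script like this.

```python
from email import message_from_string

# Hypothetical thresholds (minimum star count, folder name), highest first.
# The original post does not give the actual cut-offs.
FOLDER_THRESHOLDS = [
    (15, "very-spammy"),
    (10, "pretty-spammy"),
    (7, "somewhat-spammy"),
    (5, "slightly-spammy"),
]


def spam_level(raw_message: str) -> int:
    """Count the '*' characters in the X-Spam-Level header (0 if absent)."""
    msg = message_from_string(raw_message)
    return (msg.get("X-Spam-Level") or "").count("*")


def folder_for(raw_message: str) -> str:
    """Return the folder for a message, or 'INBOX' below every threshold."""
    level = spam_level(raw_message)
    for minimum, folder in FOLDER_THRESHOLDS:
        if level >= minimum:
            return folder
    return "INBOX"
```

With these made-up thresholds, a message carrying eight stars would land in somewhat-spammy, and a message with no X-Spam-Level header stays in the inbox.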

I also have a to-learn folder, with a pair of subfolders: ham and spam, which are not automatically populated.

Any FNs (false negatives) that land in my inbox are manually moved to the to-learn.spam folder.

At regular intervals I do the following:

I scan through the slightly-spammy folder, copy any ham to the to-learn.ham folder, and move the original to my inbox. Any spam gets moved to the to-learn.spam folder.

The "somewhat-spammy" folder gets a quick look for the rare FP; I follow the same procedure for that folder as for "slightly-spammy". It's just quicker to scan, since FPs there are rare.

The last two folders are pretty high-scoring. I'm thinking of combining them, since they get treated the same. I do a quick look for FP messages (once in a great while, someone on debian-user posts from an IP that's in just about every RBL I use; otherwise there are no FPs). After handling them, I search for messages that *don't* hit BAYES_99 and move them into the to-learn.spam folder. The rest I just delete (this procedure gets plenty of spam as it is).
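
The "doesn't hit BAYES_99" search could be sketched along these lines in Python, assuming SpamAssassin has written its rule names into a `tests=` list in the X-Spam-Status header. The `rules_hit` and `needs_learning` names are hypothetical, and the exact header layout depends on the SpamAssassin version and configuration, so treat the parsing as an assumption to check against real headers.

```python
import re
from email import message_from_string


def rules_hit(raw_message: str) -> set:
    """Return the rule names listed after 'tests=' in X-Spam-Status."""
    msg = message_from_string(raw_message)
    status = msg.get("X-Spam-Status") or ""
    # The tests list is often folded across lines after a comma; rejoin it.
    status = re.sub(r",\s+", ",", status)
    match = re.search(r"tests=(\S*)", status)
    if not match:
        return set()
    return {name for name in match.group(1).split(",") if name}


def needs_learning(raw_message: str) -> bool:
    """True when the message did not hit BAYES_99 and is worth feeding to Bayes."""
    return "BAYES_99" not in rules_hit(raw_message)
```

Messages for which `needs_learning` returns True are the ones that would go to to-learn.spam; the rest already score at the top of the Bayes range and can be deleted.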

Eventually (haven't automated this step yet, since my IMAP server and SA server are on different boxes), I run sa-learn on my to-learn folders. After that, I move those messages into a corpus.
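
If the to-learn folders were available to the SA box as mbox files (an assumption; the post does not say how the IMAP server stores mail), the sa-learn step might be wrapped roughly like this. The `sa_learn_command` and `learn` helpers and any paths passed to them are made up for the example; `--ham`, `--spam`, and `--mbox` are standard sa-learn options.

```python
import subprocess


def sa_learn_command(kind: str, mbox_path: str) -> list:
    """Build the sa-learn argument list for one mbox file."""
    if kind not in ("ham", "spam"):
        raise ValueError("kind must be 'ham' or 'spam'")
    return ["sa-learn", "--" + kind, "--mbox", mbox_path]


def learn(kind: str, mbox_path: str) -> None:
    """Run sa-learn on the mbox file, raising if it exits non-zero."""
    subprocess.run(sa_learn_command(kind, mbox_path), check=True)
```

Calling `learn("spam", path)` and `learn("ham", path)` on copies of the two to-learn folders would cover the training step; moving the messages into the corpus afterwards is left out of the sketch.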

Most of the moving is done from Netscape, since I use a Windows machine for day-to-day work, but it would obviously work with any IMAP client.

--Rich


_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
