Re: sa-learn on a 15,000 email mbox file?

snowjack 29 Nov 2004 21:24:10 -0000

On Mon, 29 Nov 2004 12:01:02 -0900, "Andy Firman" <[EMAIL PROTECTED]> said:
> I just started using SpamAssassin 3.0 and am very
> impressed with it. Recently, on an old server that I
> just started to manage, I found a spam-infested
> mbox spool file with 15,000 spams in it (52MB).
> Nobody had checked the mailbox in about 10 months.
> 
> Is it a good idea to run sa-learn on this giant spam
> mbox file on other servers that I get SA 3.0 installed on?
> 
> Or no?
>


Unless the address has never been used by a real person, you should
manually check each message to see whether it's spam. Personally, I
never have the endurance to check more than about 500 messages at a
shot. So I'd just cut it into files of a size I could manually verify
without bleeding from the eyes, delete any hammy-looking stuff I find in
each file as I go through it, and then save the verified files and use
those for Bayes training.
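
For what it's worth, the "cut it into files" step could be sketched roughly like this with Python's standard mailbox module. This is only a sketch: the function name split_mbox, the chunk-NNNN.mbox file names, and the default of 500 messages per file are my own placeholders, not anything SpamAssassin provides.

```python
import mailbox
from pathlib import Path


def split_mbox(src, out_dir, chunk_size=500):
    """Copy messages from the mbox at `src` into numbered mbox files
    holding at most `chunk_size` messages each.

    Returns the list of chunk paths in the order they were written.
    The source file is only read, never modified.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    source = mailbox.mbox(str(src), create=False)
    chunk = None
    try:
        for index, message in enumerate(source):
            # Start a new chunk file every `chunk_size` messages.
            if index % chunk_size == 0:
                if chunk is not None:
                    chunk.close()
                # Hypothetical naming scheme: chunk-0000.mbox, chunk-0001.mbox, ...
                path = out_dir / f"chunk-{index // chunk_size:04d}.mbox"
                chunk = mailbox.mbox(str(path))
                written.append(path)
            chunk.add(message)
    finally:
        if chunk is not None:
            chunk.close()
        source.close()
    return written
```

Run it into an empty output directory, since an existing chunk file would be appended to. After going through a chunk by hand and deleting the ham, it can be fed to sa-learn with something like "sa-learn --spam --mbox chunk-0000.mbox".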

It would be safe to do what you propose if the account is one that you
are certain will never receive legit mail, but old mail accounts *will*
still get the occasional legit message. "Hey Bob, why haven't I heard
from you in the past eight months? Here's all our new customer info..."

For ongoing Bayes training, I have two IMAP folders that I copy messages
into, one for ham and one for spam. Any spams scoring less than 10 get
manually copied into the spam folder (the rest of the spam is rejected
at the mail gateway). Periodically I run through a bunch of recent ham
and copy it into the ham folder. A nightly script cleans out those IMAP
folders, runs sa-learn on the messages, and copies them into ham/spam
folders on the server, so I can use those if I need a corpus of manually
verified messages.
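
A rough sketch of what such a nightly job might look like, in Python. This is not my actual script. It assumes each folder is a plain directory on disk with one message per file (for example the cur/ directory of a Maildir), and the function names and the archive layout are made up for illustration. The only SpamAssassin pieces it relies on are the sa-learn --ham and --spam options.

```python
import shutil
import subprocess
from pathlib import Path


def sa_learn_command(kind, folder):
    """Build the sa-learn command line for a directory of messages."""
    if kind not in ("ham", "spam"):
        raise ValueError(f"kind must be 'ham' or 'spam', got {kind!r}")
    return ["sa-learn", f"--{kind}", str(folder)]


def train_and_archive(kind, inbox, archive, run=subprocess.run):
    """Train Bayes on every message file in `inbox`, then move the
    files into `archive` so the inbox is empty for the next night.

    `run` is injectable so the function can be exercised without
    sa-learn installed. Returns the number of messages processed.
    """
    inbox, archive = Path(inbox), Path(archive)
    messages = sorted(p for p in inbox.iterdir() if p.is_file())
    if not messages:
        return 0
    # check=True raises if sa-learn fails, so nothing is moved in that case.
    run(sa_learn_command(kind, inbox), check=True)
    archive.mkdir(parents=True, exist_ok=True)
    for path in messages:
        shutil.move(str(path), str(archive / path.name))
    return len(messages)
```

Called once with "ham" and once with "spam" from cron, this keeps the training folders empty and builds up the manually verified corpus in the archive directories.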
--
  
  snowjack(a)fastmail.fm
