On Mon, 29 Nov 2004 12:01:02 -0900, "Andy Firman" <[EMAIL PROTECTED]> said:
> I just started using Spamassasin 3.0 and am very
> impressed with it. Recently, on an old server that I
> just started to manage, I just found a spam
> infested mbox spool file with 15,000 spams in it. (52MB)
> Nobody had checked the mailbox in about 10 months.
>
> Is it a good idea to run sa-learn on this giant spam
> mbox file on other servers that I get SA 3.0 installed on?
>
> Or no?
>
Unless the address has never been used by a real person, you should manually check each message to see whether it's spam. Personally, I never have the endurance to check more than about 500 messages at a shot. So I'd just cut it into files of a size I could manually verify without bleeding from the eyes, delete any hammy-looking stuff I find in each file as I go through it, and then save the verified files and use those for Bayes training.

It would be safe to do what you propose if the account is one that you are certain will never receive legit mail, but old mail accounts *will* still get the occasional legit message. "Hey Bob, why haven't I heard from you in the past eight months? Here's all our new customer info..."

For ongoing Bayes training, I have two IMAP folders that I copy messages into, one for ham and one for spam. Any spams scoring less than 10 get manually copied into the spam folder (the rest of the spam is rejected at the mail gateway). Periodically I run through a bunch of recent ham and copy it into the ham folder. A nightly script cleans out those IMAP folders, runs sa-learn on the messages, and copies them into ham/spam folders on the server, so I can use those if I need a corpus of manually verified messages.

-- 
snowjack(a)fastmail.fm
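
For the cut-it-into-files step described above, here is a minimal sketch using Python's standard mailbox module. It is one possible way to do it, not the script I actually run: the function name split_mbox, the chunk-NNNN.mbox file naming, and the default of 500 messages per file are all made up for illustration.

```python
import mailbox
from pathlib import Path


def split_mbox(source, out_dir, chunk_size=500):
    """Copy messages from the mbox file `source` into numbered mbox files
    holding at most `chunk_size` messages each.

    Hypothetical helper: the name and the chunk-NNNN.mbox naming scheme are
    assumptions for this sketch. Returns the list of chunk paths created.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # create=False makes a missing spool file an error instead of
    # silently creating an empty mailbox.
    src = mailbox.mbox(str(source), create=False)
    paths = []
    chunk = None
    try:
        for index, message in enumerate(src):
            if index % chunk_size == 0:
                # Start a new chunk file every `chunk_size` messages.
                if chunk is not None:
                    chunk.close()
                path = out_dir / f"chunk-{index // chunk_size + 1:04d}.mbox"
                # Note: if this file already exists, messages are appended.
                chunk = mailbox.mbox(str(path))
                paths.append(path)
            chunk.add(message)
    finally:
        if chunk is not None:
            chunk.close()  # close() flushes pending messages to disk
        src.close()
    return paths
```

The source spool is only read, never modified. Once a chunk has been checked by hand and any ham removed, it can be learned with something like `sa-learn --spam --mbox chunk-0001.mbox`.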