On Sun, 20 Jul 2003, Daniel Carrera stated:
> From the SpamAssassin man page:
>
>     --file    Learn a file in RFC 822 format
>     --mbox    Learn a file in mbox format
>
> What is the RFC 822 format?  What is the mbox format?

--file learns a single file containing one email in RFC 822 format
(that is, um, an email). (Perhaps it should say RFC 2822 format, as
RFC 822 is obsoleted by it now.)

The mbox format is a series of RFC 822 mails, separated by lines matching
the regular expression "^From ", with any lines beginning "From " in
the body of the email escaped to ">From " to avoid being misinterpreted
as the boundary between emails. Because it involves changing the body of
some email (containing common English words!) in order to store it, it's
generally regarded as sucking.

> If I'm using mutt, are my mailboxes in the mbox format?

I think mutt can handle mbox *and* maildir storage; but if you have
one file with more than one mail in, it's almost certainly mbox format.

> Suppose I regularly add files to a mailbox called "spam" and another
> called "ham".  Can I run:
>
>     sa-learn --mbox --spam spam
>     sa-learn --mbox --ham ham
>
> Regularly without worrying about old emails being counted multiple times?

Yes; SA remembers the Message-IDs of mails that it's already learnt, and
doesn't learn them again.

> Or do I have to clear the mailbox after every time I use sa-learn?

No.

I have a two-pronged approach. Mail that slips past the filters I hit with
an `sa-learn --spam' by hand, and move by hand to my spam folder. (The same
is theoretically true for misclassified ham, but that's so rare for me that
I can't remember the last time I had to do it.)

Everything else gets handled by a triplet of cron jobs:

# Delete spam from my spam database and Bayes classifier that's more than six months old.
# (Such spam is of little use anywhere.)
17 3 * * * (for spam in `find /home/nix/Mail/spool/spambox -mtime +180 -type f`; do sa-learn --forget --single < "$spam"; rm -f "$spam"; done) >/dev/null 2>&1

# Re-educate SpamAssassin's Bayesian analyzer every month
0 10 6 * * sa-learn --ham --no-rebuild --dir /home/nix/Mail/spool/Mailbox; sa-learn --spam --no-rebuild --dir /home/nix/Mail/spool/spambox; sa-learn --rebuild

# Repopulate Bayes from its journal every day
11 8 * * * sa-learn --rebuild >/dev/null 2>&1

(If you are using mbox format, the first of these cron jobs is very
unlikely to be useful to you, as it requires maildir-style storage in
which every email is stored in a file of its own.)

> One more question: How do I activate/deactivate the Bayesian filter of
> SA?  I understand that Bayesian filters only work when you have a large
> sample of spam and ham.

It activates itself when it's been trained on >200 hams and >200 spams.
You almost certainly won't want to deactivate it (to do that, you'd have
to set the scores of all the BAYES_* rules to 0).

> How do the Bayesian features of SA relate to the scoring system?  Do you
> use either one or the other, or are both used in conjunction?

You use both.

The score for a mail in SA is presently computed as the sum of the scores
of all the rules that mail hits. The scores are derived by feeding the
results of rule hits over a large set of hand-tested spam and nonspam mail
to a genetic algorithm, and asking it to determine the rule scores that
would classify as much as possible of that mail correctly, with a heavy
bias towards misclassifying spam as ham rather than the other way around.

The Bayesian filter in SA returns a probability that a given piece of mail
is spam, given its previous training in the spam and ham categories. There
are rules that take that probability and yield a score; see
/usr/share/spamassassin/23_bayes.cf.

So Bayes serves to push up (or down) the score of candidate spam or ham,
and thus make it more spammy or more hammy.

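To make the "push up (or down)" idea concrete, here is a rough Python sketch of an additive score with a bucketed Bayes rule folded in. It is not SpamAssassin's actual code: the bucket boundaries, the rule names (BAYES_LOW and friends, SUBJ_ALL_CAPS, HTML_ONLY), all of the scores, and the threshold of 5.0 are invented for illustration. The real rules and scores live in 23_bayes.cf and vary between versions.

```python
# Hypothetical buckets: (upper bound, rule name, score). All values are
# invented for illustration and are NOT taken from 23_bayes.cf.
BAYES_BUCKETS = [
    (0.20, "BAYES_LOW", -2.0),   # probably ham: pulls the total down
    (0.80, "BAYES_MID", 0.0),    # undecided: leaves the total alone
    (1.00, "BAYES_HIGH", 3.0),   # probably spam: pushes the total up
]

REQUIRED_HITS = 5.0  # assumed spam threshold for this sketch


def bayes_rule(probability):
    """Map a Bayes spam probability onto the one bucket rule it hits."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    for upper, name, score in BAYES_BUCKETS:
        if probability < upper:
            return name, score
    # A probability of exactly 1.0 falls into the last bucket.
    return BAYES_BUCKETS[-1][1:]


def total_score(other_hits, probability):
    """Sum the scores of the other rules hit, plus the Bayes bucket score."""
    _, bayes_score = bayes_rule(probability)
    return sum(other_hits.values()) + bayes_score


# Two invented non-Bayes rule hits that leave a mail below the threshold.
hits = {"SUBJ_ALL_CAPS": 1.5, "HTML_ONLY": 2.0}

borderline = total_score(hits, 0.50)   # Bayes undecided
pushed_up = total_score(hits, 0.97)    # Bayes says spam
pushed_down = total_score(hits, 0.03)  # Bayes says ham

for label, score in [("undecided", borderline),
                     ("spammy", pushed_up),
                     ("hammy", pushed_down)]:
    verdict = "spam" if score >= REQUIRED_HITS else "ham"
    print(f"{label}: {score:.1f} -> {verdict}")
```

With these made-up numbers the same two rule hits end up either side of the threshold depending only on which bucket the Bayes probability falls into, which is the whole of the interaction between the two systems.
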
In some respects this system is probably too linear and `simpleminded' ---
the discrete division of Bayes probabilities into SA scores doesn't feel
right to me, for instance --- but it works reasonably well. :)

-- 
`We cannot get a new line down the pipe due to a blockage and we cannot
 dig up the road to clear the blockage because it is covered with the
 wrong type of tarmac.' --- British Telecom, via Mark Lowes