Re: [SAtalk] [Q] RFC 822 vs mbox formats.

Nix Mon, 21 Jul 2003 16:13:31 -0700

On Sun, 20 Jul 2003, Daniel Carrera stated:
> From the SpamAssassin man page:
> 
>       --file                            Learn a file in RFC 822 format
>       --mbox                            Learn a file in mbox format
> 
> What is the RFC 822 format?  What is the mbox format?


--file learns a single file containing one email in RFC 822 format (that
is, um, an email).

(Perhaps it should say RFC 2822 format, as RFC 822 has been obsoleted by
it.)

The mbox format is a series of RFC 822 mails, separated by lines matching
the regular expression "^From ", with any lines beginning "From " in the
body of the email escaped to ">From " to avoid being misinterpreted as
the boundary between emails.

Because it involves changing the body of some email (containing common
English words!) in order to store it, it's generally regarded as
sucking.
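
To make the separator and escaping rules concrete, here is a minimal
shell sketch. The function names and the demo file are made up for
illustration, and real mail software does all this for you (mbox variants
also differ in the details), so treat it as a toy rather than a reference:

```shell
#!/bin/sh
# Hypothetical helpers illustrating the mbox conventions described above.

# Escape body lines beginning "From " so they are not taken for separators.
# Reads a message body on stdin, writes the escaped body on stdout.
mbox_escape_body() {
    sed 's/^From />From /'
}

# Count the mails in an mbox file by counting its separator lines.
mbox_count() {
    grep -c '^From ' "$1"
}

# Build a two-mail mbox in a temporary file and count its mails.
demo=$(mktemp)
{
    echo 'From alice@example.org Mon Jul 21 16:13:31 2003'
    echo 'Subject: first'
    echo
    echo 'From here on it gets tricky.' | mbox_escape_body
    echo
    echo 'From bob@example.org Mon Jul 21 16:14:00 2003'
    echo 'Subject: second'
    echo
    echo 'Nothing to escape here.'
} > "$demo"

mbox_count "$demo"    # prints 2: the escaped body line is not a separator
rm -f "$demo"
```

Without the escaping step the count would come out as 3, which is
exactly the misinterpretation the ">From " hack exists to prevent.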

> If I'm using mutt, are my mailboxes in the mbox format?

I think mutt can handle mbox *and* maildir storage; but if you have one
file with more than one mail in it, it's almost certainly mbox format.

> Suppose I regularly add files to a mailbox called "spam" and another 
> called "ham".  Can I run:
> 
>        sa-learn --mbox  --spam  spam
>        sa-learn --mbox  --ham   ham
> 
> Regularly without worrying about old emails being counted multiple times?

Yes; SA remembers the Message-IDs of mails that it's already learnt, and
doesn't learn them again.
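
The idea is roughly the following. This is only a sketch of the
principle: SA keeps its own database of learnt mails, and the flat file,
the function names and the assumption that every mail carries a
Message-ID header are all mine, not SA's.

```shell
#!/bin/sh
# Sketch only: a flat file stands in for SA's record of learnt mails.
seen_db=$(mktemp)

# Print the Message-ID of the mail on stdin, looking at the headers only
# (everything before the first blank line).
message_id() {
    sed -n -e '/^$/q' -e 's/^Message-I[Dd]:[[:space:]]*//p'
}

# "Learn" the mail in file $1 unless its Message-ID was recorded already.
learn_once() {
    id=$(message_id < "$1")
    if grep -qxF -- "$id" "$seen_db"; then
        echo "skipped $id"
    else
        echo "$id" >> "$seen_db"
        echo "learnt $id"    # a real learner would update its tokens here
    fi
}

mail=$(mktemp)
printf 'Message-ID: <1@example.org>\nSubject: hi\n\nbody\n' > "$mail"
learn_once "$mail"    # prints: learnt <1@example.org>
learn_once "$mail"    # prints: skipped <1@example.org>
rm -f "$mail" "$seen_db"
```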

> Or do I have to clear the mailbox after every time I use sa-learn?

No.

I have a two-pronged approach. Mail that slips past the filters I hit
with an `sa-learn --spam' by hand, and move by hand to my spam
folder. (The same is theoretically true for misclassified ham, but
that's so rare for me that I can't remember the last time I had to do
it).

Everything else gets handled by a triplet of cron jobs:

# Delete spam from my spam database and Bayes classifier that's more than six months old.
# (Such spam is of little use anywhere.)

17 3 * * * (for spam in `find /home/nix/Mail/spool/spambox -mtime +180 -type f`; do sa-learn --forget --single < "$spam"; rm -f "$spam"; done) >/dev/null 2>&1

# Re-educate SpamAssassin's Bayesian analyzer every month

0 10 6 * * sa-learn --ham --no-rebuild --dir /home/nix/Mail/spool/Mailbox; sa-learn --spam --no-rebuild --dir /home/nix/Mail/spool/spambox; sa-learn --rebuild

# Repopulate Bayes from its journal every day

11 8 * * * sa-learn --rebuild >/dev/null 2>&1


(If you are using mbox format, the first of these cron jobs is very
unlikely to be useful to you, as it requires maildir-style storage in
which every email is stored in a file of its own.)

> One more question:  How do I activate/deactivate the Bayesian filter of 
> SA?  I understand that Bayesian filters only work when you have a large 
> sample of sample spam and ham.

It activates itself once it's been trained on >200 hams and >200 spams.
You almost certainly won't want to deactivate it (to do that, you'd have
to set the scores of all the BAYES_* rules to 0).

> How do the Bayesian features of SA relate to the scoring system?  Do you 
> use either one or the other, or are both used in conjunction?

You use both.

The score for a mail in SA is presently treated as the sum of a
set of scores. The scores are derived by feeding the results of rule
hits over a large set of hand-tested spam and nonspam mail to a genetic
algorithm, and asking it to determine the rule scores that would
classify as much as possible of that mail correctly, with a heavy bias
towards misclassifying spam as ham rather than the other way around.

The Bayesian filter in SA returns a probability that a given piece of
mail is spam, given its previous training in the spam and ham
categories. There are rules that take that probability and yield a
score; see /usr/share/spamassassin/23_bayes.cf.

So Bayes serves to push up (or down) the score of candidate spam or ham,
and thus make it more spammy or more hammy.
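
As a toy model of how the two fit together (the rule names, scores and
probability bands below are invented for illustration; the real ones live
in SA's rule files and are not reproduced here):

```shell
#!/bin/sh
# Sketch only: invented rule names, scores and Bayes bands.

# Map a Bayes probability (0..1) to a score using discrete bands.
bayes_score() {
    awk -v p="$1" 'BEGIN {
        if      (p < 0.10) s = -2.0
        else if (p < 0.50) s = -0.5
        else if (p < 0.90) s =  1.0
        else               s =  3.0
        printf "%.1f\n", s
    }'
}

# Sum the scores of the rules that hit: one "NAME SCORE" pair per line.
total_score() {
    awk '{ sum += $2 } END { printf "%.1f\n", sum }'
}

{
    echo "HYPOTHETICAL_SUBJECT_RULE 1.2"
    echo "HYPOTHETICAL_BODY_RULE 2.1"
    echo "HYPOTHETICAL_BAYES_BAND $(bayes_score 0.95)"
} | total_score    # prints 6.3
```

The Bayes result is just one more term in the sum: a high probability
adds a positive score and a low one subtracts.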


In some respects this system is probably too linear and `simpleminded'
--- the discrete division of Bayes probabilities into SA scores doesn't
feel right to me, for instance --- but it works reasonably well. :)

-- 
`We cannot get a new line down the pipe due to a blockage and we cannot
 dig up the road to clear the blockage because it is covered with the
 wrong type of tarmac.' --- British Telecom, via Mark Lowes

