[SAtalk] SA 2.60 bayes_journal oddities

Vesa-Matti J Kari Tue, 02 Dec 2003 06:52:43 -0800

Hello all,

Here is a short summary of what we're trying to accomplish.


  GOAL: To recognize junk mails with a site wide SpamAssassin installation
  PROGRAMS: Sendmail 8.12.10 + Miltrassassin + Spamd/SA 2.60
  PROBLEM: SpamAssassin's bayesian subsystem does not seem to behave as expected (?)

Our setup works perfectly except for one thing: SA's Bayesian system seems to
do some strange things. We have fed SA lots of ham and spam, and the actual
spam recognition works well, which we are happy about.

However, due to a large mail volume, we do not want to use automatic 
learning. Instead, we want to do manual updates to the site wide databases 
every now and then. The main configuration file /etc/mail/spamassassin/local.cf 
looks like this:

rewrite_subject 0
skip_rbl_checks 1
report_safe 0
use_bayes 1
bayes_auto_expire 0
bayes_journal_max_size 0
bayes_path /var/spool/sa/bayes
bayes_auto_learn 0
dns_available yes

As you can see, we have instructed SA not to automatically expire tokens *and*
not to try to "auto learn". But each time spamd starts, /var/spool/sa/bayes_journal
is created, and for each message that passes through, the tokens in the message
get written to the bayes_journal, making it grow larger and larger. I cannot see the
point of this, because we have both "bayes_auto_learn 0" and "bayes_auto_expire 0".
Can somebody explain why this happens? I read a reply in the archives that
said something to the effect that this has nothing to do with learning.
What is the point of this behavior, then?

We have also set "bayes_journal_max_size 0", since "man sa-learn"
explains the opportunistic syncing algorithm as follows:

------------------------------------------------------------------------------

       SpamAssassin can sync the journal and expire the DB tokens either
       manually or opportunistically.  A journal sync is due if --rebuild is
       passed to sa-learn (manual), or if the following is true
       (opportunistic):

       - bayes_journal_max_size does not equal 0 (means don't sync)
       - the journal file exists

       and either:

       - the journal file has a size greater than bayes_journal_max_size

       or

       - at least 1 day has passed since the last journal sync

------------------------------------------------------------------------------

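However, the above statement is ambiguous to me. Let P be
"bayes_journal_max_size does not equal 0", Q be "the journal file exists",
R be "the journal file is larger than bayes_journal_max_size" and S be
"at least 1 day has passed since the last journal sync". The text could
then mean two things:

1. (P and Q and (R or S))

2. (P and Q and R) or S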
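
To make the difference concrete, here is a minimal Python sketch of the two
readings. It is only my own illustration, not SpamAssassin code: the function
and parameter names are made up, and P, Q, R and S stand for the four
conditions in the order the man page lists them (max size is nonzero, journal
exists, journal is larger than max size, at least one day since last sync).

```python
def sync_due_reading_1(max_size, journal_exists, journal_size, days_since_sync):
    """Reading 1: P and Q and (R or S)."""
    p = max_size != 0               # P: bayes_journal_max_size is not 0
    q = journal_exists              # Q: the journal file exists
    r = journal_size > max_size     # R: journal is larger than the limit
    s = days_since_sync >= 1        # S: at least one day since last sync
    return p and q and (r or s)


def sync_due_reading_2(max_size, journal_exists, journal_size, days_since_sync):
    """Reading 2: (P and Q and R) or S."""
    p = max_size != 0
    q = journal_exists
    r = journal_size > max_size
    s = days_since_sync >= 1
    return (p and q and r) or s


# Our case: bayes_journal_max_size 0, a large journal, last sync two days ago.
print(sync_due_reading_1(0, True, 500000, 2))   # False: never synced
print(sync_due_reading_2(0, True, 500000, 2))   # True: synced once a day
```

With bayes_journal_max_size set to 0 the two readings disagree exactly when a
day has passed, which is the case that matters for us.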

Based on "man sa-learn", I would go for the first interpretation,
especially because it says "means don't sync". Surprisingly, the
Mail::SpamAssassin::Conf manual page seems to support the second
interpretation:

------------------------------------------------------------------------------
 bayes_journal_max_size        (default: 102400)
           SpamAssassin will opportunistically sync the journal and the
           database.  It will do so at least once a day, but can also sync if
           the file size goes above this setting, in bytes.  If set to 0, the
           journal sync will only occur once a day.

------------------------------------------------------------------------------

This is quite confusing. According to "man sa-learn", setting
bayes_journal_max_size to 0 means never sync, but this page claims that the
journal sync will still occur once a day no matter what.

Worse, whichever interpretation is correct, I suppose we still have a problem:

1. If the bayes_journal file never gets synced, then it just keeps on 
   growing and growing. That is not desirable. 

2. If the journal gets synced once a day, then the database will grow 
   and since we have "bayes_auto_expire 0", it will keep on growing and
   growing. This is not desirable either.

Here is another excerpt from "man sa-learn":

------------------------------------------------------------------------------
bayes_journal
           While SpamAssassin is scanning mails, it needs to track which
           tokens it uses in its calculations.  So that many processes can
           read the databases simultaneously, but only one can write at a
           time, this uses a 'journal' file.
------------------------------------------------------------------------------

I cannot understand why the bayes_journal is created and written to in the first
place when no process should have to write anything. As far as I can
tell, in our configuration recognizing spam should be a matter of just
reading the token databases that we have created by manually teaching SA.
No process (except manual runs of sa-learn) should need to
write or update anything. Any explanations are very welcome.

What is the correct way to configure SA to accomplish our goal? Thanks a 
lot in advance.

Regards,
vmk
-- 
****************************************************************************
                 "Facts are stupid things" - Ronald Reagan
****************************************************************************

