Hello all,

Here is a short summary of what we're trying to accomplish.

GOAL: To recognize junk mail with a site-wide SpamAssassin installation
PROGRAMS: Sendmail 8.12.10 + Miltrassassin + Spamd/SA 2.60
PROBLEM: SpamAssassin's Bayesian subsystem does not seem to behave as expected (?)

Our setup works perfectly except for one thing: SA's Bayesian system seems to do some strange things. We have fed SA lots of ham and spam, and the actual spam recognition works well. That is great and a reason to be happy. However, due to a large mail volume, we do not want to use automatic learning. Instead, we want to do manual updates to the site-wide databases every now and then.

The main configuration file /etc/mail/spamassassin/local.cf looks like this:

    rewrite_subject 0
    skip_rbl_checks 1
    report_safe 0
    use_bayes 1
    bayes_auto_expire 0
    bayes_journal_max_size 0
    bayes_path /var/spool/sa/bayes
    bayes_auto_learn 0
    dns_available yes

As you can see, we have instructed SA not to automatically expire tokens *and* not to try to "auto learn". But each time spamd starts, /var/spool/sa/bayes_journal is created, and for each message that passes through, the tokens in the message get written to the bayes_journal, making it grow larger and larger. I cannot see the point of it, because we have both "bayes_auto_learn 0" and "bayes_auto_expire 0". Can somebody explain why this happens? I read a reply in the archives that said something to the effect that this does not have to do with learning. What is the point of this behavior, then?

We have also set "bayes_journal_max_size 0", since "man sa-learn" explains the opportunistic syncing algorithm:

------------------------------------------------------------------------------
SpamAssassin can sync the journal and expire the DB tokens either manually or opportunistically.
A journal sync is due if --rebuild is passed to sa-learn (manual), or if the following is true (opportunistic):

  - bayes_journal_max_size does not equal 0 (means don't sync)
  - the journal file exists
  and either:
  - the journal file has a size greater than bayes_journal_max_size
  or
  - at least 1 day has passed since the last journal sync
------------------------------------------------------------------------------

However, the above statement is ambiguous to me. It could mean two things:

1. (P and Q and (R or S))
2. (P and Q and R) or S

Based on "man sa-learn", I would go for the first interpretation, especially because it says "means don't sync". Surprisingly, the manual page for Mail::SpamAssassin::Conf seems to support the second interpretation:

------------------------------------------------------------------------------
bayes_journal_max_size (default: 102400)
SpamAssassin will opportunistically sync the journal and the database. It will do so at least once a day, but can also sync if the file size goes above this setting, in bytes. If set to 0, the journal sync will only occur once a day.
------------------------------------------------------------------------------

This is quite confusing. According to "man sa-learn", setting bayes_journal_max_size to 0 means never sync, but this page claims that the journal sync will occur once a day no matter what.

Worse, whichever interpretation is correct, I suppose we still have a problem:

1. If the bayes_journal file never gets synced, then it just keeps on growing and growing. That is not desirable.
2. If the journal gets synced once a day, then the database will grow, and since we have "bayes_auto_expire 0", it will keep on growing and growing. This is not desirable either.

Here is another excerpt from "man sa-learn":

------------------------------------------------------------------------------
bayes_journal
While SpamAssassin is scanning mails, it needs to track which tokens it uses in its calculations.
So that many processes can read the databases simultaneously, but only one can write at a time, this uses a 'journal' file.
------------------------------------------------------------------------------

I cannot understand why the bayes_journal is created and written to in the first place, even when no process should have to write anything. As far as I can tell, in our configuration recognizing spam should be a matter of just reading the token databases that we have created by manually teaching SA. No process (except manual runs of sa-learn) should need to write or update anything.

Any explanations are very welcome. What is the correct way to configure SA to accomplish our goal?

Thanks a lot in advance.

Regards,
vmk

--
****************************************************************************
"Facts are stupid things" - Ronald Reagan
****************************************************************************
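
P.S. To make the two readings of the sync condition concrete, here is a small Python sketch. It assumes P = "bayes_journal_max_size does not equal 0", Q = "the journal file exists", R = "the journal is larger than bayes_journal_max_size" and S = "at least 1 day has passed since the last sync". That mapping is only my reading of the man page text, and the class and function names are made up for illustration; none of this is SpamAssassin code.

```python
from dataclasses import dataclass


@dataclass
class JournalState:
    """Hypothetical snapshot of the values the sync condition talks about."""
    max_size: int            # bayes_journal_max_size, in bytes
    exists: bool             # does the bayes_journal file exist?
    size: int                # current journal size, in bytes
    days_since_sync: float   # time since the last journal sync


def _conditions(j):
    # Assumed mapping of the man page clauses onto P, Q, R, S.
    p = j.max_size != 0
    q = j.exists
    r = j.size > j.max_size
    s = j.days_since_sync >= 1
    return p, q, r, s


def sync_due_reading_1(j):
    """Reading 1: P and Q and (R or S)."""
    p, q, r, s = _conditions(j)
    return p and q and (r or s)


def sync_due_reading_2(j):
    """Reading 2: (P and Q and R) or S."""
    p, q, r, s = _conditions(j)
    return (p and q and r) or s


# Our case: bayes_journal_max_size 0, a large journal, last sync two days ago.
ours = JournalState(max_size=0, exists=True, size=5_000_000, days_since_sync=2.0)
print(sync_due_reading_1(ours))  # False: the journal is never synced
print(sync_due_reading_2(ours))  # True: the journal is synced once a day
```

With our setting of 0, the two readings give opposite answers, which is exactly the ambiguity between the sa-learn and Mail::SpamAssassin::Conf manual pages described above.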