training SpamAssassin without updating bayes*

Gabriel Wachman Sat, 04 Mar 2006 18:54:10 -0800

A colleague and I are writing a paper about a spam filter he developed.
We'd like to compare it against various open source filters, including
SpamAssassin. The methodology we are using is to train the filter on a
set of messages, and then test it on an independent set of messages. The
key is that the filter cannot update itself at all after training.


In my user_prefs:
bayes_auto_learn        0
bayes_learn_during_report       0
bayes_path SOME_PATH

During training I run:
sa-learn --dbpath $WORKDIR --ham $DATADIR/$message_dir
(likewise for spam)

During testing I run:
spamassassin -t -p $PREFSPATH $DATADIR/$message_dir

I'm running several training and testing runs, so each one gets its own
database: I set "SOME_PATH" accordingly and point spamassassin at that
run's "user_prefs" with the -p switch, hence the variables in the
command-line arguments. The testing run that matches a given training
run must read the bayes_* files from that training run.
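
To keep each run's prefs file paired with its database, the prefs file can be generated from the run's work directory. This is only a rough sketch: the function name write_prefs and the location of user_prefs inside the work directory are made up for illustration. It also assumes that bayes_path takes a file prefix, so that "$WORKDIR/bayes" lines up with the bayes_* files that sa-learn creates when --dbpath is given a directory.

```shell
# Hypothetical helper (not part of SpamAssassin): write a per-run prefs
# file into the run's work directory.
write_prefs() {
    workdir=$1
    mkdir -p "$workdir" || return 1
    # Assumption: bayes_path is a prefix, so the database files become
    # $workdir/bayes_toks, $workdir/bayes_seen, and so on.
    cat > "$workdir/user_prefs" <<EOF
bayes_auto_learn        0
bayes_learn_during_report       0
bayes_path $workdir/bayes
EOF
}

# Intended use for one run (left commented out; paths are the variables
# from the commands above):
#   write_prefs "$WORKDIR"
#   sa-learn --dbpath "$WORKDIR" --ham "$DATADIR/$message_dir"
#   spamassassin -t -p "$WORKDIR/user_prefs" "$DATADIR/$message_dir"
```

With this layout, $PREFSPATH for a run is simply "$WORKDIR/user_prefs", so a testing run cannot accidentally pick up another run's database.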

During testing, I can see spamassassin create a "bayes_journal" file and
write to it continuously. I understand this is spamassassin's way of
storing its updates to bayes_* temporarily until the updates are merged.
My concern is that it's using bayes_journal in addition to bayes_toks
and bayes_seen during testing, but I just want it to use the bayes_toks
and bayes_seen generated during training.
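
As a partial check, the trained files can be checksummed before and after a testing run. This is a sketch only: snapshot_bayes is a made-up helper, and it assumes the database files sit directly in the work directory as bayes_toks and bayes_seen. It can show whether those two files were modified, but it says nothing about whether bayes_journal is consulted during classification.

```shell
# Hypothetical helper: print checksums of the trained database files.
# Assumption: the files are named bayes_toks and bayes_seen inside $1.
snapshot_bayes() {
    # cksum prints "checksum size filename" for each file
    cksum "$1"/bayes_toks "$1"/bayes_seen
}

# Intended use around a testing run (left commented out):
#   before=$(snapshot_bayes "$WORKDIR")
#   spamassassin -t -p "$PREFSPATH" "$DATADIR/$message_dir"
#   after=$(snapshot_bayes "$WORKDIR")
#   if [ "$before" = "$after" ]; then echo "unchanged"; else echo "modified"; fi
```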

Can someone tell me how to run spamassassin in testing mode, without
making any updates or doing any learning, but only classifying messages?

Thank you,
Gabriel
