--As of September 28, 2006 11:05:35 AM -0700, Kelson is alleged to have
said:
> Daniel Staal wrote:
>> Depends on the setup. For instance, given the explanations above, I'll
>> start a system to automatically learn from my 'checkspam' folder, but
>> not my 'highspam' folder. I have procmail automatically sort my spam by
>> score, so I can pay extra attention to low-scoring spam. (Which is more
>> likely to be ham which was misplaced than the high-scoring spam.)
>> So, since I *already* have them separated out, I can avoid the
>> double-check. ;)
>
> But the final score alone doesn't determine whether something gets
> autolearned.
>
> As Matt pointed out, there are a number of different factors, including
> the mix of head/body tests and the current Bayes score -- and it acts on
> what the score would have been if Bayes had been disabled.
>
> So unless you've filtered on the "autolearn=(ham|spam|no)" tag in the
> X-Spam-Status header, you could be missing some high-scoring spam that
> hasn't already been learned.
>
> You could probably filter your training folder to remove any messages
> where X-Spam-Status contains "autolearn=spam" (assuming, of course, that
> your server takes full control of that header). That should be
> relatively fast and cut down on the resources used to identify duplicates.
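
The filtering idea in the quoted text could be sketched roughly as follows,
using only Python's standard library. This is an illustration, not anything
SpamAssassin ships: the function names (autolearn_tag, needs_training,
messages_to_train) are hypothetical, and it assumes the X-Spam-Status header
was written by your own server and carries an "autolearn=..." field.

```python
import re
from email import message_from_string

# Matches the autolearn field inside an X-Spam-Status header value.
AUTOLEARN_RE = re.compile(r"\bautolearn=(\w+)")


def autolearn_tag(msg):
    """Return the autolearn value from X-Spam-Status, or None if absent."""
    status = str(msg.get("X-Spam-Status", ""))
    match = AUTOLEARN_RE.search(status)
    return match.group(1) if match else None


def needs_training(msg):
    """True unless the message was already autolearned as spam."""
    return autolearn_tag(msg) != "spam"


def messages_to_train(messages):
    """Keep only the messages that were not already autolearned as spam."""
    return [m for m in messages if needs_training(m)]


if __name__ == "__main__":
    # Hypothetical sample header, folded across two lines as mail servers do.
    raw = (
        "X-Spam-Status: Yes, score=15.2 required=5.0\n"
        "\ttests=BAYES_99 autolearn=spam version=3.1.5\n"
        "\n"
        "body\n"
    )
    print(needs_training(message_from_string(raw)))
```

Messages without the header at all are kept for training, on the assumption
that nothing has learned from them yet.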
--As for the rest, it is mine.
Just as an update, since I'm seeing something interesting...
As an experiment, I set procmail to copy all the 'highspam' that I get that
*doesn't* get autolearned to a separate folder, and have been attempting to
train on that folder daily.
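
For anyone who wants to try the same selection outside procmail, here is a
rough Python sketch of the test I mean: high-scoring spam that was tagged
'autolearn=no'. It is not my actual procmail recipe. The function name and the
regular expressions are made up for illustration, the cut-off of 10 is the
high-spam threshold I mention further down, and it assumes the score appears
in X-Spam-Status as "score=" (or "hits=" on older versions).

```python
import re
from email import message_from_string

HIGH_SPAM_CUTOFF = 10.0  # assumed threshold; adjust to your own setup

# Score and autolearn fields as they appear in an X-Spam-Status header value.
SCORE_RE = re.compile(r"\b(?:score|hits)=(-?\d+(?:\.\d+)?)")
AUTOLEARN_RE = re.compile(r"\bautolearn=(\w+)")


def is_unlearned_highspam(msg, cutoff=HIGH_SPAM_CUTOFF):
    """True if the message scored above cutoff and was tagged autolearn=no."""
    status = str(msg.get("X-Spam-Status", ""))
    score = SCORE_RE.search(status)
    tag = AUTOLEARN_RE.search(status)
    if score is None or tag is None:
        # Header missing or in an unexpected format: do not select it.
        return False
    return float(score.group(1)) > cutoff and tag.group(1) == "no"


if __name__ == "__main__":
    # Hypothetical sample message for demonstration only.
    raw = (
        "X-Spam-Status: Yes, score=15.2 required=5.0\n"
        "\ttests=BAYES_99 autolearn=no version=3.1.5\n"
        "\n"
        "body\n"
    )
    print(is_unlearned_highspam(message_from_string(raw)))
```

Anything this selects would then be copied to the separate training folder.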
I say 'attempting' because despite these being *only* the emails that had
'autolearn=no' and were definitely spam, in three days sa-learn has yet to
find any useful tokens in a single one of these messages. Generally, upon
examination, these messages are already receiving Bayes scores of 99% or
better, so it appears that the tokens found are already fully scored.
(Though not all of them have had such high Bayes scores.)
I'll be keeping it up for a while; three days isn't much of a test, after
all. But at this point it appears extra training on messages with scores
over 10 (my high-spam cut-off) doesn't actually do anything. All relevant
tokens are already learned, at least in a fully-trained and well-tuned
system.
Among spam emails scored less than 10, a number of messages each day do
have useful tokens on my system, which is to be expected, after all.
Just thought this might be of interest.
Daniel T. Staal
---------------------------------------------------------------
This email copyright the author. Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes. This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---------------------------------------------------------------