--As of September 28, 2006 11:05:35 AM -0700, Kelson is alleged to have said:

Daniel Staal wrote:
Depends on the setup.  For instance, given the explanations above, I'll
start a system to automatically learn from my 'checkspam' folder, but
not my 'highspam' folder.  I have procmail automatically sort my spam by
score, so I can pay extra attention to low-scoring spam.  (Which is more
likely than the high-scoring spam to be misfiled ham.)

So, since I *already* have them separated out, I can avoid the
double-check.  ;)

But the final score alone doesn't determine whether something gets
autolearned.

As Matt pointed out, there are a number of different factors, including
the mix of head/body tests and the current Bayes score -- and it acts on
what the score would have been if Bayes had been disabled.

So unless you've filtered on the "autolearn=(ham|spam|no)" tag in the
X-Spam-Status header, you could be missing some high-scoring spam that
hasn't already been learned.

You could probably filter your training folder to remove any messages
where X-Spam-Status contains "autolearn=spam" (assuming, of course, that
your server takes full control of that header).  That should be
relatively fast and cut down on the resources used to identify duplicates.
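
A rough Python sketch of that filter, for illustration only.  It assumes
each message is available as raw text and that X-Spam-Status carries an
"autolearn=<value>" tag written by your own server; the function names
autolearn_tag and needs_training are made up for this example.

```python
import re
from email import message_from_string

# Matches the autolearn tag inside the X-Spam-Status header value.
AUTOLEARN_RE = re.compile(r"autolearn=(\w+)")


def autolearn_tag(raw_message):
    """Return the autolearn value (e.g. 'spam', 'ham', 'no'), or None."""
    msg = message_from_string(raw_message)
    # The header is often folded over several lines; search() copes with that.
    status = msg.get("X-Spam-Status", "")
    match = AUTOLEARN_RE.search(status)
    return match.group(1) if match else None


def needs_training(raw_messages):
    """Drop messages already autolearned as spam; keep everything else."""
    return [m for m in raw_messages if autolearn_tag(m) != "spam"]
```

Only the messages returned by needs_training would then have to be handed
to sa-learn.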

--As for the rest, it is mine.

Just as an update, since I'm seeing something interesting...

As an experiment, I set procmail to copy all the 'highspam' that I get that *doesn't* get autolearned to a separate folder, and have been attempting to train on that folder daily.
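
For anyone who wants to try the same split outside procmail, here is a minimal Python sketch of the sorting logic. It is not my actual recipe. It assumes the X-Spam-Status value starts with "Yes" for spam and carries "score=" (or "hits=" on older versions) and "autolearn=" fields; the function name folders_for and the folder name 'highspam-unlearned' are invented for the example.

```python
import re

# Fields assumed to be present in the X-Spam-Status header value.
SCORE_RE = re.compile(r"(?:score|hits)=(-?\d+(?:\.\d+)?)")
AUTOLEARN_RE = re.compile(r"autolearn=(\w+)")


def folders_for(status, high_cutoff=10.0):
    """Return the folders a message goes to, given its X-Spam-Status value."""
    if not status.lstrip().lower().startswith("yes"):
        return []  # not tagged as spam: leave it in the inbox
    score_match = SCORE_RE.search(status)
    if score_match is None:
        return ["checkspam"]  # no score found: review it by hand
    if float(score_match.group(1)) <= high_cutoff:
        return ["checkspam"]  # low-scoring spam gets the extra attention
    folders = ["highspam"]
    learn = AUTOLEARN_RE.search(status)
    if learn is not None and learn.group(1) == "no":
        # High-scoring spam that was not autolearned: keep a copy to train on.
        folders.append("highspam-unlearned")
    return folders
```

The cut-off of 10 matches my own setup; a message scoring exactly 10 is treated as low-scoring here, which is an arbitrary choice.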

I say 'attempting' because, despite these being *only* the emails that had 'autolearn=no' and were definitely spam, in three days sa-learn has yet to find any useful tokens in even one of these messages. On examination, these messages are generally already receiving Bayes scores of 99% or better, so it appears that the tokens found are already fully scored. (Though not all of them have had such high Bayes scores.)

I'll be keeping it up for a while; three days isn't much of a test, after all. But at this point it appears extra training on messages with scores over 10 (my high-spam cut-off) doesn't actually do anything. All relevant tokens are already learned, at least in a fully-trained and well-tuned system.

Spam scoring less than 10 is a different matter: on my system, a number of those messages each day do have useful tokens. Which is to be expected, after all.

Just thought this might be of interest.

Daniel T. Staal

---------------------------------------------------------------
This email copyright the author.  Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes.  This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---------------------------------------------------------------
