Re: sa-learn and "Caught" spams

Daniel Staal Sun, 01 Oct 2006 10:33:28 -0700

--As of September 28, 2006 11:05:35 AM -0700, Kelson is alleged to have said:

Daniel Staal wrote:

Depends on the setup.  For instance, given the explanations above, I'll
start a system to automatically learn from my 'checkspam' folder, but
not my 'highspam' folder.  I have procmail automatically sort my spam by
score, so I can pay extra attention to low-scoring spam.  (Which is more
likely to be ham which was misplaced than the high-scoring spam.)


So, since I *already* have them separated out, I can avoid the
double-check.  ;)


But the final score alone doesn't determine whether something gets
autolearned.

As Matt pointed out, there are a number of different factors, including
the mix of head/body tests and the current Bayes score -- and it acts on
what the score would have been if Bayes had been disabled.

So unless you've filtered on the "autolearn=(ham|spam|no)" tag in the
X-Spam-Status header, you could be missing some high-scoring spam that
hasn't already been learned.

You could probably filter your training folder to remove any messages
where X-Spam-Status contains "autolearn=spam" (assuming, of course, that
your server takes full control of that header).  That should be
relatively fast and cut down on the resources used to identify duplicates.
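A rough sketch of that filter in Python, for illustration only. It assumes the SpamAssassin 3.x header layout (an "autolearn=..." token somewhere in X-Spam-Status, possibly on a folded continuation line). The function names autolearn_tag and needs_training are made up for this example and are not part of SpamAssassin.

```python
import re
from email import message_from_string

# Assumption: the tag appears as "autolearn=<word>" inside X-Spam-Status.
AUTOLEARN_RE = re.compile(r"\bautolearn=([a-z]+)", re.IGNORECASE)


def autolearn_tag(raw_message):
    """Return the autolearn value ("spam", "ham", "no", ...) or None."""
    msg = message_from_string(raw_message)
    # A folded header keeps its line breaks; the regex search does not mind.
    status = str(msg.get("X-Spam-Status", ""))
    match = AUTOLEARN_RE.search(status)
    return match.group(1).lower() if match else None


def needs_training(raw_messages):
    """Keep only the messages that were not already autolearned as spam."""
    return [m for m in raw_messages if autolearn_tag(m) != "spam"]
```

Messages with no X-Spam-Status header at all are kept here, on the theory that nothing is known about them. This is only trustworthy if the server strips or overwrites any incoming X-Spam-Status header, as noted above.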


--As for the rest, it is mine.

Just as an update, since I'm seeing something interesting...

As an experiment, I set procmail to copy all the 'highspam' that I get that *doesn't* get autolearned to a separate folder, and have been attempting to train on that folder daily.
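The sorting logic is roughly the following, written as Python rather than the actual procmail recipes (which are not shown here). It assumes the message has already been tagged as spam, that the header carries a "score=" field, and that a score of exactly 10 counts as high. The folder name 'highspam-unlearned' and the function name route are hypothetical.

```python
import re

SCORE_RE = re.compile(r"\bscore=(-?\d+(?:\.\d+)?)")
AUTOLEARN_RE = re.compile(r"\bautolearn=(\w+)")
HIGH_SPAM_CUTOFF = 10.0  # the high-spam cut-off used on this system


def route(status_header):
    """Return the folders a spam-tagged message should be filed into."""
    score_match = SCORE_RE.search(status_header)
    if score_match is None:
        # Assumption: a message with no readable score gets a human look.
        return ["checkspam"]
    if float(score_match.group(1)) < HIGH_SPAM_CUTOFF:
        return ["checkspam"]
    folders = ["highspam"]
    learn = AUTOLEARN_RE.search(status_header)
    # Copy anything not autolearned as spam into the extra training folder.
    if learn is None or learn.group(1) != "spam":
        folders.append("highspam-unlearned")
    return folders
```

For example, a header value of "Yes, score=23.4 required=5.0 autolearn=no version=3.1.5" lands in both 'highspam' and 'highspam-unlearned', while the same header with autolearn=spam lands only in 'highspam'.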

I say 'attempting' because despite these *only* being the emails that had 'autolearn=no' and were definitely spam, in three days sa-learn has yet to see any useful tokens in one of these messages. Generally, upon examination, these messages already are receiving bayes scores of 99% or better, so it appears that the tokens found are already fully scored. (Though not all of them have had such high bayes scores.)

I'll be keeping it up for a while; three days isn't much of a test, after all. But at this point it appears extra training on messages with scores over 10 (my high-spam cut-off) doesn't actually do anything. All relevant tokens are already learned, at least in a fully-trained and well-tuned system.

Spam emails scored less than 10 do have a number of messages each day that have useful tokens, on my system. Which is to be expected, after all.


Just thought this might be of interest.

Daniel T. Staal

---------------------------------------------------------------
This email copyright the author.  Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes.  This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---------------------------------------------------------------
