Kevin,
I did skim bug 5503 earlier, but didn't understand it at first.
Knowing the history now, it makes a little more sense, although I'm
still fuzzy on why the value of "3" for the body and head points is
important.
It might be nice to have local.cf directives that let admins adjust
$required_body_points and $required_head_points in
AutoLearnThreshold.pm. That way, admins could tune this behavior
to allow more or less auto-learning (e.g., 1 body point and 2.5 head
points). Thoughts?
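To make the idea concrete, here is a rough Python model of the check as I understand it. It is not the plugin's actual Perl code. The RuleHit class, the function name, and the keyword arguments are all made up for illustration; the 12.0 default mirrors the stock bayes_auto_learn_threshold_spam value, and the two 3.0 defaults stand in for the hardcoded body and head requirements. It also assumes every hit is cleanly tagged as "head" or "body", which glosses over how SA really classifies and filters rules.

```python
from dataclasses import dataclass


@dataclass
class RuleHit:
    """One rule that fired on a message (hypothetical structure)."""
    name: str
    kind: str      # "head" or "body" (a simplification of SA's rule types)
    score: float


def should_autolearn_spam(hits, spam_threshold=12.0,
                          required_body_points=3.0,
                          required_head_points=3.0):
    """Return True if the message would be auto-learned as spam.

    The total must reach the spam threshold, and the body-only and
    head-only sums must each reach their own minimum.
    """
    body_points = sum(h.score for h in hits if h.kind == "body")
    head_points = sum(h.score for h in hits if h.kind == "head")
    total = sum(h.score for h in hits)
    return (total >= spam_threshold
            and body_points >= required_body_points
            and head_points >= required_head_points)


# A message with strong header evidence but weak body evidence.
hits = [
    RuleHit("HYPOTHETICAL_HEAD_RULE", "head", 10.0),
    RuleHit("HYPOTHETICAL_BODY_RULE", "body", 2.5),
]

default_result = should_autolearn_spam(hits)
relaxed_result = should_autolearn_spam(
    hits, required_body_points=1.0, required_head_points=2.5)
```

With the defaults, this message is skipped because it has only 2.5 body points; with the relaxed values from my example above, it would be learned.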
As for Bayes strategies (and without starting a flamewar), we just
started implementing an IMAP folder in everyone's mailbox called "Learn
As Spam", that gets processed through "sa-learn --spam". It sounds like
we may need to leave auto-learning to SA's defaults, and ask users to
put e-mails in "Learn As Spam" and "Learn As Non-Spam" folders. Perhaps
relying on out-of-the-box auto-learning, and tempering Bayes with
user-based learning, may yield positive results.
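For what it's worth, the folder processing could be sketched roughly like this in Python. The folder paths and function names are assumptions for illustration (a Maildir++ style layout is assumed; our real paths differ), and the only sa-learn options used are --spam and --ham.

```python
import subprocess
from pathlib import Path

# Hypothetical folder names; adjust to the real mailbox layout.
SPAM_FOLDER = ".Learn As Spam/cur"
HAM_FOLDER = ".Learn As Non-Spam/cur"


def sa_learn_command(folder, is_spam):
    """Build the sa-learn command line for one folder of messages."""
    flag = "--spam" if is_spam else "--ham"
    return ["sa-learn", flag, str(folder)]


def learn_from_mailbox(maildir, run=subprocess.run):
    """Feed a user's two training folders to sa-learn.

    `run` is injectable so the function can be exercised without
    sa-learn installed. Folders that do not exist are skipped.
    """
    maildir = Path(maildir)
    commands = []
    for sub, is_spam in ((SPAM_FOLDER, True), (HAM_FOLDER, False)):
        folder = maildir / sub
        if folder.is_dir():
            cmd = sa_learn_command(folder, is_spam)
            run(cmd, check=True)
            commands.append(cmd)
    return commands


example_cmd = sa_learn_command("/tmp/example/cur", True)
```

Something like this would run from cron per user; moving or deleting the messages after learning is left out of the sketch.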
Thanks again, Kevin and RW, for your input.
Sincerely,
John
On 11/05/14 06:40, Kevin A. McGrail wrote:
On 11/4/2014 6:06 PM, John Woods wrote:
Everyone,
We're having problems with auto-learning on v3.4.0 that we aren't
having on v3.3.2. The number of spam e-mails being auto-learned has
dropped significantly, and the amount of spam being let through
(false negatives) is higher as well. After looking through the
wiki and the code, I'm pretty sure this change is related to the rule
that says you must have 3 "body only" points and 3 "header only"
points, which are hardcoded values in
Mail::SpamAssassin::Plugin::AutoLearnThreshold. In 3.3.2, it looks
like body-points equals the head-points, and in 3.4.0, they are changed.
You are correct. There were changes and bugs found in the logic that
were resolved in 3.4.0. See
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5503
I've got a few questions:
1) How does SpamAssassin derive and sum the "body_only" and
"head_only" points? It doesn't look like the body_only points
correspond to any scores from individual tests.
There is a test_type flag. It was sometimes lost in previous parsing
of messages.
2) How can we affect the configuration, to increase the number of
spam e-mails being auto-learned?
3) Instead, do we need to completely change our strategy for how
we're using Bayes?
I will leave Bayes comments to other experts, but in general, I believe
you will find that some sort of non-automated learning will produce
better results. My concern with auto-learning is that you are just
self-perpetuating any flaws in the current classification, not really
helping to stop new and different spam. I will likely start a
flamewar if I continue discussing Bayes.
Perhaps you can buy a six pack for AXB and convince him to add his
$0.04 on Bayes. He's the resident expert.
regards,
KAM