Auto-learn and auto-whitelist use different scoring criteria from those used in spamassassin's spam filtering.
IMO, this is a serious mistake. In the long run, it means that the bayesian and whitelist algorithms will simply reinforce whatever errors are made by the feature-based classifier.

I have seen this behaviour in my own mail system. In one case I received several copies of spam with the same return address. Although it was correctly classified (for filtering purposes), the whitelist classifier learned it with a lower score and eventually let the spam through. Similar errors occur with the bayesian classifier. The net result is learned false negatives and false positives over which the user has no control.

I no longer use auto-whitelist, and I have hacked my personal version of spamassassin 2.60 to use the same criteria for auto-learn as for filtering. Then I religiously use sa-learn to handle the (vanishingly small) number of misclassified messages. (I've had 3 false positives in 1700 spams, 3500 hams. Interestingly, my most recent was the acknowledgement of my subscription to this mailing list.)

The rationale for spamassassin's behaviour is, I think, the fear that in unsupervised mode it will go off track. Perhaps there should be a user flag, "supervised/unsupervised", that determines whether or not the same criteria are used for filtering and learning. In "supervised" mode the learner should use the same criteria as the filter; otherwise the learner cannot be properly trained.
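To make the mismatch concrete, here is a toy model (not SpamAssassin's actual code, and in Python rather than Perl) of filtering and auto-learning with separate thresholds. The cutoff values are illustrative assumptions, not the 2.60 defaults; the point is the "blind spot" they create between them.

```python
# Toy illustration of separate criteria for filtering vs. learning.
# The thresholds below are hypothetical, chosen only to show the gap.

FILTER_THRESHOLD = 5.0       # score at which the filter calls a message spam
LEARN_SPAM_THRESHOLD = 12.0  # stricter cutoff before auto-learning it as spam
LEARN_HAM_THRESHOLD = 0.1    # cutoff below which it is auto-learned as ham

def filter_verdict(score):
    """What the user sees: spam or ham."""
    return "spam" if score >= FILTER_THRESHOLD else "ham"

def autolearn_verdict(score):
    """What the learner is told, under separate (stricter) criteria."""
    if score >= LEARN_SPAM_THRESHOLD:
        return "spam"
    if score <= LEARN_HAM_THRESHOLD:
        return "ham"
    return None  # blind spot: the learner never sees this message

# Borderline spam scoring 6.0 or 9.0 is filtered as spam yet never
# learned, so repeated copies cannot raise the learned score -- and a
# whitelist that averages in that sub-threshold history can eventually
# let the spam through.
blind = [s for s in (0.05, 3.0, 6.0, 9.0, 13.0)
         if filter_verdict(s) == "spam" and autolearn_verdict(s) is None]
print(blind)  # -> [6.0, 9.0]
```

Using the same criteria for both, as the hack below does, empties that blind-spot region, so every filtering decision is also a training decision (right or wrong, which is why sa-learn supervision still matters).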
The following is a context diff for PerMsgStatus.pm that implements my hack in spamassassin 2.60-cvs:

*** PerMsgStatus.pm.orig	Mon Jun  9 12:41:10 2003
--- PerMsgStatus.pm	Mon Jun  9 12:44:29 2003
***************
*** 193,204 ****
  # Do meta rules second-to-last
  $self->do_meta_tests();

- # auto-learning
- $self->learn();
-
  # add points from Bayes, before adjusting the AWL
  $self->{hits} += $self->{learned_hits};

  # Do AWL tests last, since these need the score to have already been
  # calculated
  $self->do_awl_tests();
--- 193,204 ----
  # Do meta rules second-to-last
  $self->do_meta_tests();

  # add points from Bayes, before adjusting the AWL
  $self->{hits} += $self->{learned_hits};

+ # auto-learning
+ $self->learn();
+
  # Do AWL tests last, since these need the score to have already been
  # calculated
  $self->do_awl_tests();
***************
*** 272,278 ****
  # autolearn on and it gets really wierd. - tvd
  my $hits = 0;
  my $orig_scoreset = $self->{conf}->get_score_set();
! if ( ($orig_scoreset & 2) == 0 ) { # we don't need to recompute
    dbg ("auto-learn: currently using scoreset $orig_scoreset. no need to recompute.");
    $hits = $self->{hits};
  }
--- 272,278 ----
  # autolearn on and it gets really wierd. - tvd
  my $hits = 0;
  my $orig_scoreset = $self->{conf}->get_score_set();
! if ( 1 || ($orig_scoreset & 2) == 0 ) { # we don't need to recompute
    dbg ("auto-learn: currently using scoreset $orig_scoreset. no need to recompute.");
    $hits = $self->{hits};
  }
***************
*** 302,309 ****
  }

  if ($isspam) {
!   my $required_body_hits = 3;
!   my $required_head_hits = 3;

    if ($self->{body_only_hits} < $required_body_hits) {
      dbg ("auto-learn? no: too few body hits (".
--- 302,309 ----
  }

  if ($isspam) {
!   my $required_body_hits = 0;
!   my $required_head_hits = 0;

    if ($self->{body_only_hits} < $required_body_hits) {
      dbg ("auto-learn? no: too few body hits (".

-- 
Gordon V. Cormack
CS Dept, University of Waterloo, Canada N2L 3G1
[EMAIL PROTECTED]
http://cormack.uwaterloo.ca/cormack

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk