[SAtalk] autolearn/autowhitelist misguided

Gordon Cormack Sat, 21 Jun 2003 18:03:30 -0700

Auto-learn and auto-whitelist use different scoring criteria from those
used in spamassassin's spam filtering.


IMO, this is a serious mistake.  In the long run, it means that the
bayesian and whitelist algorithms will simply reinforce whatever errors
are made by the feature-based classifier.

I have seen this behaviour in my own mail system.  In one case I received
several copies of spam with the same return address.  Although it was
correctly classified (for filtering purposes), the whitelist classifier
learned it with a lower score and eventually let the spam through.
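
To make the failure mode concrete, here is a minimal Python sketch (not
SpamAssassin's Perl code).  It assumes the whitelist pulls each new score
part of the way toward the sender's historical mean; the threshold, the
factor and both scores are hypothetical values chosen for illustration.

```python
THRESHOLD = 5.0   # assumed spam threshold
FACTOR = 0.5      # assumed weight given to the sender's history


def awl_adjust(score, history, factor=FACTOR):
    """Pull the score toward the mean of the sender's recorded scores."""
    if not history:
        return score
    mean = sum(history) / len(history)
    return score + (mean - score) * factor


filter_score = 6.0     # hypothetical score used for filtering
recorded_score = 3.0   # hypothetical lower score stored by the whitelist

history = []
verdicts = []
for _ in range(4):     # four copies of the same spam, same return address
    final = awl_adjust(filter_score, history)
    verdicts.append(final >= THRESHOLD)   # True means "flagged as spam"
    history.append(recorded_score)

print(verdicts)
```

Under these assumptions the first copy is flagged, but every later copy
is pulled below the threshold because the recorded history is lower than
the score the filter actually used.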

Similar errors occur with the bayesian classifier.  The net result is
learned false negatives and false positives over which the user has no 
control.

I no longer use autowhitelist and I have hacked my personal version of
spamassassin 2.60 to use the same criteria for autolearn as for filtering.
Then I religiously use sa-learn to handle the (vanishingly small) number
of misclassified messages.  (I've had 3 false positives in 1700 spams,
3500 hams.  Interestingly, my most recent was the acknowledgement to
my subscription to this mailing list.)

The rationale for spamassassin's behaviour is, I think, the fear that
in unsupervised mode it will go off track.  Perhaps there should be a user
flag "supervised/unsupervised" that determines whether or not the same
criteria are used for filtering and learning.  In "supervised" mode 
the learner should use the same criteria as the filter.  Otherwise the
learner cannot be properly trained.
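
A rough Python sketch of what such a flag might look like follows.  The
function name, the mode strings and all threshold values are hypothetical,
and the sketch simplifies the unsupervised case to a pair of stricter
thresholds applied to a single score.

```python
def autolearn_label(score, mode, required=5.0,
                    ham_below=0.1, spam_above=12.0):
    """Return 'spam', 'ham', or None (do not learn) for one message.

    All numeric defaults are illustrative assumptions.
    """
    if mode == "supervised":
        # Learn exactly what the filter decided; the user corrects
        # mistakes by hand afterwards.
        return "spam" if score >= required else "ham"
    if mode == "unsupervised":
        # Learn only from messages far from the decision boundary.
        if score >= spam_above:
            return "spam"
        if score < ham_below:
            return "ham"
        return None
    raise ValueError("unknown mode: " + mode)


print(autolearn_label(6.0, "supervised"))    # learned as spam
print(autolearn_label(6.0, "unsupervised"))  # not learned at all
```

In supervised mode the learner and the filter can never disagree, so
every correction made by the user is a correction to both.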

The following is a context diff for PerMsgStatus.pm that implements my hack
in spamassassin 2.60-cvs:

--

*** PerMsgStatus.pm.orig        Mon Jun  9 12:41:10 2003
--- PerMsgStatus.pm     Mon Jun  9 12:44:29 2003
***************
*** 193,204 ****
      # Do meta rules second-to-last
      $self->do_meta_tests();
  
-     # auto-learning
-     $self->learn();
- 
      # add points from Bayes, before adjusting the AWL
      $self->{hits} += $self->{learned_hits};
  
      # Do AWL tests last, since these need the score to have already been
      # calculated
      $self->do_awl_tests();
--- 193,204 ----
      # Do meta rules second-to-last
      $self->do_meta_tests();
  
      # add points from Bayes, before adjusting the AWL
      $self->{hits} += $self->{learned_hits};
  
+     # auto-learning
+     $self->learn();
+ 
      # Do AWL tests last, since these need the score to have already been
      # calculated
      $self->do_awl_tests();
***************
*** 272,278 ****
    # autolearn on and it gets really wierd.  - tvd
    my $hits = 0;
    my $orig_scoreset = $self->{conf}->get_score_set();
!   if ( ($orig_scoreset & 2) == 0 ) { # we don't need to recompute
      dbg ("auto-learn: currently using scoreset $orig_scoreset.  no need to recompute.");
      $hits = $self->{hits};
    }
--- 272,278 ----
    # autolearn on and it gets really wierd.  - tvd
    my $hits = 0;
    my $orig_scoreset = $self->{conf}->get_score_set();
!   if ( 1 || ($orig_scoreset & 2) == 0 ) { # we don't need to recompute
      dbg ("auto-learn: currently using scoreset $orig_scoreset.  no need to recompute.");
      $hits = $self->{hits};
    }
***************
*** 302,309 ****
    }
  
    if ($isspam) {
!     my $required_body_hits = 3;
!     my $required_head_hits = 3;
  
      if ($self->{body_only_hits} < $required_body_hits) {
        dbg ("auto-learn? no: too few body hits (".
--- 302,309 ----
    }
  
    if ($isspam) {
!     my $required_body_hits = 0;
!     my $required_head_hits = 0;
  
      if ($self->{body_only_hits} < $required_body_hits) {
        dbg ("auto-learn? no: too few body hits (".
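
The effect of the second hunk can be sketched in Python as follows.  This
assumes, from the `& 2` test alone, that bit value 2 of the score set
marks Bayes scores as being in use; the function and parameter names are
hypothetical and do not appear in PerMsgStatus.pm.

```python
BAYES_BIT = 2  # assumption: this bit marks a score set that includes Bayes


def needs_recompute(scoreset, force_same_criteria=False):
    """Decide whether autolearn rescores the message with another score set."""
    if force_same_criteria:
        # Mirrors the patched condition `1 || ...`: always reuse the
        # score that was computed for filtering.
        return False
    # Unpatched behaviour: recompute only when the Bayes bit is set.
    return (scoreset & BAYES_BIT) != 0


print([needs_recompute(s) for s in range(4)])
print([needs_recompute(s, force_same_criteria=True) for s in range(4)])
```

With the patch applied, the learner always sees the same score as the
filter, whatever score set is active.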

-- 
Gordon V. Cormack     CS Dept, University of Waterloo, Canada N2L 3G1
[EMAIL PROTECTED]            http://cormack.uwaterloo.ca/cormack