Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

John Hardin Fri, 01 Feb 2013 16:59:00 -0800

On Sat, 2 Feb 2013, RW wrote:

ALLOWING APPENDS
   By appends we mean the case of mail moving when the source folder is
   unknown, e.g. when you move from some other account or with tools
   like offlineimap. You should be careful with allowing APPENDs to
   SPAM folders. The reason for possibly allowing it is to allow
   not-SPAM --> SPAM transitions to work and be trained. However,
   because the plugin cannot know the source of the message (it is
   assumed to be from OTHER folder), multiple bad scenarios can happen:


   1. SPAM --> SPAM transitions cannot be recognised and are trained;
   2. TRASH --> SPAM transitions cannot be recognised and are trained;
   3. SPAM --> not-SPAM transitions cannot be recognised therefore
      training good messages will never work with APPENDs.


I presume that the plugin works by monitoring COPY commands and so
can't work properly when a move is done by FETCH-APPEND-DELETE.

For sa-learn the problem would be 3, but I don't see how that is
affected by allowing appends on the spam folder.

Yeah, all of that sounds like they're talking about non-vetted training
mailboxes where the users are effectively talking directly to sa-learn.


I think I may see at least part of what they are driving at.

If one user trains a message as ham and another user who got a copy of
the same message trains it as spam, who wins?

Absent some conflict-detection mechanism, the last mailbox trained
(either spam or ham) wins.


As for the other two:

spam -> spam transitions don't matter: sa-learn recognises message-IDs
and won't learn from the same message in the same corpus more than once
(i.e. having the same message in the spam corpus multiple times does not
"weight" the tokens learned from that message). So (1) may be a
performance concern but it won't affect the database.


trash -> spam transition being learned is a problem how?

That latter brings up another concern for the vetted-corpora model: if a
message is *removed* from a training corpus mailbox rather than
reclassified, you'd have to wipe and retrain your database from scratch
to remove that message's effects.

So, you need *three* vetted corpus mailboxes: spam, ham, and
should-not-have-been-trained (forget). Rather than deleting a message
from the ham or spam corpus mailbox, you move it to the forget mailbox,
and in the next training pass sa-learn forgets the message and removes
it from the forget mailbox. This would be some special scripting,
because you can't just "sa-learn --forget" a whole mailbox.
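
A rough sketch of what that forget pass might look like, in Python. It
assumes the forget mailbox is an mbox file, that "sa-learn --forget"
reads a single message from stdin when no file is named, and that it
exits with status zero on success; check those against your
installation. The names forget_pass and FORGET_CMD are made up for
illustration.

```python
import mailbox
import subprocess

# Assumption: with no file argument, sa-learn reads one message from stdin.
FORGET_CMD = ("sa-learn", "--forget")


def forget_pass(forget_mbox_path, forget_cmd=FORGET_CMD):
    """Feed each message in the forget mailbox to forget_cmd, one at a time.

    A message is removed from the mailbox only if the command exits with
    status zero, so failures are retried on the next pass.
    Returns the number of messages forgotten and removed.
    """
    forgotten = 0
    mb = mailbox.mbox(forget_mbox_path, create=False)
    mb.lock()
    try:
        # Snapshot the keys so removals do not disturb the iteration.
        for key in list(mb.keys()):
            raw = mb.get_bytes(key)
            result = subprocess.run(
                list(forget_cmd), input=raw, stdout=subprocess.DEVNULL
            )
            if result.returncode == 0:
                mb.remove(key)
                forgotten += 1
        mb.flush()
    finally:
        mb.close()  # close() also flushes and releases the lock
    return forgotten
```

Run from cron after the normal ham and spam training passes, e.g.
forget_pass("/path/to/forget.mbox"). The command is a parameter so the
loop can be tried out with a harmless stand-in before pointing it at a
live Bayes database.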

There would also need to be an audit process to detect whether the same
message_id is in both the ham and spam corpus mailboxes, so that the
admin can delete (NOT forget) the incorrect classification, or forget
the message if neither classification is reasonable.
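
The audit could be as simple as intersecting the Message-ID headers of
the two corpora. A minimal sketch, assuming both corpora are mbox files
and that comparing the raw Message-ID header is good enough (the ID
sa-learn tracks internally may be derived differently). The function
names are hypothetical.

```python
import mailbox


def message_ids(mbox_path):
    """Return the set of Message-ID header values found in an mbox file."""
    ids = set()
    mb = mailbox.mbox(mbox_path, create=False)
    try:
        for msg in mb:
            mid = msg.get("Message-ID")
            if mid:
                ids.add(mid.strip())
    finally:
        mb.close()
    return ids


def find_conflicts(ham_path, spam_path):
    """Return the sorted Message-IDs present in both the ham and spam corpora."""
    return sorted(message_ids(ham_path) & message_ids(spam_path))
```

Anything find_conflicts() reports is a message for the admin to look
at by hand; the script deliberately does not decide which
classification is the wrong one.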


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  When designing software, any time you think to yourself "a user
  would never be stupid enough to do *that*", you're wrong.
-----------------------------------------------------------------------
 Today: the 10th anniversary of the loss of STS-107 Columbia
