On Sat, 2 Feb 2013, RW wrote:

> ALLOWING APPENDS
>    By appends we mean the case of mail moving when the source folder is
>    unknown, e.g. when you move from some other account or with tools
>    like offlineimap. You should be careful with allowing APPENDs to
>    SPAM folders. The reason for possibly allowing it is to allow
>    not-SPAM --> SPAM transitions to work and be trained. However,
>    because the plugin cannot know the source of the message (it is
>    assumed to be from OTHER folder), multiple bad scenarios can happen:
>
>    1. SPAM --> SPAM transitions cannot be recognised and are trained;
>    2. TRASH --> SPAM transitions cannot be recognised and are trained;
>    3. SPAM --> not-SPAM transitions cannot be recognised therefore
>       training good messages will never work with APPENDs.
>
>
> I presume that the plugin works by monitoring COPY commands and so
> can't work properly when a move is done by FETCH-APPEND-DELETE.
>
> For sa-learn the problem would be 3, but I don't see how that is
> affected by allowing appends on the spam folder.

Yeah, all of that sounds like they're talking about non-vetted training mailboxes where the users are effectively talking directly to sa-learn.

I think I may see at least part of what they are driving at.

If one user trains a message as ham and another user who got a copy of the same message trains it as spam, who wins?

Absent some conflict-detection mechanism, the last mailbox trained (either spam or ham) wins.

As for the other two:

spam -> spam transitions don't matter: sa-learn recognises message-IDs and won't learn from the same message in the same corpus more than once (i.e. having the same message in the spam corpus multiple times does not "weight" the tokens learned from that message). So (1) may be a performance concern, but it won't affect the database.
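
To make the two behaviours above concrete (relearning in the same corpus is skipped, and for a conflicting classification the last one trained wins), here is a toy model. It is only an illustration of the bookkeeping, not SpamAssassin's code; the class and method names are made up, and a bare message-ID string stands in for whatever sa-learn really keys on.

```python
from collections import Counter


class ToyBayesStore:
    """Toy model of per-message training bookkeeping (NOT SpamAssassin code)."""

    def __init__(self):
        self.seen = {}  # message id -> (label, tokens) as last learned
        self.counts = {"spam": Counter(), "ham": Counter()}

    def learn(self, msg_id, label, tokens):
        """Learn a message; return False if it was skipped as already learned."""
        previous = self.seen.get(msg_id)
        if previous is not None:
            if previous[0] == label:
                return False      # same corpus again: no extra token weight
            self.forget(msg_id)   # other corpus: undo first, so the last one wins
        tokens = list(tokens)
        self.counts[label].update(tokens)
        self.seen[msg_id] = (label, tokens)
        return True

    def forget(self, msg_id):
        """Remove a previously learned message's tokens; False if unknown."""
        entry = self.seen.pop(msg_id, None)
        if entry is None:
            return False
        label, tokens = entry
        self.counts[label].subtract(tokens)
        self.counts[label] = +self.counts[label]  # drop zero/negative counts
        return True
```

Learning the same ID as spam twice leaves each token count at 1; learning it as ham afterwards moves the tokens from the spam counts to the ham counts.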

trash -> spam transition being learned is a problem how?

The latter brings up another concern for the vetted-corpora model: if a message is *removed* from a training corpus mailbox rather than reclassified, you'd have to wipe and retrain your database from scratch to remove that message's effects.

So, you need *three* vetted corpus mailboxes: spam, ham, and should-not-have-been-trained (forget). Rather than deleting a message from the ham or spam corpus mailbox, you move it to the forget mailbox, and in the next training pass sa-learn forgets the message and removes it from the forget mailbox. This would be some special scripting, because you can't just "sa-learn --forget" a whole mailbox.
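
A rough sketch of what that scripting might look like, under some assumptions: the forget mailbox is a local mbox file (it could just as well be Maildir or an IMAP folder), sa-learn is on the PATH and uses its default database, and one message at a time is fed to "sa-learn --forget" on stdin. The function names are hypothetical.

```python
import mailbox
import subprocess


def sa_learn_forget(raw_message: bytes) -> bool:
    """Pipe one message to `sa-learn --forget` on stdin; True on exit status 0."""
    result = subprocess.run(["sa-learn", "--forget"], input=raw_message)
    return result.returncode == 0


def drain_forget_mailbox(path, forget=sa_learn_forget):
    """Forget every message in the mbox at `path`; remove those that succeed.

    Assumption: `path` is an mbox file. Messages whose forget step fails
    stay in the mailbox so the next pass can retry them.
    """
    box = mailbox.mbox(path)
    box.lock()
    try:
        forgotten = 0
        for key in list(box.keys()):
            if forget(box.get_bytes(key)):
                box.remove(key)
                forgotten += 1
        box.flush()
        return forgotten
    finally:
        box.unlock()
        box.close()
```

The `forget` parameter is there only so the loop can be exercised without touching a real Bayes database; a cron job would call drain_forget_mailbox() with the default after the normal spam and ham training passes.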

There would also need to be an audit process to detect whether the same Message-ID is in both the ham and spam corpus mailboxes, so that the admin can delete (NOT forget) the incorrect classification, or forget the message if neither classification is reasonable.
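
The audit could be as simple as intersecting the Message-ID headers of the two corpora. A minimal sketch, again assuming mbox files and assuming the Message-ID header alone is a good enough key for this check; the function names are hypothetical:

```python
import mailbox


def message_ids(path):
    """Return the set of Message-ID header values found in the mbox at `path`."""
    box = mailbox.mbox(path)
    try:
        ids = set()
        for msg in box:
            mid = msg.get("Message-ID")
            if mid:
                ids.add(str(mid).strip())
        return ids
    finally:
        box.close()


def conflicting_ids(ham_path, spam_path):
    """Message-IDs present in both corpora, sorted for a stable report."""
    return sorted(message_ids(ham_path) & message_ids(spam_path))
```

The script only reports; deciding which copy to delete, or whether to move both to the forget mailbox, stays with the admin.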

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  When designing software, any time you think to yourself "a user
  would never be stupid enough to do *that*", you're wrong.
-----------------------------------------------------------------------
 Today: the 10th anniversary of the loss of STS-107 Columbia
