On Sat, 2 Feb 2013, RW wrote:
ALLOWING APPENDS
By appends we mean the case of mail moving when the source folder is
unknown, e.g. when you move from some other account or with tools
like offlineimap. You should be careful with allowing APPENDs to
SPAM folders. The reason for possibly allowing it is to allow
not-SPAM --> SPAM transitions to work and be trained. However,
because the plugin cannot know the source of the message (it is
assumed to be from OTHER folder), multiple bad scenarios can happen:
1. SPAM --> SPAM transitions cannot be recognised and are trained;
2. TRASH --> SPAM transitions cannot be recognised and are trained;
3. SPAM --> not-SPAM transitions cannot be recognised therefore
training good messages will never work with APPENDs.
I presume that the plugin works by monitoring COPY commands and so
can't work properly when a move is done by FETCH-APPEND-DELETE.
For sa-learn the problem would be 3, but I don't see how that is
affected by allowing appends on the spam folder.
Yeah, all of that sounds like they're talking about non-vetted training
mailboxes where the users are effectively talking directly to sa-learn.
I think I may see at least part of what they are driving at.
If one user trains a message as ham and another user who got a copy of the
same message trains it as spam, who wins?
Absent some conflict-detection mechanism, the last mailbox trained (either
spam or ham) wins.
As for the other two:
spam -> spam transitions don't matter: sa-learn recognises message-IDs and
won't learn from the same message in the same corpus more than once (i.e.
having the same message in the spam corpus multiple times does not
"weight" the tokens learned from that message). So (1) may be a
performance concern but it won't affect the database.
trash -> spam transition being learned is a problem how?
The latter brings up another concern for the vetted-corpora model: if a
message is *removed* from a training corpus mailbox rather than
reclassified, you'd have to wipe and retrain your database from scratch to
remove that message's effects.
So, you need *three* vetted corpus mailboxes: spam, ham, and
should-not-have-been-trained (forget). Rather than deleting a message from
the ham or spam corpus mailbox you move it to the forget mailbox, and in
the next training pass sa-learn forgets the message and removes it from the
forget mailbox. This would be some special scripting, because you can't
just "sa-learn --forget" a whole mailbox.
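A rough sketch of what that scripting might look like, assuming the forget
mailbox is a local mbox file and that sa-learn will read a single message
from stdin when given no file argument. The function names
(sa_learn_forget, drain_forget_mailbox) are made up for illustration, and
the forget step is passed in as a parameter so it can be swapped out or
tested without touching a real Bayes database:

```python
import mailbox
import subprocess


def sa_learn_forget(raw_message: bytes) -> bool:
    """Pipe one message to `sa-learn --forget`; True if it exited with 0.

    Assumption: sa-learn reads the message from stdin here.
    """
    result = subprocess.run(["sa-learn", "--forget"], input=raw_message)
    return result.returncode == 0


def drain_forget_mailbox(path, forget=sa_learn_forget):
    """Forget each message in the mbox at `path`, then remove it.

    A message is removed only if `forget` returned True for it, so
    failures stay in the mailbox for the next training pass.
    Returns the number of messages forgotten.
    """
    box = mailbox.mbox(path)
    box.lock()
    forgotten = 0
    try:
        # Copy the key list first: we remove entries while walking it.
        for key in list(box.keys()):
            if forget(box.get_bytes(key)):
                box.remove(key)
                forgotten += 1
    finally:
        box.close()  # flushes pending removals and releases the lock
    return forgotten
```

If the corpora live on an IMAP server rather than in local mbox files, the
same loop would apply but the mailbox access would have to go through an
IMAP client instead.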
There would also need to be an audit process to detect whether the same
message_id is in both the ham and spam corpus mailboxes, so that the admin
can delete (NOT forget) the incorrect classification, or forget the
message if neither classification is reasonable.
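The audit could be as simple as intersecting the Message-ID headers of the
two corpora. A minimal sketch, again assuming local mbox files; the names
message_ids and conflicting_ids are hypothetical, and it compares the raw
Message-ID header, which may not be exactly the identifier sa-learn tracks
internally:

```python
import mailbox


def message_ids(path):
    """Return the set of Message-ID header values in the mbox at `path`."""
    box = mailbox.mbox(path, create=False)
    try:
        ids = set()
        for msg in box:
            mid = msg.get("Message-ID")
            if mid:
                ids.add(str(mid).strip())
        return ids
    finally:
        box.close()


def conflicting_ids(ham_path, spam_path):
    """Message-IDs present in both corpora, sorted for a stable report."""
    return sorted(message_ids(ham_path) & message_ids(spam_path))
```

Anything this returns needs a human decision: delete the wrong copy, or
move the message to the forget mailbox if neither classification holds up.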
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
When designing software, any time you think to yourself "a user
would never be stupid enough to do *that*", you're wrong.
-----------------------------------------------------------------------
Today: the 10th anniversary of the loss of STS-107 Columbia