On 2/1/2013 7:58 PM, John Hardin wrote:
> On Sat, 2 Feb 2013, RW wrote:
> 
>> ALLOWING APPENDS
>>    By appends we mean the case of mail moving when the source folder is
>>    unknown, e.g. when you move from some other account or with tools
>>    like offlineimap. You should be careful with allowing APPENDs to
>>    SPAM folders. The reason for possibly allowing it is to allow
>>    not-SPAM --> SPAM transitions to work and be trained. However,
>>    because the plugin cannot know the source of the message (it is
>>    assumed to be from OTHER folder), multiple bad scenarios can happen:
>>
>>    1. SPAM --> SPAM transitions cannot be recognised and are trained;
>>    2. TRASH --> SPAM transitions cannot be recognised and are trained;
>>    3. SPAM --> not-SPAM transitions cannot be recognised therefore
>>       training good messages will never work with APPENDs.
>>
>>
>> I presume that the plugin works by monitoring COPY commands and so
>> can't work properly when a move is done by FETCH-APPEND-DELETE.
>>
>> For sa-learn the problem would be 3, but I don't see how that is
>> affected by allowing appends on the spam folder.
> 
> Yeah, all of that sounds like they're talking about non-vetted training
> mailboxes where the users are effectively talking directly to sa-learn.
> 
> I think I may see at least part of what they are driving at.
> 
> If one user trains a message as ham and another user who got a copy of
> the same message trains it as spam, who wins?
> 
> Absent some conflict-detection mechanism, the last mailbox trained
> (either spam or ham) wins.
> 
> As for the other two:
> 
> spam -> spam transitions don't matter, sa-learn recognises message-IDs
> and won't learn from the same message in the same corpus more than once
> (i.e. having the same message in the spam corpus multiple times does not
> "weight" the tokens learned from that message). So (1) may be a
> performance concern but it won't affect the database.
> 
> trash -> spam transition being learned is a problem how?
> 
> That latter brings up another concern for the vetted-corpora model: if a
> message is *removed* from a training corpus mailbox rather than
> reclassified, you'd have to wipe and retrain your database from scratch
> to remove that message's effects.
> 
> So, you need *three* vetted corpus mailboxes: spam, ham, and
> should-not-have-been-trained (forget). Rather than deleting a message
> from the ham or spam corpus mailbox you move it to the forget mailbox
> and in the next training pass sa-learn forgets the message and removes
> it from the forget mailbox. This would be some special scripting,
> because you can't just "sa-learn --forget" a whole mailbox.
> 
> There would also need to be an audit process to detect whether the same
> message_id is in both the ham and spam corpus mailboxes, so that the
> admin can delete (NOT forget) the incorrect classification, or forget
> the message if neither classification is reasonable.
> 

You reveal some crucial information with regard to corpora management
here, John.

I've taken your good advice and created a third mailbox (well, a third
"folder" within the same mailbox), named "Forget".

It sounds as though the key here is never to delete messages from either
corpus -- unless the same message exists in both corpora, in which case
the misclassified message should be deleted. If neither classification
is reasonable and the message should instead be forgotten, what's the
order of operations? Should a copy of the message be created in the
"Forget" corpus and then the message deleted from both the "Ham" and
"Spam" corpora?
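
For the audit step, here is a rough sketch of how the "same message in
both corpora" check might look in Python. It assumes the corpus folders
are Maildirs and compares only the Message-ID header, which may not match
exactly how sa-learn identifies a message internally. The function names
and the paths passed to audit_maildirs() are hypothetical.

```python
import mailbox
from typing import Iterable, Set


def message_ids(messages: Iterable) -> Set[str]:
    """Collect the Message-ID header of every message that has one."""
    ids = set()
    for msg in messages:
        mid = msg.get("Message-ID")
        if mid:
            # str() guards against Header objects; strip() drops stray whitespace.
            ids.add(str(mid).strip())
    return ids


def conflicting_ids(ham: Iterable, spam: Iterable) -> Set[str]:
    """Return the Message-IDs that appear in both the ham and spam corpora."""
    return message_ids(ham) & message_ids(spam)


def audit_maildirs(ham_path: str, spam_path: str) -> Set[str]:
    """Open two Maildir folders (paths are assumptions) and report conflicts."""
    ham = mailbox.Maildir(ham_path, create=False)
    spam = mailbox.Maildir(spam_path, create=False)
    return conflicting_ids(ham, spam)
```

Anything this reports would still need a human decision: delete the wrong
copy, or move the message to "Forget" if neither classification holds.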

With regard to the specialized scripting required to "forget" messages,
this sounds cumbersome:

> because you can't just "sa-learn --forget" a whole mailbox.

Is there a non-obvious reason for this? Would the logic behind a
recursive --forget switch not be the same as, or similar to, that of the
existing --ham and --spam switches?
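
If per-message scripting really is needed, I imagine it would look
something like the sketch below. It assumes the "Forget" folder is a
Maildir and that sa-learn reads a single message from standard input when
no file is given. The function names are hypothetical, and the forget
callback is injectable so the loop can be tried without touching the
Bayes database.

```python
import mailbox
import subprocess
from typing import Callable


def forget_with_sa_learn(raw_message: bytes) -> bool:
    """Pipe one message to `sa-learn --forget` on stdin; True on exit status 0."""
    result = subprocess.run(
        ["sa-learn", "--forget"],
        input=raw_message,
        stdout=subprocess.DEVNULL,
    )
    return result.returncode == 0


def forget_folder(
    folder: mailbox.Mailbox,
    forget: Callable[[bytes], bool] = forget_with_sa_learn,
) -> int:
    """Forget every message in `folder`, removing each one that succeeds.

    Messages for which `forget` returns False are left in place so they
    can be retried on the next training pass.
    """
    forgotten = 0
    # Copy the keys first, since the folder is modified during the loop.
    for key in list(folder.keys()):
        raw = folder.get_bytes(key)
        if forget(raw):
            folder.discard(key)
            forgotten += 1
    return forgotten
```

It would be called as forget_folder(mailbox.Maildir(path, create=False)),
with the path pointing at the "Forget" folder.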

Finally, when a user submits a message to be classified as ham or spam,
how should I be sorting the messages? I see the following scenarios:

1.) I agree with the end-user's classification.

2.) I disagree with the end-user's classification.
        a.) Because the message was submitted as ham but is really spam
            (or vice versa)
        b.) Because neither classification is reasonable

In case 1.), should I *copy* the message from the submission inbox's Ham
folder to the permanent Ham corpus folder? Or should I *move* the
message? I'm trying to discern whether or not there's value in retaining
end-user submissions *as they were classified upon submission*.

In case 2.), should I simply delete the message from the submission
folder? Or is there some reason to retain the message (e.g., move it
into an "Erroneous" folder within the submission mailbox)?

I did read http://wiki.apache.org/spamassassin/HandClassifiedCorpora ,
but it doesn't address these issues, specifically.

Thanks again!

-Ben
