On 27.03.2015 19:09, RW wrote:
> On Fri, 27 Mar 2015 15:16:13 +0000
> Michael wrote:
> 
>> Hi,
>>
>> I would like to automatically train each user's Bayes database in the
>> following way:
>>
>> Do the following once a day for each user:
>> 1.) sa-learn -u username --ham ../maildir/cur
>> 2.) sa-learn -u username --spam ../maildir/.Spam/cur
>>
>> The idea is to train the Bayes for each user without the need to
>> take care of learning Spam/Ham on their own.
>>
>> The reason for taking the "cur" folder instead of the "new" folder
>> is that I assume that the contents of these folders have already
>> been verified for false-positives/negatives by the user.
> 
> "cur" doesn't imply that the mail has been read; for that you
> need to check the seen flag in the filename, an S somewhere after the
> colon.

Yes, that's true. But if I'm right, new mails stay in "new" until the
appropriate folder has been opened in the IMAP client, right? I just
assume that if the user has some false negatives in the folder, he will
either delete them immediately or move them into the Spam folder.
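
If I do end up checking the seen flag, a minimal sketch could look like
this. It assumes the usual Maildir naming, where the flags follow ":2,"
at the end of the filename; the function name is only illustrative.

```python
def is_seen(filename: str) -> bool:
    """Return True if a Maildir filename carries the S (seen) flag.

    Assumes the standard Maildir layout "<unique>:2,<flags>".
    """
    # Split on the last colon so that only the info part is inspected.
    _, sep, info = filename.rpartition(":")
    if not sep or not info.startswith("2,"):
        # No info part at all, or not the "2," flag format.
        return False
    return "S" in info[2:]
```

Only files for which this returns True would then be handed to sa-learn.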

> 
> 
>> A problem that could occur is when the user always deletes all mails  
>> in .Spam/cur. Then the Bayes is only trained with Ham, but never
>> Spam. Or isn't that a problem?
> 
> Not if you tell them - then it's their fault if it doesn't work.
> Alternately you could have a separate train-spam folder and empty it
> after training.

I think it's easier for the users if they just leave spam in the Spam
folder for at least one day. Most of them will not move spam into a
learn folder.
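
For what it's worth, the daily job I have in mind is roughly the
following sketch. The Maildir location and the helper names are
assumptions for illustration; only the sa-learn options are the ones
from my original mail.

```python
import subprocess
from pathlib import PurePosixPath


def learn_commands(username: str, maildir: str) -> list:
    """Build the two sa-learn calls for one user (ham first, then spam)."""
    root = PurePosixPath(maildir)
    return [
        ["sa-learn", "-u", username, "--ham", str(root / "cur")],
        ["sa-learn", "-u", username, "--spam", str(root / ".Spam" / "cur")],
    ]


def run_daily(users: dict) -> None:
    """Run the training for every user; `users` maps username -> Maildir path."""
    for username, maildir in users.items():
        for cmd in learn_commands(username, maildir):
            # check=True aborts the run if sa-learn exits with an error.
            subprocess.run(cmd, check=True)
```

Cron would then call run_daily once a day with the user list.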

> 
> You could also supplement spam training by autolearning only spam, e.g.
> I have:
> 
> bayes_auto_learn 1
> bayes_auto_learn_on_error 1
> bayes_auto_learn_threshold_nonspam -2000.0

But that learns spam only if its score reaches 12.0, and it learns no
nonspam at all. In that case maybe the default config, which auto-learns
both spam and ham, is already the best choice...
My setup is already configured to retrain when the user moves mail from
Inbox to Spam or from Spam to another folder.
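
To make sure I understand the thresholds, here is a simplified sketch of
the decision as I read it. It assumes the default thresholds of 0.1 and
12.0 and leaves out the extra conditions SpamAssassin applies before
autolearning; the function name is made up.

```python
def autolearn_decision(score, nonspam_threshold=0.1, spam_threshold=12.0):
    """Return "spam", "ham", or None (do not learn) for an autolearn score.

    Simplified: only the two thresholds are modelled here.
    """
    if score >= spam_threshold:
        return "spam"
    if score < nonspam_threshold:
        return "ham"
    return None
```

With bayes_auto_learn_threshold_nonspam set to -2000.0, the "ham" branch
is effectively never reached, so only spam gets autolearned.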

> 
> Personally I've never seen a spam mistrained as a ham with the
> default threshold, and sensible rule scores.
> 
> I think where some people go wrong is that they don't specify
> aggressive custom scores correctly. With autolearning it's better to
> keep conservative scores in the non-Bayes scoresets e.g.
> 
> score SOME_RULE  2 2 8 8
> 
> not
> 
> score SOME_RULE  8
> 
> There's no difference in classification, but the latter is more likely to
> cause mistraining on FPs.
> 
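
If I read the four-value score line correctly, the columns are the
scoresets: Bayes off without and with network tests, then Bayes on
without and with network tests. A small sketch of the lookup, with a
made-up helper name and the assumption that autolearning uses the
non-Bayes column:

```python
def score_for(scores, use_bayes, use_net):
    """Pick the score of a rule for the given scoreset.

    `scores` holds either one value (used everywhere) or four values,
    one per scoreset 0..3.
    """
    if len(scores) == 1:
        return scores[0]
    # Scoreset index: +2 when Bayes is enabled, +1 when network tests are.
    index = (2 if use_bayes else 0) + (1 if use_net else 0)
    return scores[index]


# Classification with Bayes and network tests enabled sees 8 points ...
classify = score_for([2, 2, 8, 8], use_bayes=True, use_net=True)
# ... while the autolearn decision only counts the conservative 2 points.
autolearn = score_for([2, 2, 8, 8], use_bayes=False, use_net=True)
```

With "score SOME_RULE 8" both lookups give 8, which is why a false
positive on that rule is pushed further towards the spam autolearn
threshold.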
