On Wed, 21 Mar 2012 10:06:58 +0100
Matus UHLAR - fantomas wrote:

> >On Fri, 9 Mar 2012 16:38:49 +0100
> >Matus UHLAR - fantomas wrote:

> >No, it isn't. Bayes is a statistical filter; it needs to learn a lot
> >of diverse spam and ham to reach its optimum accuracy. It's been
> >demonstrated on Bogofilter that "train-on-everything" outperforms
> >"train-on-error" on the same corpora. They both end up with similar
> >accuracy, but "train-on-everything" gets there very much faster.
> >Bogofilter is almost identical to BAYES; they just differ in the
> >details of the tokenizer and the Robinson parameters.
> >
> >Training on SA misclassifications is going to be glacially slow.
> 
> there are two problems when requiring users to manually learn on 
> everything.

I'm not advocating that users be forced to do anything; my preference
is to allow them to choose what they want to train on. Whether or not
your script chooses to learn everything they submit is a different
matter.

> - it's more work to implement

In general it's easier to implement explicit learn-spam and learn-ham
folders than it is to keep track of what is moved in and out of a spam
folder.
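
As a rough sketch of what I mean (the folder names "learn-spam" and
"learn-ham" and the one-message-per-file layout are just assumptions
for illustration, not anything SpamAssassin mandates), a script only
has to map each training folder to the matching sa-learn option:

```python
from pathlib import Path

# Hypothetical folder names; adjust to whatever your mail store uses.
LEARN_FOLDERS = {"learn-spam": "--spam", "learn-ham": "--ham"}

def build_sa_learn_commands(mail_root):
    """Return one sa-learn command per training folder that contains mail.

    Assumes each folder is a plain directory with one message per file.
    The commands are only built here, not executed.
    """
    commands = []
    for folder, flag in LEARN_FOLDERS.items():
        path = Path(mail_root) / folder
        # Skip folders that are missing or empty.
        if path.is_dir() and any(path.iterdir()):
            commands.append(["sa-learn", flag, str(path)])
    return commands
```

Each returned list could be handed to subprocess.run() from a cron
job, and the folder emptied afterwards. There is no state to keep
about what was moved where.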

> - it's more work for users to do the training.

Not really. If they choose to learn just the SpamAssassin
misclassifications it's the same work, but they have the option to
learn more, in particular important ham. Personally, if I saw that
important mail was hitting BAYES_50, I'd feel pretty frustrated
sitting around waiting for FPs to train Bayes, knowing that those
FPs are avoidable.
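
To make the difference between the two regimes concrete, here is a toy
sketch. It is not the SpamAssassin or Bogofilter algorithm; ToyBayes,
its log-ratio scoring and the train() helper are all made up for
illustration. The only point is how few messages a train-on-error
filter actually gets to see:

```python
import math
from collections import Counter

class ToyBayes:
    """Deliberately simplified token-count filter (illustrative only)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.trained = 0  # number of messages actually learned

    def learn(self, tokens, label):
        # Count each distinct token once per message.
        self.counts[label].update(set(tokens))
        self.trained += 1

    def classify(self, tokens):
        # Sum of log(spam count / ham count) with add-one smoothing.
        score = 0.0
        for tok in set(tokens):
            spam = self.counts["spam"][tok] + 1
            ham = self.counts["ham"][tok] + 1
            score += math.log(spam / ham)
        return "spam" if score > 0 else "ham"

def train(messages, on_error_only):
    """Feed (tokens, label) pairs in order, under one of the two regimes."""
    bayes = ToyBayes()
    for tokens, label in messages:
        # Train-on-error learns only when the current verdict is wrong.
        if not on_error_only or bayes.classify(tokens) != label:
            bayes.learn(tokens, label)
    return bayes
```

Running train() over the same list of messages with on_error_only set
to False and then True, and comparing the two .trained counters, shows
how much less data the error-only filter accumulates.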

> Note that the main goal of spam filters is to save people some work, 
> not to give it to them. The users will want to do the "train only on 
> misfires", and the sooner they get there, the better.

On Wed, 21 Mar 2012 08:38:24 -0400
Michael Scheidell wrote:

> On 3/21/12 5:06 AM, Matus UHLAR - fantomas wrote:
> > there are two problems when requiring users to manually learn on 
> > everything.
> > - it's more work to implement
> > - it's more work for users to do the training.
> and, if 95% of the users are using Microsoft Exchange, Exchange will 
> horribly mangle the headers, and the body, even changing the actual 
> encoding.
> so, what would you manually learn?

That applies to any form of manual user training, so it's a different
issue.

I don't know the details of what Exchange does, but I suspect it matters
less than you think, because most of the information used by Bayes
is in normalized form.
