On Tue, 31 Mar 2009 18:34:05 +0200 (CEST) "Benny Pedersen" <m...@junc.org> wrote:
>
> On Tue, March 31, 2009 17:53, RW wrote:
> > I think it would be nice if SA could handle this automatically
> > e.g. if ham is over-represented then only autolearn ham where
> > p>0.001, and vice versa.
>
> it already does

I'm not sure what you are saying here, but what I was suggesting was
that autolearning be modified to maintain the ratio of spam:ham within
reasonable limits. That doesn't appear to be the case when people are
ending up with 10:1 ratios.

> > At the moment the only way of tweaking this is to vary the
> > thresholds, which is about the worst possible way of doing it.
>
> why ?

Because it distorts the databases. If you push the ham autolearn
threshold down to learn less ham, it becomes much more selective. For
example you can be learning all the ham from domain A, but none from
domain B, but you're still autolearning spam from domain B.

And in general the closer mail scores to 5.0 the more valuable it is to
learn it. Maintaining the ham:spam ratio by discarding the more valuable
candidates for learning doesn't make much sense to me.
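
As a rough illustration of the ratio-aware check suggested at the top of
this thread, here is a minimal sketch in Python (not SA's Perl, and not
anything SpamAssassin actually implements). The function name, the
max_ratio of 2.0 and the 0.001 margin are all assumptions made up for
the example; p is taken to be the Bayes spam probability of a message
that has already passed the normal autolearn score thresholds.

```python
def should_autolearn(kind, bayes_p, nspam, nham, max_ratio=2.0, margin=0.001):
    """Decide whether to autolearn a message that already passed the
    usual score thresholds. Hypothetical helper, not SpamAssassin code.

    kind        -- "ham" or "spam"
    bayes_p     -- current Bayes spam probability of the message (0..1)
    nspam, nham -- number of messages already learned in each class
    max_ratio   -- assumed limit on how lopsided the database may get
    margin      -- assumed cut-off for "Bayes is already certain"
    """
    if kind not in ("ham", "spam"):
        raise ValueError("kind must be 'ham' or 'spam'")

    if kind == "ham":
        # Ham is over-represented: skip only the ham Bayes is already
        # sure about (p <= margin), keep the informative candidates.
        over_represented = nham > max_ratio * max(nspam, 1)
        return (not over_represented) or bayes_p > margin

    # Vice versa for spam: skip only spam Bayes already rates ~1.0.
    over_represented = nspam > max_ratio * max(nham, 1)
    return (not over_represented) or bayes_p < 1.0 - margin
```

With 1000 ham and 100 spam already learned, a ham candidate at p=0.0001
would be skipped while one at p=0.2 would still be learned. The point of
the sketch is that the messages dropped are the ones Bayes already
classifies confidently, rather than the ones scoring closest to 5.0.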