On Thu, 29 Jan 2009 10:32:05 +0100 Matus UHLAR - fantomas <uh...@fantomas.sk> wrote:
> On 28.01.09 22:36, RW wrote:
> > I just pass it though dspam and then score like this:
> >
> > header DS_HAM X-DSPAM-Result =~ /^(Innocent|Whitelisted)/
> > header DS_SPAM X-DSPAM-Result =~ /^Spam/
> > meta DS_HAM_FULL DS_HAM && (BAYES_00 || BAYES_05)
> >
> > score DS_HAM -2.5
> > score DS_SPAM 21.0
> > score DS_HAM_FULL -15.0
>
> don't you trust dspam too much?

No, it's 9 points below the 30 point threshold, and DSPAM FP's once in
a blue moon.

> > score BAYES_00 -2.5
> > score BAYES_05 -1.5
>
> Why do you do this?
> 1. you are assigning scores for BAYES even if BAYES is turned off

Are any of the BAYES_* rules hit if BAYES is turned off?

> 2. BAYES_00 has score -2.599 when network rules are on, you are
> lowering effectiveness

Not really. BAYES_00 is left independently scored just to allow a
little extra safety margin for the unlikely combination DS_SPAM +
BAYES_00; aside from this combination, BAYES_00 will always trip the
DS_HAM_FULL meta rule.

It's set so that BAYES_00 + DS_HAM + DS_HAM_FULL = -20, which
eliminates the most serious problem in SA, whereby a large legitimate
mail sometimes accumulates a lot of textual hits and runs up a huge
score. The precise value doesn't matter all that much, but -20 allows
me to see at a glance how many other points I'm getting.

I don't regard the default BAYES scores as all that sensible, because
the scoring algorithm tolerates a very high level of FP's. I think most
people that use sa-learn properly would see better results if they
dropped BAYES_00 to -10 or lower. Personally I've never seen a single
spam hit BAYES_00.

> > However, thinking about it a bit more, I think that the only real
> > problem is that ham that scores between 0.1 and 5.0
> > wont be learned as ham, and I can fix that by moving the autolearn
> > threshold to up to 4.9.
>
> So you are willing to feed _anything_ to BAYES filters, which means
> that without manual intervention, FP and FN rate will both quickly
> increase

That's a bit alarmist. The worst case is that the 4.9 threshold causes
a small number of spams that aren't recognised by dspam to be
*temporarily* learned as ham in SA until I correct it.

That sounds pretty good to me: if a new type of mail starts to arrive
that's not recognised by DSPAM, I'd like SA to behave like DSPAM does
and give it the benefit of the doubt.

The normal way to train DSPAM is to let it autolearn everything, and
correct it when it's wrong. Once DSPAM has learned from reasonably
sized corpora it can stand a few percent misclassifications in new
mail, because they're washed out by correct classifications. It's only
if you stop correcting that you have a real problem, and even then it
will simply catch less spam, due to the strong bias against FP's.
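
For reference, the threshold change discussed above might look roughly
like this in local.cf. This is only a sketch: it assumes the stock
AutoLearnThreshold plugin is loaded, and the placement in local.cf is
an assumption rather than something stated in this thread. Only the
4.9 figure comes from the discussion.

```
# Sketch only: raise the ham autolearn ceiling from the default 0.1
# so that ham scoring just under the 5.0 spam threshold is still learned.
bayes_auto_learn                    1
bayes_auto_learn_threshold_nonspam  4.9
```

Note that autolearn decides on its own score, which leaves out the
BAYES_* rules, so the value compared against 4.9 is not necessarily
the final score shown in the message headers. Running
"spamassassin --lint" afterwards is a cheap way to catch a typo in
the option name.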