Michael Monnerie wrote:
> On Tuesday, 9 May 2006 23:14 Bowie Bailey wrote:
> > When I look at the overall stats, bayes does pretty well:
> >
> > RANK  RULE NAME  COUNT  %OFRULES  %OFMAIL  %OFSPAM  %OFHAM
> > ------------------------------------------------------------
> >    6  BAYES_99   26754      4.19    44.49    67.00    3.06
>
> 3% HAM hits for BAYES_99 is horrible, not good. It's the FPs that
> should put you on alert.
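As an aside on the scale of that 3%: the row's own percentages are
enough to back out rough absolute counts. A minimal sketch in Python,
assuming %OFMAIL is hits over total scanned mail and %OFSPAM/%OFHAM
are per-class hit rates (sa-stats' exact column definitions may
differ):

    # Back out approximate absolute counts from the quoted sa-stats row.
    # Assumption: %OFMAIL = hits / total mail, and %OFSPAM / %OFHAM are
    # the fraction of each class that hit the rule.
    hits = 26754
    pct_mail, pct_spam, pct_ham = 44.49, 67.00, 3.06

    total = hits / (pct_mail / 100)          # ~60,100 messages overall
    # hits = (pct_spam/100)*spam + (pct_ham/100)*ham, spam + ham = total
    spam = (hits - (pct_ham / 100) * total) / ((pct_spam - pct_ham) / 100)
    ham = total - spam                       # ~21,000 ham messages
    print(f"~{(pct_ham / 100) * ham:.0f} ham messages hit BAYES_99")

That works out to roughly 650 ham messages hitting BAYES_99 out of
about 21,000 ham total over the reporting period.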
True enough. But no complaints so far. I'm not sure how many of my
clients are even taking advantage of the spam markup.

> > But when I do it for only our domain (which is where all the manual
> > training happens), it hits less ham, but less spam as well:
> >
> > RANK  RULE NAME  COUNT  %OFRULES  %OFMAIL  %OFSPAM  %OFHAM
> > ------------------------------------------------------------
> >    8  BAYES_99    4649      3.29    33.41    54.64    0.20
>
> At least a much better FP rate, by a factor of 15!

> > Just my personal email address (which is trained aggressively) gets
> > very few ham hits (partly because I lowered my threshold to 4.0),
> > but less spam than overall:
> >
> > RANK  RULE NAME  COUNT  %OFRULES  %OFMAIL  %OFSPAM  %OFHAM
> > ------------------------------------------------------------
> >    5  BAYES_99    1643      3.08    27.05    65.72    0.08
>
> Again the FPs reduced...

Of course, it's being constantly trained, and the spam threshold is
lower. What I'm curious about is why I don't see more spam hits from a
well-trained database.

> > And then when I modify sa-stats to exclude our domain, I find that
> > our customers (who are trained exclusively with autolearn) seem to
> > do better than us:
> >
> > RANK  RULE NAME  COUNT  %OFRULES  %OFMAIL  %OFSPAM  %OFHAM
> > ------------------------------------------------------------
> >    6  BAYES_99   22105      4.44    47.83    70.35    4.11
>
> No, 4% FPs is nothing you should be happy with.

> > Based on these results, it almost seems like the more training Bayes
> > gets, the worse it does!
>
> But remember that sa-stats can never tell whether that HAM/SPAM really
> is such; it just tells you what SA *believed* was HAM/SPAM.

Right. That's what I was referring to below.

> > Are these anomalies just an artifact of sa-stats relying on SA to
> > judge ham and spam properly? Can these numbers be trusted at all if
> > my users don't reliably report false negatives and positives?
>
> As I said on the other thread: Be very careful what you feed to Bayes.
> Try to find those 4% of FPs, and check whether they are really FPs.
> Maybe your SA made the mistakes because you don't have enough rules to
> detect all SPAMs.

The group with 4% false positives is trained exclusively through
autolearn; there is no facility for manual training on those accounts.
If I look only at the false positives, the numbers line up with
expectations: the more manual training a group gets, the lower its FP
rate. Why don't I see a similar trend with the spam hits? (Rough
sketches follow below of how autolearn decides what to train on, and
of one way to filter a domain out of the stats.)

--
Bowie
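A rough sketch of the autolearn decision the 4%-FP group depends on,
in Python. This is a simplification: 0.1 and 12.0 are the stock
bayes_auto_learn_threshold_nonspam / bayes_auto_learn_threshold_spam
values, and the real implementation also ignores the Bayes rules' own
contribution to the score and requires minimum points from header and
body tests:

    # Simplified model of SpamAssassin's auto-learn decision.
    NONSPAM_THRESHOLD = 0.1  # bayes_auto_learn_threshold_nonspam (default)
    SPAM_THRESHOLD = 12.0    # bayes_auto_learn_threshold_spam (default)

    def autolearn_verdict(score: float) -> str:
        """What autolearn would feed to Bayes for a given message score."""
        if score <= NONSPAM_THRESHOLD:
            return "learn as ham"
        if score >= SPAM_THRESHOLD:
            return "learn as spam"
        return "no training"

The point to notice: autolearn only trains on messages SA has already
scored with high confidence, so a confidently misclassified message is
learned with the wrong label and reinforced rather than corrected.
Manual training is the only way those errors get fixed, which would
fit the pattern in the numbers above.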
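And for the "exclude our domain" runs, a hypothetical pre-filter over
the spamd log, also in Python. The spamd "result:" lines and the
user=... token are assumptions about the log format (it varies with
how spamd is invoked and how users are mapped to addresses), so adjust
the pattern to whatever your logs actually contain:

    # Hypothetical pre-filter: drop one domain's spamd "result:" lines,
    # then run sa-stats over what remains.
    import re
    import sys

    OUR_DOMAIN = "ourdomain.example"   # placeholder, not a real domain
    pattern = re.compile(r"user=[^@,\s]+@" + re.escape(OUR_DOMAIN))

    for line in sys.stdin:
        # Skip result lines attributed to our own domain's users
        if "spamd: result:" in line and pattern.search(line):
            continue
        sys.stdout.write(line)

Run it as, e.g., "python filter_domain.py < maillog > filtered.log"
and then point sa-stats at the filtered file as usual.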