On 10/26, Alexandre Boyer wrote:
> Well, discouraged was implicit (as is the fact that every admin is

I don't think there's anything implicit about it being discouraged to
use a threshold below 5. There are lots of local changes which are far
less likely to cause problems, and which are encouraged.

> The SA rules scores are computed based on the mass-checks, from the
> project and, to some extend, from contributors. A good question is: how
> many contributors really give a feedback on the mass-checks?

This is public information, although not very explicit. On
http://ruleqa.spamassassin.org/ look in the green box, it lists all the
corpora included:

  axb-coi-bulk axb-fraud axb-generic axb-ham-misc axb-sa-users axb-woas
  bb-guenther_fraud bb-jhardin bb-jhardin_fraud bb-jm bb-kmcgrail bb-zmi
  bpoliakoff danmcdonald darxus grenier jarif kpg-gah mas zmi

The ones starting with "bb-" are uploaded emails: instead of the
contributor running masscheck locally, it's run centrally. Other than
that, the prefixes are each different contributors. So:

  axb, guenther, jhardin, jm, kmcgrail, zmi, bpoliakoff, danmcdonald,
  darxus, grenier, kpg-gah, kpg, mas, zmi.

14 masscheck contributors. We'd probably benefit a lot by significantly
increasing that, which is why I mention it somewhat often.

> This is something I do not know, but the fewer they are, the greater the
> bias is. Bias in spam and ham samples. Emails reaching my servers are
> different from yours and from each and every SA users.

Absolutely.

> Unless everybody on earth run a nightly mass-check and report results to
> SA project for it to compute a "world wide" scoring, there is a bias. At
> least this is my understanding, may be I'm wrong, please correct me if so.

No, you're totally right. We do what we can with what we have, and I
think we do pretty darn good. But we could do better with more data.

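For anyone who wants to redo the contributor tally from the corpus names
listed above, here is a rough Python sketch. The naming rule is an
assumption, not something ruleqa documents: strip a leading "bb-", then
take whatever comes before the first "-" or "_". The helper names
(contributor, tally) are made up for illustration. Under this rule
"zmi" and "bb-zmi" collapse into one contributor and "kpg-gah" counts
once, so the total can come out differently from a hand count.

```python
# Corpus names copied verbatim from the list in the message.
CORPORA = """
axb-coi-bulk axb-fraud axb-generic axb-ham-misc axb-sa-users axb-woas
bb-guenther_fraud bb-jhardin bb-jhardin_fraud bb-jm bb-kmcgrail bb-zmi
bpoliakoff danmcdonald darxus grenier jarif kpg-gah mas zmi
""".split()


def contributor(corpus: str) -> str:
    """Guess the contributor behind a corpus name (assumed naming rule)."""
    name = corpus
    # "bb-" marks uploaded mail that is mass-checked centrally.
    if name.startswith("bb-"):
        name = name[len("bb-"):]
    # Assumption: the contributor is the part before the first '-' or '_'.
    for sep in ("-", "_"):
        name = name.split(sep, 1)[0]
    return name


def tally(corpora):
    """Group corpus names by their guessed contributor."""
    groups = {}
    for corpus in corpora:
        groups.setdefault(contributor(corpus), []).append(corpus)
    return groups


if __name__ == "__main__":
    groups = tally(CORPORA)
    for name in sorted(groups):
        print(f"{name}: {', '.join(groups[name])}")
    print(f"{len(groups)} distinct contributors under this rule")
```

If the real convention differs (for example if "kpg-gah" and "kpg" are
separate people), only contributor() needs to change.
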
> For example, I'm in the process of learning to use mass-check to
> contribute back to SA (which implies a lot of hard work, simply to build
> and maintain valid ham/spam corpora, use mass-check, then hit-freq, then
> fp-fn-stat, I'm not even close to understand how to compute a re-score.

I don't know what fp-fn-stat is. You don't need to compute a re-score -
that's part of what is done with your masscheck data after you upload
it.

There's a relatively recently created mailing list specifically for
helping people with this stuff, to which I believe you automatically get
subscribed when you get a masscheck account:

http://wiki.apache.org/spamassassin/MailingLists#RuleQA

If you're having difficulty with it, the docs probably need improvement,
so do let us know. Your mention of fp-fn-stat makes me think you may
have veered a little too far from
https://wiki.apache.org/spamassassin/NightlyMassCheck

> with this, I'm not sure my contribution would be sufficient to make SA
> scores to be closer to my email traffic reality.

I think it would. For example, I'm sure, from what you've posted, that
you have enough examples of hams that hit DEAR_SOMETHING that its score
would drop significantly.

> Do you have any stat about how many contributors are giving a feedback
> on the masscheck? and about their geographical location? I'm just asking
> because I was not able to find this kind of information anywhere.

I believe they're almost all in the US, primarily English speakers.
That's bad.

-- 
"You only truly own what you can carry at a dead run."
- 14th & 15th century Landsknechts
http://www.ChaosReigns.com