Masscheck Re: Question about rule: 2.0 DEAR_SOMETHING BODY: Contains 'Dear (something)'

darxus Fri, 26 Oct 2012 09:19:16 -0700

On 10/26, Alexandre Boyer wrote:
> Well, discouraged was implicit (as is the fact that every admin is


I don't think there's anything implicit about it being discouraged to use a
threshold below 5.  There are lots of local changes which are far less
likely to cause problems, and which are encouraged.

> The SA rules scores are computed based on the mass-checks, from the
> project and, to some extent, from contributors. A good question is: how
> many contributors really give feedback on the mass-checks?

This is public information, although not very explicit.
On http://ruleqa.spamassassin.org/ look in the green box, which lists all
the corpora included:

  axb-coi-bulk
  axb-fraud
  axb-generic
  axb-ham-misc
  axb-sa-users
  axb-woas
  bb-guenther_fraud
  bb-jhardin
  bb-jhardin_fraud
  bb-jm
  bb-kmcgrail
  bb-zmi
  bpoliakoff
  danmcdonald
  darxus
  grenier
  jarif
  kpg-gah
  mas
  zmi

The ones starting with "bb-" are uploaded emails: instead of the contributor
running masscheck locally, it's run centrally.  Other than that, each
distinct prefix is a different contributor.  So:

axb, guenther, jhardin, jm, kmcgrail, zmi, bpoliakoff, danmcdonald,
darxus, grenier, jarif, kpg, mas.

13 masscheck contributors.  We'd probably benefit a lot by significantly
increasing that, which is why I mention it somewhat often.
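
For anyone who wants to redo that tally rather than eyeball it, here is a
rough Python sketch.  It assumes the contributor is whatever is left after
dropping an optional "bb-" and cutting at the first "-" or "_"; that is my
reading of the names above, not something ruleqa documents.  The function
name is made up for illustration.

```python
# Corpus names copied from the green box list quoted above.
CORPORA = [
    "axb-coi-bulk", "axb-fraud", "axb-generic", "axb-ham-misc",
    "axb-sa-users", "axb-woas", "bb-guenther_fraud", "bb-jhardin",
    "bb-jhardin_fraud", "bb-jm", "bb-kmcgrail", "bb-zmi", "bpoliakoff",
    "danmcdonald", "darxus", "grenier", "jarif", "kpg-gah", "mas", "zmi",
]


def contributor(corpus: str) -> str:
    """Guess the contributor behind a corpus name (assumed naming scheme)."""
    # Assumption: "bb-" only marks centrally run uploads, so drop it.
    name = corpus[3:] if corpus.startswith("bb-") else corpus
    # Assumption: the contributor is the part before the first "-" or "_".
    for sep in ("-", "_"):
        name = name.split(sep, 1)[0]
    return name


# A set collapses duplicates such as "zmi" and "bb-zmi" into one name.
contributors = sorted({contributor(c) for c in CORPORA})
print(len(contributors), ", ".join(contributors))
```

If the naming scheme ever changes (say, a contributor with a hyphen in their
own name), the cut-at-first-separator guess would need adjusting.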

> This is something I do not know, but the fewer they are, the greater the
> bias is. Bias in spam and ham samples. Emails reaching my servers are
> different from yours and from those of each and every SA user.

Absolutely.

> Unless everybody on earth runs a nightly mass-check and reports results to
> the SA project for it to compute a "world wide" scoring, there is a bias. At
> least this is my understanding; maybe I'm wrong, please correct me if so.

No, you're totally right.  We do what we can with what we have, and I think
we do pretty darn well.  But we could do better with more data.

> For example, I'm in the process of learning to use mass-check to
> contribute back to SA (which implies a lot of hard work, simply to build
> and maintain valid ham/spam corpora, use mass-check, then hit-freq, then
> fp-fn-stat; I'm not even close to understanding how to compute a re-score).

I don't know what fp-fn-stat is.  You don't need to compute a re-score -
that's part of what is done with your masscheck data after you upload it.

There's a relatively recently created mailing list specifically for helping
people with this stuff, to which I believe you automatically get subscribed
when you get a masscheck account:
http://wiki.apache.org/spamassassin/MailingLists#RuleQA

If you're having difficulty with it, the docs probably need improvement, so
do let us know.


Your mention of fp-fn-stat makes me think you may have veered a little too
far from https://wiki.apache.org/spamassassin/NightlyMassCheck

> with this, I'm not sure my contribution would be sufficient to make SA
> scores to be closer to my email traffic reality.

I think it would.  For example, I'm sure, from what you've posted, that you
have enough examples of hams that hit DEAR_SOMETHING that its score would
drop significantly.

> Do you have any stat about how many contributors are giving a feedback
> on the masscheck? and about their geographical location? I'm just asking
> because I was not able to find this kind of information anywhere.

I believe they're almost all in the US, primarily English speakers.  That's
bad.

-- 
"You only truly own what you can carry at a dead run."
- 14th & 15th century Landsknechts
http://www.ChaosReigns.com
