Re: Concerned with scores for from rfc-ignorant.org

John Andersen Fri, 13 Oct 2006 01:34:50 -0700

On Friday 13 October 2006 00:23, Justin Mason wrote:
> John D. Hardin writes:
> > On Thu, 12 Oct 2006, Kurt Fitzner wrote:
> > > For the purposes of SpamAssassin, it only matters if spam is
> > > filtered and ham is let through.  As I keep harping on, I don't
> > > think it's SpamAssassin's job to crusade for abuse@/postmaster@
> > > compliance.
> > >
> > > The rules in question almost by definition don't address spam,
> > > they address whether people are peeved at how hard it is to
> > > contact a domain's postmaster.  Which is why I dispute the score
> > > attached to them.
> >
> > Those rules *do* address spam. As was explained, across the entire
> > corpus the RFCI results are a reliable enough spam indicator to
> > justify the score. If the scores weren't based on masscheck results
> > then you might be able to argue that they were assigned on an
> > emotional basis to forward a given agenda.
>
> Exactly, thanks John!
>
> > > The corpus for ham is almost four years old.  Does it address the
> > > current email volumes that are sent today?  I downloaded and
> > > checked the latest hard_ham, and it has zero emails sent from
> > > yahoo.com.
> >
> > THAT is a valid basis for objection.
> >
> > > If you want to have and justify rules that target RFC compliance,
> > > then there needs to be justification that outgoing spam volumes
> > > and RFC 2821 compliance are linked.  I make the claim that a major
> > > source of ham email is getting dangerously high spam scores and
> > > that there is little to nothing in the corpus that is aimed at
> > > preventing this particular rule from malfunctioning.
> >
> > ...except your posts so far have been ranting about RFCI itself far
> > more than suggesting the corpus is stale.
> >
> > The corpus may indeed be stale. If that's the case then the problem
> > extends far beyond the RFCI rules as the base scores for *all* rules
> > are based on the corpus.
> >
> > However,
> >
> >   http://wiki.apache.org/spamassassin/RescoringProcess
> >
> > says that score assignment is based on volunteers masschecking
> > against their own corpora, which likely are fairly current.
> >
> > Can anybody provide information on how current the contributor corpora
> > are?
>
> Very -- mine for example are typically 24 hours old, max.
> These issues are very well represented there.
>
> Please bear in mind, also, that there are 5 different rules that
> use RFCI data, and they have wildly varying accuracies and scores:
>
>   SPAM%    HAM%    S/O   RANK  SCORE  NAME
>  3.7247  0.0540  0.986   0.85   2.60  DNS_FROM_RFC_DSN
>  2.2447  0.1700  0.930   0.73   1.94  DNS_FROM_RFC_BOGUSMX
> 15.1533  4.6068  0.767   0.51   1.45  DNS_FROM_RFC_POST
> 18.6219  8.6003  0.684   0.49   1.71  DNS_FROM_RFC_ABUSE
>  6.4258  4.0476  0.614   0.48   0.20  DNS_FROM_RFC_WHOIS
>
> DNS_FROM_RFC_DSN fires on 3.7247% of spam and only 0.0540% of ham, giving
> it an S/O of 0.986, i.e. an accuracy of about 98.6%.
>
> OTOH, DNS_FROM_RFC_POST, DNS_FROM_RFC_ABUSE, and DNS_FROM_RFC_WHOIS will
> likely not make it into the next release going by those rates.
>
> Those are "live" results from our mass-checks (see wiki for details).
>
> --j.
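
For anyone checking the numbers: the S/O column quoted above looks
consistent with spam% / (spam% + ham%), i.e. the share of a rule's hits
that land on spam when both hit rates are weighted equally. A minimal
Python sketch under that assumption (the function name s_o and the rows
list are illustrative only, not part of SpamAssassin or its mass-check
tools):

```python
# Hit rates copied from the quoted table: (rule name, SPAM%, HAM%).
rows = [
    ("DNS_FROM_RFC_DSN", 3.7247, 0.0540),
    ("DNS_FROM_RFC_BOGUSMX", 2.2447, 0.1700),
    ("DNS_FROM_RFC_POST", 15.1533, 4.6068),
    ("DNS_FROM_RFC_ABUSE", 18.6219, 8.6003),
    ("DNS_FROM_RFC_WHOIS", 6.4258, 4.0476),
]


def s_o(spam_pct, ham_pct):
    """Assumed definition: spam hit rate over combined spam and ham hit rate."""
    return spam_pct / (spam_pct + ham_pct)


for name, spam_pct, ham_pct in rows:
    # Print each rule with its recomputed S/O, rounded like the table.
    print(f"{name:22s} {s_o(spam_pct, ham_pct):.3f}")
```

Under this reading, a rule that hits spam and ham at equal rates would
score 0.5, so the last three rules are not far above a coin flip.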


Thanks for confirming Kurt's complaint.

-- 
_____________________________________
John Andersen
