Re: Concerned with scores for from rfc-ignorant.org

Justin Mason Fri, 13 Oct 2006 01:23:56 -0700

John D. Hardin writes:
> On Thu, 12 Oct 2006, Kurt Fitzner wrote:
> > For the purposes of SpamAssassin, it only matters if spam is
> > filtered and ham is let through.  As I keep harping on, I don't
> > think it's SpamAssassin's job to crusade for abuse@/postmaster@
> > compliance.
> >
> > The rules in question almost by definition don't address spam,
> > they address whether people are peeved at how hard it is to
> > contact a domain's postmaster.  Which is why I dispute the score
> > attached to them.
> 
> Those rules *do* address spam. As was explained, across the entire
> corpus the RFCI results are a reliable enough spam indicator to
> justify the score. If the scores weren't based on masscheck results
> then you might be able to argue that they were assigned on an
> emotional basis to forward a given agenda.


Exactly, thanks John!

> > The corpus for ham is almost four years old.  Does it address the
> > current email volumes that are sent today?  I downloaded and
> > checked the latest hard_ham, and it has zero emails sent from
> > yahoo.com.
> 
> THAT is a valid basis for objection.
> 
> > If you want to have and justify rules that target RFC compliance,
> > then there needs to be justification that outgoing spam volumes
> > and RFC 2821 compliance are linked.  I make the claim that a major
> > source of ham email is getting dangerously high spam scores and
> > that there is little to nothing in the corpus that is aimed at
> > preventing this particular rule from malfunctioning.
> 
> ...except your posts so far have been far more ranting about RFCI
> itself rather than suggesting the corpus is stale.
> 
> The corpus may indeed be stale. If that's the case then the problem
> extends far beyond the RFCI rules as the base scores for *all* rules
> are based on the corpus.
> 
> However,
> 
>   http://wiki.apache.org/spamassassin/RescoringProcess
>  
> says that score assignment is based on volunteers masschecking
> against their own corpora, which likely are fairly current.
> 
> Can anybody provide information on how current the contributor corpora
> are?

Very -- mine for example are typically 24 hours old, max.
These issues are very well represented there.

Please bear in mind, also, that there are 5 different rules that
use RFCI data, and they have wildly varying accuracies and scores:

  SPAM%    HAM%    S/O   RANK  SCORE  NAME
 3.7247  0.0540  0.986   0.85   2.60  DNS_FROM_RFC_DSN
 2.2447  0.1700  0.930   0.73   1.94  DNS_FROM_RFC_BOGUSMX
15.1533  4.6068  0.767   0.51   1.45  DNS_FROM_RFC_POST
18.6219  8.6003  0.684   0.49   1.71  DNS_FROM_RFC_ABUSE
 6.4258  4.0476  0.614   0.48   0.20  DNS_FROM_RFC_WHOIS

DNS_FROM_RFC_DSN fires on 3.7247% of spam, and only 0.054% of ham, giving
it an S/O ratio of 0.986, i.e. an accuracy of 98.6%.
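
For anyone who wants to check the S/O column by hand, here is a minimal sketch.
It assumes S/O is simply SPAM% / (SPAM% + HAM%); that formula is an assumption
on my part, not taken from the mass-check tooling, but it does reproduce the
figures in the table. The function name s_over_o is made up for illustration.

```python
# Rates copied from the table above: rule name -> (SPAM%, HAM%).
RATES = {
    "DNS_FROM_RFC_DSN": (3.7247, 0.0540),
    "DNS_FROM_RFC_BOGUSMX": (2.2447, 0.1700),
    "DNS_FROM_RFC_POST": (15.1533, 4.6068),
    "DNS_FROM_RFC_ABUSE": (18.6219, 8.6003),
    "DNS_FROM_RFC_WHOIS": (6.4258, 4.0476),
}


def s_over_o(spam_pct, ham_pct):
    """Assumed definition: fraction of a rule's hit rate that comes from spam."""
    return spam_pct / (spam_pct + ham_pct)


if __name__ == "__main__":
    for name, (spam_pct, ham_pct) in RATES.items():
        print(f"{name:<22} {s_over_o(spam_pct, ham_pct):.3f}")
```

Because it works on hit percentages rather than raw message counts, the ratio
does not depend on how much spam versus ham is in the corpus.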

OTOH, DNS_FROM_RFC_POST, DNS_FROM_RFC_ABUSE, and DNS_FROM_RFC_WHOIS will
likely not make it into the next release going by those rates.

Those are "live" results from our mass-checks (see wiki for details).

--j.
