John D. Hardin writes:
> On Thu, 12 Oct 2006, Kurt Fitzner wrote:
>
> > For the purposes of SpamAssassin, it only matters if spam is
> > filtered and ham is let through. As I keep harping on, I don't
> > think it's SpamAssassin's job to crusade for abuse@/postmaster@
> > compliance.
> >
> > The rules in question almost by definition don't address spam,
> > they address whether people are peeved at how hard it is to
> > contact a domain's postmaster. Which is why I dispute the score
> > attached to them.
>
> Those rules *do* address spam. As was explained, across the entire
> corpus the RFCI results are a reliable enough spam indicator to
> justify the score. If the scores weren't based on masscheck results
> then you might be able to argue that they were assigned on an
> emotional basis to forward a given agenda.
Exactly, thanks John!

> > The corpus for ham is almost four years old. Does it address the
> > current email volumes that are sent today? I downloaded and
> > checked the latest hard_ham, and it has zero emails sent from
> > yahoo.com.
>
> THAT is a valid basis for objection.
>
> > If you want to have and justify rules that target RFC compliance,
> > then there needs to be justification that outgoing spam volumes
> > and RFC 2821 compliance are linked. I make the claim that a major
> > source of ham email is getting dangerously high spam scores and
> > that there is little to nothing in the corpus that is aimed at
> > preventing this particular rule from malfunctioning.
>
> ...except your posts so far have been far more ranting about RFCI
> itself rather than suggesting the corpus is stale.
>
> The corpus may indeed be stale. If that's the case then the problem
> extends far beyond the RFCI rules as the base scores for *all* rules
> are based on the corpus.
>
> However,
>
> http://wiki.apache.org/spamassassin/RescoringProcess
>
> says that score assignment is based on volunteers masschecking
> against their own corpora, which likely are fairly current.
>
> Can anybody provide information on how current the contributor corpora
> are?

Very -- mine for example are typically 24 hours old, max. These
issues are very well represented there.

Please bear in mind, also, that there are 5 different rules that use
RFCI data, and they have wildly varying accuracies and scores:

  SPAM%     HAM%     S/O    RANK   SCORE  NAME
  3.7247    0.0540   0.986  0.85   2.60   DNS_FROM_RFC_DSN
  2.2447    0.1700   0.930  0.73   1.94   DNS_FROM_RFC_BOGUSMX
 15.1533    4.6068   0.767  0.51   1.45   DNS_FROM_RFC_POST
 18.6219    8.6003   0.684  0.49   1.71   DNS_FROM_RFC_ABUSE
  6.4258    4.0476   0.614  0.48   0.20   DNS_FROM_RFC_WHOIS

DNS_FROM_RFC_DSN fires on 3.7247% of spam, and only 0.054% of ham,
giving it an accuracy of 98.6%.
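For anyone who wants to check the S/O column by hand: the figures above are consistent with S/O being the spam hit rate divided by the sum of the spam and ham hit rates. The sketch below assumes that formula (it is inferred from the table, not quoted from the mass-check scripts), and the function name s_o_ratio is hypothetical.

```python
# Minimal sketch: recompute the S/O column from the SPAM% and HAM% columns.
# Assumption: S/O = SPAM% / (SPAM% + HAM%). This is inferred from the table
# in this message; the real mass-check tooling may compute or weight it
# differently. s_o_ratio is a hypothetical helper name.

def s_o_ratio(spam_pct: float, ham_pct: float) -> float:
    """Return the fraction of a rule's hits that land on spam."""
    total = spam_pct + ham_pct
    if total == 0:
        raise ValueError("rule never fired, S/O is undefined")
    return spam_pct / total


# (rule name, SPAM%, HAM%) copied from the table in this message.
RULES = [
    ("DNS_FROM_RFC_DSN", 3.7247, 0.0540),
    ("DNS_FROM_RFC_BOGUSMX", 2.2447, 0.1700),
    ("DNS_FROM_RFC_POST", 15.1533, 4.6068),
    ("DNS_FROM_RFC_ABUSE", 18.6219, 8.6003),
    ("DNS_FROM_RFC_WHOIS", 6.4258, 4.0476),
]

for name, spam_pct, ham_pct in RULES:
    print(f"{name:22s} {s_o_ratio(spam_pct, ham_pct):.3f}")
```

Under that assumption the output matches the S/O column to three decimal places, which is where the 98.6% for DNS_FROM_RFC_DSN comes from. Note that the ratio uses hit percentages, so it does not depend on the spam and ham corpora being the same size.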
OTOH, DNS_FROM_RFC_POST, DNS_FROM_RFC_ABUSE, and DNS_FROM_RFC_WHOIS
will likely not make it into the next release going by those rates.

Those are "live" results from our mass-checks (see wiki for details).

--j.