On Friday 13 October 2006 00:23, Justin Mason wrote:
> John D. Hardin writes:
> > On Thu, 12 Oct 2006, Kurt Fitzner wrote:
> > > For the purposes of SpamAssassin, it only matters if spam is
> > > filtered and ham is let through. As I keep harping on, I don't
> > > think it's SpamAssassin's job to crusade for abuse@/postmaster@
> > > compliance.
> > >
> > > The rules in question almost by definition don't address spam,
> > > they address whether people are peeved at how hard it is to
> > > contact a domain's postmaster. Which is why I dispute the score
> > > attached to them.
> >
> > Those rules *do* address spam. As was explained, across the entire
> > corpus the RFCI results are a reliable enough spam indicator to
> > justify the score. If the scores weren't based on masscheck results
> > then you might be able to argue that they were assigned on an
> > emotional basis to forward a given agenda.
>
> Exactly, thanks John!
>
> > > The corpus for ham is almost four years old. Does it address the
> > > current email volumes that are sent today? I downloaded and
> > > checked the latest hard_ham, and it has zero emails sent from
> > > yahoo.com.
> >
> > THAT is a valid basis for objection.
> >
> > > If you want to have and justify rules that target RFC compliance,
> > > then there needs to be justification that outgoing spam volumes
> > > and RFC 2821 compliance are linked. I make the claim that a major
> > > source of ham email is getting dangerously high spam scores and
> > > that there is little to nothing in the corpus that is aimed at
> > > preventing this particular rule from malfunctioning.
> >
> > ...except your posts so far have been far more ranting about RFCI
> > itself rather than suggesting the corpus is stale.
> >
> > The corpus may indeed be stale. If that's the case then the problem
> > extends far beyond the RFCI rules as the base scores for *all* rules
> > are based on the corpus.
> >
> > However,
> >
> > http://wiki.apache.org/spamassassin/RescoringProcess
> >
> > says that score assignment is based on volunteers masschecking
> > against their own corpora, which likely are fairly current.
> >
> > Can anybody provide information on how current the contributor corpora
> > are?
>
> Very -- mine for example are typically 24 hours old, max.
> These issues are very well represented there.
>
> Please bear in mind, also, that there are 5 different rules that
> use RFCI data, and they have wildly varying accuracies and scores:
>
>   SPAM%     HAM%     S/O    RANK   SCORE  NAME
>   3.7247   0.0540   0.986   0.85    2.60  DNS_FROM_RFC_DSN
>   2.2447   0.1700   0.930   0.73    1.94  DNS_FROM_RFC_BOGUSMX
>  15.1533   4.6068   0.767   0.51    1.45  DNS_FROM_RFC_POST
>  18.6219   8.6003   0.684   0.49    1.71  DNS_FROM_RFC_ABUSE
>   6.4258   4.0476   0.614   0.48    0.20  DNS_FROM_RFC_WHOIS
>
> DNS_FROM_RFC_DSN fires on 3.7247% of spam, and only 0.054% of ham, giving
> it an accuracy of 98.6%.
>
> OTOH, DNS_FROM_RFC_POST, DNS_FROM_RFC_ABUSE, and DNS_FROM_RFC_WHOIS will
> likely not make it into the next release going by those rates.
>
> Those are "live" results from our mass-checks (see wiki for details).
>
> --j.

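For anyone checking the figures in the quoted table: the S/O column looks consistent with SPAM% / (SPAM% + HAM%), which is presumably where the "98.6%" quoted for DNS_FROM_RFC_DSN comes from. That formula is an assumption inferred from the numbers, not something stated in the thread. The function names below are made up for illustration. A minimal Python sketch that recomputes the column from the quoted rates:

```python
# Hit-frequency figures copied from the table quoted above:
# (rule name, SPAM%, HAM%, reported S/O)
RULES = [
    ("DNS_FROM_RFC_DSN",     3.7247, 0.0540, 0.986),
    ("DNS_FROM_RFC_BOGUSMX", 2.2447, 0.1700, 0.930),
    ("DNS_FROM_RFC_POST",   15.1533, 4.6068, 0.767),
    ("DNS_FROM_RFC_ABUSE",  18.6219, 8.6003, 0.684),
    ("DNS_FROM_RFC_WHOIS",   6.4258, 4.0476, 0.614),
]


def s_over_o(spam_pct, ham_pct):
    """Assumed definition: the spam share of the two hit rates."""
    return spam_pct / (spam_pct + ham_pct)


def check_table(rules=RULES):
    """Return (name, recomputed S/O rounded to 3 places, reported S/O)."""
    return [
        (name, round(s_over_o(spam, ham), 3), reported)
        for name, spam, ham, reported in rules
    ]


if __name__ == "__main__":
    for name, computed, reported in check_table():
        print(f"{name:<22} computed={computed:.3f} reported={reported:.3f}")
```

Note that this ratio weights the spam and ham hit percentages equally, so it says nothing by itself about how a rule behaves on mail from any one sender.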
Thanks for confirming Kurt's complaint.

-- 
_____________________________________
John Andersen