My 3.01 (bog-standard, no changes to defaults yet) testbed machine scored them at 4.9 and let them through!
X-Spam-Checker-Version: SpamAssassin 3.0.1 (2004-10-22) on
myother.do.main
X-Spam-Level: ****
X-Spam-Status: No, score=4.9 required=5.0 tests=ALL_TRUSTED,DNS_FROM_RFC_POST,
	FROM_ENDS_IN_NUMS,FROM_HAS_ULINE_NUMS,HTML_60_70,
	HTML_FONT_LOW_CONTRAST,HTML_IMAGE_ONLY_08,HTML_MESSAGE,
	HTTP_ESCAPED_HOST,HTTP_EXCESSIVE_ESCAPES,MIME_HTML_ONLY,
	NORMAL_HTTP_TO_IP,WEIRD_PORT autolearn=no version=3.0.1
I'm sure the "ALL_TRUSTED" isn't helping any, but that doesn't completely explain the 6.2 drop in score.
The ALL_TRUSTED hit is likely a misconfiguration, or, more specifically, a lack of required configuration. That's a pretty heavy impact on the score, and one you can fix easily: at -3.3 it's half of your "problem".
If you have a NATed mailserver, you MUST set trusted_networks manually. SA cannot reliably work out where your network border is in every case, so it assumes the first non-reserved IP in the Received chain is your border MX. If your mailserver is NATed, that assumption makes SA trust an outside host. Not good.
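For example, a minimal local.cf sketch (the addresses here are hypothetical placeholders; substitute your real border MX and internal range):

  # local.cf -- hypothetical addresses; substitute your own
  # public IP of your border MX:
  trusted_networks 203.0.113.25
  # the NATed LAN your mailserver actually sits on:
  trusted_networks 192.168.0.0/24

With that in place the NATed hop is trusted deliberately rather than by accident, and ALL_TRUSTED can't fire on mail that really originated outside.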
Bayes also seems to be missing from your 3.0 results, and that's another heavy hit: the originals hit BAYES_99, while this one missed entirely. That's 1.88 points using 3.x scores.
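A quick way to sanity-check Bayes on a stock 3.0 box (sa-learn and these option names are standard; this assumes the default per-user database):

  # in local.cf -- these are already the 3.0 defaults
  use_bayes 1
  bayes_auto_learn 1

Then run:

  sa-learn --dump magic

and look at the nspam/nham counts. Bayes won't produce any BAYES_* hits until it has learned enough mail (by default, at least 200 spam and 200 ham).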
So right off the bat you can account for over 5 points of the score difference (3.3 + 1.88 = 5.18 of the 6.2-point drop).
As for the remaining 1.2 points: SA doesn't have any criteria to make scores ramp up to some arbitrarily high target. Scores are assigned to place spam and ham in the right bins at a threshold of 5.0.
How high a given spam scores once it's over 5.0 is just an accident of how the rules stack up after the score optimization finishes balancing FPs against FNs.
You can look at STATISTICS-set3.txt and see the big differences between the two releases. The difference widens considerably as scores ramp up.
SA 3.0.1 has a much lower overall score bias than 2.64 has, but it's got a better FP ratio at 5.0, making it slightly higher in the overall precision ratings.
SA scores are tuned on the premise that FPs are much worse than FNs. In theory the guideline for the optimizer is that an FP is 100 times as bad as an FN, and SA 3.0's performance tracks that more closely than 2.64's does (2.64 runs about a 10:1 FN:FP ratio, 3.0.1 about 78:1).
3.0.1-set3:

# SUMMARY for threshold 5.0:
# Correctly non-spam: 29443 99.97%
# Correctly spam: 27220 97.53%
# False positives: 9 0.03%
# False negatives: 688 2.47%
# TCR(l=50): 24.523726 SpamRecall: 97.535% SpamPrec: 99.967%
2.64-set3:

# SUMMARY for threshold 5.0:
# Correctly non-spam: 15550 46.59% (99.90% of non-spam corpus)
# Correctly spam: 17648 52.87% (99.08% of spam corpus)
# False positives: 15 0.04% (0.10% of nonspam, 1133 weighted)
# False negatives: 164 0.49% (0.92% of spam, 437 weighted)
# TCR: 74.527197 SpamRecall: 99.079% SpamPrec: 99.915% FP: 0.04% FN: 0.49%
Look at the percentages of correctly non-spam. Based on these statistics, SA 3.0.1 will FP on about 3 in 10k messages; 2.64 will FP on about 10 in 10k, over three times 3.0.1's FP rate.
(Don't be misled by the percentages on the "false positives" line. That's a percentage of the whole corpus, and it's skewed by the spam/ham ratio of the corpus itself.)
3.0.1, however, has nearly 3 times the FN rate (247 vs 92 per 10k spam). Still, the difference here is smaller (3.3 times the FPs vs about 2.7 times the FNs).
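Spelling that out from the two SUMMARY blocks above:

  FP rate:  0.10% (2.64)  vs 0.03% (3.0.1)  ->  10/3   ~ 3.3x
  FN rate:  2.47% (3.0.1) vs 0.92% (2.64)   ->  247/92 ~ 2.7x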
Which matters more? FPs or FNs? I'd still go for avoiding FPs.