SA's scores are evolved by a genetic algorithm (GA) against a corpus of real email. No rule is added to SA unless it appears to be a good spam or nonspam indicator, and occasionally rules get dropped because they don't have a strong S/O ratio when tested against the corpus.
Most spam is generated by badly written tools which violate the RFCs a lot. Let's face it, spam tools are optimized to generate as much email as fast as possible, and aren't designed for any reasonable degree of protocol correctness.
Most nonspam email comes from reasonably RFC-compliant MUAs which don't produce things like invalid date encodings, 8-bit headers, nonexistent timezones, etc.
So some RFC violations are actually a VERY good indicator of spam. Sure, some RFC violations are common even in "good" MUAs, but that's why we have a GA in the SpamAssassin scoring system. It's the differences between the emails generated by most nonspam MUAs and most spam tools that drive the rules. So it's really got nothing to do with how "RFC correct" or incorrect the emails are; it's got to do with the real, measurable differences between the two kinds of email.
If broken MUAs and MTAs are normal in your traffic, you might have to hand-tweak your SA scores to be more tolerant of them.
But if you want justification for why the score is so high, look at STATISTICS.txt for SA 2.43:
STATISTICS.txt: 1.172 7.161 0.002 1.00 0.94 3.80 SUBJ_FULL_OF_8BITS
STATISTICS.txt: 1.091 6.636 0.008 1.00 0.82 2.44 HEADER_8BITS
This means that SUBJ_FULL_OF_8BITS matched 7.161% of the total spam corpus and 0.002% of the nonspam corpus. That's a pretty strong swing with lots of spam hits, and very few nonspam hits. Hence the high score the GA generated is pretty well placed.
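To make the arithmetic concrete, here is a minimal Python sketch that recomputes the S/O ratio from the two lines quoted above. The column names, and the formula S/O = spam% / (spam% + nonspam%), are my assumptions rather than anything taken from the SA source; they do agree with the 1.00 listed for both rules and with the 3.8 and 2.4 point scores in the report below. The helper names are made up for illustration.

```python
# The two lines quoted above from STATISTICS.txt (SA 2.43), minus the
# "STATISTICS.txt:" prefix.
STATS_LINES = [
    "1.172 7.161 0.002 1.00 0.94 3.80 SUBJ_FULL_OF_8BITS",
    "1.091 6.636 0.008 1.00 0.82 2.44 HEADER_8BITS",
]


def parse_stats_line(line):
    """Split one line into named fields.

    Assumed column order: overall%, spam%, nonspam%, S/O, rank, score, name.
    Only the spam% and nonspam% columns are confirmed by the text above.
    """
    overall, spam, nonspam, so, rank, score, name = line.split()
    return {
        "name": name,
        "spam_pct": float(spam),
        "nonspam_pct": float(nonspam),
        "listed_so": float(so),
        "score": float(score),
    }


def so_ratio(spam_pct, nonspam_pct):
    """Share of a rule's hits that land on spam (assumed definition of S/O)."""
    total = spam_pct + nonspam_pct
    return spam_pct / total if total else 0.0


for line in STATS_LINES:
    rule = parse_stats_line(line)
    computed = so_ratio(rule["spam_pct"], rule["nonspam_pct"])
    print(
        f"{rule['name']}: computed S/O {computed:.4f}, "
        f"listed {rule['listed_so']:.2f}, score {rule['score']}"
    )
```

Both rules come out within rounding of 1.00, which is about as one-sided as a rule can get. If your own mail looks different, run the same sum on your own hit rates before deciding how far to lower a score.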
At 03:14 PM 1/24/2003 -0500, [EMAIL PROTECTED] wrote:
-----Original Message-----
From: Vivek Khera [mailto:[EMAIL PROTECTED]
Sent: Friday, January 24, 2003 11:21 AM
To: [EMAIL PROTECTED]
Subject: Re: [SAtalk] False positive for foreign language
>>>>> "J" == Jchen <[EMAIL PROTECTED]> writes:
J> Hi, it looks like spamassassin is giving false positive on emails written in foreign language, e.g. Korean, Chinese...
J> What configurations do i need to make? Thanks in advance.
J> SPAM: HEADER_8BITS (2.4 points) Headers include 3 consecutive 8-bit characters
J> SPAM: BASE64_ENC_TEXT (1.4 points) RAW: Message text disguised using base-64 encoding
J> SPAM: CARRIAGE_RETURNS (0.3 points) RAW: Message contains a lot of ^M characters
J> SPAM: SUBJ_FULL_OF_8BITS (3.8 points) Subject is full of 8-bit characters
V> The first and last indicate RFC violation. The headers are *not*
V> permitted to contain non-ASCII characters. As much as you may think
V> that may suck, them's the rules. The other two, well, they could use
V> QP encoding, couldn't they?
Let's forget RFC for a second...
"Headers include 3 consecutive 8-bit characters" and "Subject is full of 8-bit characters" contribute 2.4 and 3.8 points respectively to the total score. Are we punishing the message twice for the same mistake?
I don't understand why SA sets such a high score for an RFC violation. Does it have anything to do with spam? If I set the score to zero, how big is the chance that a spam gets past the scanning?
------------------------------------------------------- This SF.NET email is sponsored by: SourceForge Enterprise Edition + IBM + LinuxWorld = Something 2 See! http://www.vasoftware.com _______________________________________________ Spamassassin-talk mailing list [EMAIL PROTECTED] https://lists.sourceforge.net/lists/listinfo/spamassassin-talk