Matthew Cline <[EMAIL PROTECTED]> writes:

> For those of you who find that English-centricity helps to filter spam,
> here's a rule that looks for non-ASCII encoding in the subject line:
>
>   header   NON_ASCII_ENC_SUBJ  Subject =~ /=\?(?:euc-kr|big5|iso-8859-1)\?/
>   describe NON_ASCII_ENC_SUBJ  Non-ASCII encoded subject
>
> It just does EUC Korean, Big5 Chinese and ISO Western encodings now,
> but it's easy enough to add other encodings.

Actually, iso-8859-1 is used for English too (it covers English along
with the other Western European languages). Also, some non-spam mail
programs unwittingly use iso-8859-1 encoding in the Subject: line for
plain old ASCII.

This US/English-specific approach is fundamentally broken. SpamAssassin
should be able to figure out the predominant MIME encoding of a user's
email and score uncommon ones differently.

I guess I'll just have to code something up. Here's a more complete
version of the idea I mentioned earlier:

For each of the following attributes, keep two counters per value. One
is total messages and one is total tagged as spam.

 - top-level domain
 - encoding of body (also track "no encoding")
 - Subject: MIME encoding type (also track "not encoded")
 - Content-Type: header (also track "not included")
 - Content-Transfer-Encoding: header (also track "not included")

For example, these are real numbers by domain from one of my email
addresses. For these numbers, I just used the domain from the ^From
line.

  $domain_count{'com'} = 254517;
  $domain_count{'net'} = 20019;
  $domain_count{'org'} = 119010;
  $domain_count{'cn'} = 149;
  $domain_count{'ca'} = 121;
  $domain_count{'jp'} = 211;
  $domain_count{'de'} = 4364;
  $domain_count{'tw'} = 115;
  $domain_count{'ws'} = 39;
  $domain_count{'ru'} = 85;
  $domain_count{'br'} = 199;

  $domain_spam{'com'} = 907;
  $domain_spam{'net'} = 176;
  $domain_spam{'org'} = 31;
  $domain_spam{'cn'} = 36;
  $domain_spam{'ca'} = 15;
  $domain_spam{'jp'} = 22;
  $domain_spam{'de'} = 25;
  $domain_spam{'tw'} = 23;
  $domain_spam{'ws'} = 26;
  $domain_spam{'ru'} = 20;
  $domain_spam{'br'} = 15;

Then you calculate spam percentages (spam / total for each domain):

  ws  = 66.67 %
  cn  = 24.16 %
  ru  = 23.53 %
  tw  = 20.00 %
  ca  = 12.40 %
  jp  = 10.43 %
  br  = 07.54 %
  net = 00.88 %
  de  = 00.57 %
  com = 00.36 %
  org = 00.03 %

Domains with the worst percentages (at the top) get penalized. Yes, the
majority of my spam is .com in absolute terms (907 messages), but that
is only 0.36 % of my .com mail, so you have to look at the percentages
to derive any meaningful information.
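
As a rough sketch of the bookkeeping described above (written in Python
rather than Perl, purely for illustration), the two counters per
attribute value and the worst-first ranking might look like the
following. The names AttributeStats, spam_percentages and
rank_worst_first are made up for this example and are not part of
SpamAssassin; the numbers are the per-domain figures listed above.

```python
from collections import Counter


class AttributeStats:
    """Two counters per attribute value: total messages and spam messages.

    Hypothetical helper for illustration; not a SpamAssassin API.
    """

    def __init__(self):
        self.total = Counter()
        self.spam = Counter()

    def record(self, value, is_spam):
        # Every message counts toward the total; tagged ones also count as spam.
        self.total[value] += 1
        if is_spam:
            self.spam[value] += 1


def spam_percentages(total, spam):
    """Return {value: percent of messages tagged as spam}."""
    return {
        value: 100.0 * spam.get(value, 0) / count
        for value, count in total.items()
        if count > 0  # skip values with no messages to avoid dividing by zero
    }


def rank_worst_first(percentages):
    """Sort (value, percent) pairs with the highest spam percentage first."""
    return sorted(percentages.items(), key=lambda item: item[1], reverse=True)


# Per-domain figures copied from the message above.
domain_count = {
    "com": 254517, "net": 20019, "org": 119010, "cn": 149, "ca": 121,
    "jp": 211, "de": 4364, "tw": 115, "ws": 39, "ru": 85, "br": 199,
}
domain_spam = {
    "com": 907, "net": 176, "org": 31, "cn": 36, "ca": 15,
    "jp": 22, "de": 25, "tw": 23, "ws": 26, "ru": 20, "br": 15,
}

percentages = spam_percentages(domain_count, domain_spam)
ranked = rank_worst_first(percentages)

for tld, pct in ranked:
    print(f"{tld:>4} = {pct:5.2f} %")
```

The same AttributeStats object could be kept once per attribute (domain,
body encoding, Subject: encoding, and so on), calling record() for each
processed message. How the resulting percentage should map to a score is
left open here.
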

(BTW, my spam percentages are probably worse than they appear here,
since my spam folder was half-heartedly hand-filtered prior to SA.)

If someone wants to go gung-ho, they can add a rule to penalize *any*
uncommon domain/encoding/etc., regardless of the number of spam
messages. In fact, I think if the GA (genetic algorithm) is determining
the score, we could safely have that rule on by default.

Regarding the initialization problem, we can have a mode that puts the
regular auto-whitelist in read-only mode (I already put in a feature
request for this), and then you can mass-process a sufficient amount of
your past email without any negative impact.

- Dan

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk