Re: [SAtalk] Another English-centric rule

Daniel Quinlan Tue, 05 Mar 2002 12:37:02 -0800

Matthew Cline <[EMAIL PROTECTED]> writes:

> For those of you who find that English-centricity helps to filter spam,
> here's a rule that looks for non-ASCII encoding in the subject line:
>
> header   NON_ASCII_ENC_SUBJ     Subject =~ /=\?(?:euc-kr|big5|iso-8859-1)\?/
> describe NON_ASCII_ENC_SUBJ     Non-ASCII encoded subject
>
> It just does EUC Korean, Big5 Chinese and ISO Western encodings now,
> but it's easy enough to add other encodings.


Actually, iso-8859-1 (Latin-1) is what English-language mail uses too.
Also, some non-spam mail programs unwittingly use iso-8859-1 encoding
in the Subject: line for plain old ASCII.
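
A minimal sketch of that second point, in Python rather than as
SpamAssassin code.  The sample subject is made up, and the helper names
rule_hits and decoded_is_ascii are hypothetical.  It shows a subject that
matches the quoted pattern yet decodes to plain ASCII:

```python
import re
from email.header import decode_header

# Same pattern as the quoted NON_ASCII_ENC_SUBJ rule.
RULE = re.compile(r"=\?(?:euc-kr|big5|iso-8859-1)\?")

# Hypothetical subject, as a mailer that labels every header
# iso-8859-1 might send it.
subject = "=?iso-8859-1?Q?Meeting_notes?="


def rule_hits(raw_subject):
    """True if the raw Subject: value matches the rule's pattern."""
    return RULE.search(raw_subject) is not None


def decoded_is_ascii(raw_subject):
    """True if every decoded character of the subject is 7-bit ASCII."""
    for chunk, charset in decode_header(raw_subject):
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or "ascii", errors="replace")
        if any(ord(ch) > 127 for ch in chunk):
            return False
    return True


print(rule_hits(subject), decoded_is_ascii(subject))
```

Both checks come back true for this subject, so the rule would fire on a
message whose subject contains no non-ASCII text at all.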

This US/English-specific approach is fundamentally broken.  SpamAssassin
should be able to figure out the predominant MIME encoding of each
user's email and score uncommon ones differently.

I guess I'll just have to code something up.  Here's a more complete
version of the idea I mentioned earlier:

For each of the following attributes, keep two counters.  One is total
messages and one is total tagged as spam.

  - top-level domain
  - encoding of body (also track "no encoding")
  - Subject: MIME encoding type (also track "not encoded")
  - Content-Type: header (also track "not included")
  - Content-Transfer-Encoding: header (also track "not included")
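
The two-counter bookkeeping described above could be sketched roughly
like this.  This is an illustration in Python, not SpamAssassin code; the
class name, attribute keys and the "(none)" placeholder are all
assumptions made for the example:

```python
from collections import defaultdict

# Hypothetical keys for the five attributes listed above.
ATTRIBUTES = (
    "tld",
    "body_encoding",
    "subject_encoding",
    "content_type",
    "content_transfer_encoding",
)


class AttributeCounters:
    """Per attribute value: total messages seen and messages tagged spam."""

    def __init__(self):
        self.total = {attr: defaultdict(int) for attr in ATTRIBUTES}
        self.spam = {attr: defaultdict(int) for attr in ATTRIBUTES}

    def record(self, values, is_spam):
        """values maps attribute name to the value found in one message.

        A missing or empty value is counted under "(none)", which covers
        the "no encoding" / "not encoded" / "not included" cases.
        """
        for attr in ATTRIBUTES:
            value = values.get(attr) or "(none)"
            self.total[attr][value] += 1
            if is_spam:
                self.spam[attr][value] += 1


counters = AttributeCounters()
counters.record({"tld": "com", "subject_encoding": "iso-8859-1"}, is_spam=False)
counters.record({"tld": "ws", "subject_encoding": "euc-kr"}, is_spam=True)
```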

For example, these are real numbers by domain from one of my email
addresses.  For these numbers, I just used the domain from the ^From
line.

  $domain_count{'com'} = 254517;
  $domain_count{'net'} = 20019;
  $domain_count{'org'} = 119010;
  $domain_count{'cn'} = 149;
  $domain_count{'ca'} = 121;
  $domain_count{'jp'} = 211;
  $domain_count{'de'} = 4364;
  $domain_count{'tw'} = 115;
  $domain_count{'ws'} = 39;
  $domain_count{'ru'} = 85;
  $domain_count{'br'} = 199;

  $domain_spam{'com'} = 907;
  $domain_spam{'net'} = 176;
  $domain_spam{'org'} = 31;
  $domain_spam{'cn'} = 36;
  $domain_spam{'ca'} = 15;
  $domain_spam{'jp'} = 22;
  $domain_spam{'de'} = 25;
  $domain_spam{'tw'} = 23;
  $domain_spam{'ws'} = 26;
  $domain_spam{'ru'} = 20;
  $domain_spam{'br'} = 15;

Then you calculate spam percentages:

  ws  = 66.67 %
  cn  = 24.16 %
  ru  = 23.53 %
  tw  = 20.00 %
  ca  = 12.40 %
  jp  = 10.43 %
  br  = 07.54 %
  net = 00.88 %
  de  = 00.57 %
  com = 00.36 %
  org = 00.03 %
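
For anyone who wants to reproduce the table, here is a small sketch of
the calculation in Python (not SpamAssassin code).  The counts are the
ones listed above; the function name spam_percentages is made up for the
example:

```python
domain_count = {
    "com": 254517, "net": 20019, "org": 119010, "cn": 149, "ca": 121,
    "jp": 211, "de": 4364, "tw": 115, "ws": 39, "ru": 85, "br": 199,
}
domain_spam = {
    "com": 907, "net": 176, "org": 31, "cn": 36, "ca": 15,
    "jp": 22, "de": 25, "tw": 23, "ws": 26, "ru": 20, "br": 15,
}


def spam_percentages(count, spam):
    """Return (domain, percent spam) pairs, worst offenders first."""
    rates = {
        domain: 100.0 * spam.get(domain, 0) / total
        for domain, total in count.items()
        if total > 0  # skip values never seen, to avoid dividing by zero
    }
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


ranking = spam_percentages(domain_count, domain_spam)
for domain, pct in ranking:
    print(f"{domain:<3} = {pct:05.2f} %")
```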

Domains with the worst percentages (at the top) get penalized.  Yes,
the majority of my spam is .com, but you have to look at the percentages
to derive any meaningful information.  (BTW, my spam percentages are
probably worse than they appear here, since my spam folder was
half-heartedly hand-filtered prior to SA.)

If someone wants to go gung-ho, they can add a rule to penalize *any*
uncommon domain/encoding/etc., regardless of the number of spam
messages.  In fact, I think if the GA (genetic algorithm) is determining
the score, we could safely have that rule by default.
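
A rough sketch of such an "uncommon value" test, in Python rather than
as a SpamAssassin rule.  The 0.1% cutoff and the function names are
assumptions picked for illustration; the counts are the per-domain
totals from above:

```python
domain_count = {
    "com": 254517, "net": 20019, "org": 119010, "cn": 149, "ca": 121,
    "jp": 211, "de": 4364, "tw": 115, "ws": 39, "ru": 85, "br": 199,
}


def is_uncommon(value, count, min_share=0.001):
    """True if value accounts for less than min_share of all messages.

    Spam counts are deliberately ignored here: only rarity matters.
    A value that has never been seen counts as uncommon.
    """
    total = sum(count.values())
    if total == 0:
        return False  # no history yet, so nothing can be called uncommon
    return count.get(value, 0) / total < min_share


def uncommon_values(count, min_share=0.001):
    """Set of all recorded values that fall under the cutoff."""
    return {value for value in count if is_uncommon(value, count, min_share)}


print(sorted(uncommon_values(domain_count)))
```

With this cutoff the rule would fire for the small country-code domains
but not for com, net, org or de.  How much it is worth would be left to
whatever assigns the scores.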

Regarding the initialization problem, we can add an option to make the
regular auto-whitelist read-only (I already put in a feature request
for this), and then you can mass-process a sufficient amount of your
past email without any negative impact.
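
As a sketch of what that mass processing might look like for the domain
counters, the following Python (again, not SpamAssassin code) seeds
counts from the mbox "From " separator lines of archived mail.  The
function names and the sample lines are hypothetical, and it assumes
spam and non-spam are already sorted into separate folders:

```python
import re
from collections import Counter

# mbox separator line, e.g. "From user@example.com Tue Mar  5 12:37:02 2002"
FROM_LINE = re.compile(r"^From \S+@([A-Za-z0-9.-]+)")


def tld_of(line):
    """Return the lowercased top-level domain from a "From " line, or None."""
    match = FROM_LINE.match(line)
    if not match:
        return None
    host = match.group(1).rstrip(".")
    if "." not in host:
        return None
    return host.rsplit(".", 1)[1].lower()


def seed_counts(ham_lines, spam_lines):
    """Build (total, spam) counters from two folders' worth of lines."""
    total, spam = Counter(), Counter()
    for lines, is_spam in ((ham_lines, False), (spam_lines, True)):
        for line in lines:
            tld = tld_of(line)
            if tld is None:
                continue  # not a separator line, or no usable domain
            total[tld] += 1
            if is_spam:
                spam[tld] += 1
    return total, spam


# Made-up sample lines standing in for two mail folders.
ham = ["From alice@example.org Tue Mar  5 12:00:00 2002", "Subject: hi"]
junk = ["From promo@example.ws Tue Mar  5 12:01:00 2002"]
total, spam = seed_counts(ham, junk)
```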

- Dan

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
