I write:

>> I don't think I've ever received a UTF-8 Korean spam,

dman <[EMAIL PROTECTED]> writes:
 
> That's why someone needs to convert the characters to ks_c_5601-1987
> and euc-kr for SA's tests.

Most of the spam is coming through as 8-bit ks_c_5601-1987.  That's
what the test should look for (and it's what I just checked into the
CVS tree) ... because it works.

I agree with Matt that SA should eventually decode all QP and base64
text into 8-bit characters so that SA tests can always test 8-bit data.
In practice, though, that just means a bit more spam will match the
KOREAN_UCE_SUBJECT test.
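
To make the decoding step concrete, here is a rough sketch in Python
(not SA code, and not how SA does or will do it). It replaces RFC 2047
encoded-words in a header with their raw 8-bit bytes and does no charset
conversion. The name decode_to_8bit is made up for illustration.

```python
import base64
import quopri
import re

# RFC 2047 encoded-word: =?charset?B-or-Q?payload?=
ENCODED_WORD = re.compile(rb"=\?([^?]+)\?([bBqQ])\?([^?]*)\?=")


def decode_to_8bit(header: bytes) -> bytes:
    """Replace each encoded-word with its raw 8-bit bytes.

    Hypothetical helper: the charset label is ignored on purpose, so the
    result is still in whatever encoding the sender used.
    """
    def _decode(match):
        encoding = match.group(2).upper()
        payload = match.group(3)
        if encoding == b"B":
            return base64.b64decode(payload)
        # In the Q encoding, "_" stands for a space.
        return quopri.decodestring(payload.replace(b"_", b" "))

    return ENCODED_WORD.sub(_decode, header)
```

After this step a byte-oriented test sees the same 8-bit data whether
the header arrived raw, as base64, or as QP.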

Whether we should embark on converting between encodings (ks_c_5601-1987
to utf8, etc.) is another question.  That is a much more complicated
problem and probably only useful for language-specific word tests for
languages that are sent in more than one encoding.
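
For a sense of what conversion would involve, here is a small Python
sketch, again not SA code. It assumes the charset label is known and
correct, and it assumes ks_c_5601-1987 can be treated as EUC-KR for
decoding. The table and the name to_utf8 are hypothetical.

```python
# Assumed mapping from charset labels seen in mail to Python codec names.
CODEC_FOR_CHARSET = {
    "ks_c_5601-1987": "euc_kr",
    "euc-kr": "euc_kr",
}


def to_utf8(raw: bytes, charset: str) -> bytes:
    """Re-encode 8-bit text from the labelled charset into UTF-8.

    Bytes that are invalid in the source charset become U+FFFD instead
    of raising an error, since spam is often mislabelled.
    """
    codec = CODEC_FOR_CHARSET.get(charset.lower(), charset)
    return raw.decode(codec, errors="replace").encode("utf-8")
```

With everything in one encoding, a language-specific word test would
need only one byte pattern per word. The hard part this sketch skips is
mail with a missing or wrong charset label.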

> It would be nice if I could junk other foreign-language messages too
> since I can't read them.  That's a little harder to detect (ie when
> they are iso8859-1 or utf-8).

A solution is pending.  Take a look at:

  http://bugzilla.spamassassin.org/show_bug.cgi?id=293

Dan

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk