Re: Current best-practices around normalize_charset?

Jay A. Sekora Fri, 14 Mar 2014 12:06:49 -0700

On Wed, 2014-03-12 at 19:04 -0700, Ivo Truxa wrote:

> Your message is a few months old, but I see no answer, and stumbled upon it
> when writing an enhanced version of the normalize_charset feature, so
> thought that I could perhaps help.


Thanks!  I'm glad to hear of your experiences.

> [R]egardless whether
> you use normalizing or not, as long as you need to match non-ASCII patterns,
> you need to write rules also in Unicode anyway, because you cannot reject
> Unicode messages.

Indeed!  And even if you only want to accept messages in English (or
some other ASCII-supported language), nowadays it's not at all uncommon
for messages to have dingbats or printer's quotation marks in them -- or
one of your correspondents might be sitting at a relative's computer or
in an internet cafe somewhere and the subject line might get the Chinese
equivalent of "Re:" prepended to it, or the body might have a disclaimer
in French appended.

> Another possibility may be normalizing, instead to UTF, to plain 7bit
> US-ASCII. The currently proposed patch for ASCII normalizing transliterates
> also non-Latin alphabets. The patch was proposed to the dev list, so
> impatient and courageous users might want to try it on a non-production
> server, but be warned that it is not any official code (at least not now),
> and currently very little tested.

Interesting idea!  I searched in the spamassassin-dev archives but I
don't think I found the right patch; could you point me at it?

How do you handle non-alphabetic scripts (like CJK, where a character
may have multiple pronunciations both within and between languages)?
Seems like just normalizing them to U+NNNN might be better than trying
to transcribe them.  (And that would let a brave or foolhardy mail
administrator write rules to match patterns seen in, say,
Chinese-language spam even without knowing Chinese, or even without
knowing what language the spam was in.)
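
To make the U+NNNN idea concrete, here is a minimal sketch in Python. It is not SpamAssassin code and not the patch under discussion; the function name and the sample rule are hypothetical, made up purely for illustration. It assumes the text has already been decoded to Unicode.

```python
import re

def to_codepoint_tokens(text: str) -> str:
    """Replace each non-ASCII character with a U+NNNN token (hypothetical helper)."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)                   # ASCII passes through unchanged
        else:
            out.append("U+%04X" % ord(ch))   # at least four uppercase hex digits
    return "".join(out)

# A subject line with a two-character Chinese prefix and an ASCII colon.
subject = "\u56de\u590d: hello"
normalized = to_codepoint_tokens(subject)

# A pure-ASCII pattern can now match the prefix without the rule author
# being able to read or type the original characters.
REPLY_PREFIX = re.compile(r"^U\+56DEU\+590D:")
matched = REPLY_PREFIX.search(normalized) is not None
```

One caveat with bare U+NNNN tokens: a token followed directly by ASCII hex digits (or a code point above U+FFFF) becomes ambiguous, so a real implementation would probably want a delimiter around each token.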

Anyway, glad to hear that normalize_charset hasn't been causing you
problems.  For us, normalizing to UTF-8 is almost certainly what we
want, if it's reasonably safe.

Jay

-- 
Jay Sekora
Linux system administrator and postmaster,
The Infrastructure Group
MIT Computer Science and Artificial Intelligence Laboratory
