There was recently a discussion on the "charset normalization" feature (see e.g.
http://markmail.org/message/hvdtbca6lm5tsjtm?q=list:org.apache.spamassassin.users+date:200901+&page=42)
I ran a simple check on the results that the Encode::Detect::Detector facility yields.
I manually selected a set of 39 messages in Russian, mostly spam (those that
were not MIME-encoded, so I could see the contents just by pressing F3 in mc):
32 in KOI8-R encoding, 6 in CP-1251 and 1 (ham) in UTF-8. After that I ran a
simple script that feeds the message body to Encode::Detect::Detector::detect
(see the sketch after the list below), and got the following:
- among the 6 CP-1251 messages, 1 was detected as Mac-Cyrillic (which might be
pardonable when preparing texts for humans, since these encodings differ in
only 2 letters, but it may negatively affect text analysis results) and 1 was
not recognized at all (Encode::Detect::Detector::detect returned "undef");
- among the 32 KOI8-R messages, 3 were detected as CP-1255 (Hebrew);
- 1 UTF-8 message was detected correctly.
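For reference, the check boiled down to something like the script below (a
rough sketch only: the header/body split is naive and the file handling is
simplified, but the Encode::Detect::Detector::detect call is the one I used):

    #!/usr/bin/perl
    # Slurp each message named on the command line, cut off the headers
    # at the first blank line, and print whatever charset
    # Encode::Detect::Detector guesses for the body.
    use strict;
    use warnings;
    use Encode::Detect::Detector;

    for my $file (@ARGV) {
        open my $fh, '<', $file or do { warn "cannot open $file: $!\n"; next };
        my $msg = do { local $/; <$fh> };          # read the whole message
        close $fh;

        my ($body) = $msg =~ /\r?\n\r?\n(.*)\z/s;  # everything after the headers
        $body = $msg unless defined $body;         # no blank line - take it all

        my $charset = Encode::Detect::Detector::detect($body);
        printf "%s: %s\n", $file, defined $charset ? $charset : 'undef';
    }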
Of course, this set is by no means representative, but it illustrates the
possible drawbacks of using the "normalize_charset" option.
Strictly speaking, such a result is to be expected, because the tricks widely
used by spammers (replacing Cyrillic letters with similar-looking Latin ones,
replacing digits with letters that look like digits and vice versa, adding
random letter sequences to poison Bayes, etc.) are bound to affect the
detection result.
And despite that, SA ignores the "charset=" parameter of the "Content-Type:"
header field. So my question is: is this just due to a shortage of developer
time, or are there reasons for avoiding the charset indicated in the header
field as the source charset for normalization?
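Just to make clear what I mean by using the declared charset: something along
the lines of the sketch below (a rough outline only, not a patch against SA;
to_utf8 and the header parsing are made up for the example):

    # Try the charset= value from Content-Type first; fall back to
    # autodetection only if it is missing or does not fit the data.
    use strict;
    use warnings;
    use Encode qw(decode);
    use Encode::Detect::Detector;

    sub to_utf8 {
        my ($body, $content_type) = @_;   # raw body + Content-Type value

        my $declared;
        if (defined $content_type
            and $content_type =~ /charset\s*=\s*"?([\w.-]+)"?/i) {
            $declared = $1;
        }

        if ($declared) {
            my $copy = $body;             # decode() with FB_CROAK may eat its input
            my $text = eval { decode($declared, $copy, Encode::FB_CROAK) };
            return $text if defined $text;    # the declared charset worked
        }

        if (my $guessed = Encode::Detect::Detector::detect($body)) {
            my $text = eval { decode($guessed, $body) };
            return $text if defined $text;    # detection as a fallback only
        }

        return $body;                     # give up, return the raw bytes
    }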