There was recently a discussion on the "charset normalization" feature (see e.g.
http://markmail.org/message/hvdtbca6lm5tsjtm?q=list:org.apache.spamassassin.users+date:200901+&page=42)
I ran a simple check on the results that the Encode::Detect::Detector facility yields.
I manually selected a set of 39 messages in Russian, mostly spam (those that
were not MIME-encoded, so I could see the contents just by pressing F3 in mc):
32 in KOI8-R encoding, 6 in CP-1251 and 1 (ham) in UTF-8. After that I ran a
simple script that feeds the message body to Encode::Detect::Detector::detect
(see the sketch after the list below), and got the following:
- among the 6 CP-1251 messages, 1 was detected as Mac-Cyrillic (which might be
pardonable when preparing texts for humans, since these encodings differ in
only 2 letters, but it may negatively affect text analysis results) and 1 was
not recognized at all (Encode::Detect::Detector::detect returned "undef");
- among the 32 KOI8-R messages, 3 were detected as CP-1255 (Hebrew);
- 1 UTF-8 message was detected correctly.
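For reference, the check boiled down to something like the script below (a
rough sketch only: the header/body split is naive and the file handling is
simplified, but the Encode::Detect::Detector::detect call is the one I used):

    #!/usr/bin/perl
    # Slurp each message named on the command line, cut off the headers
    # at the first blank line, and print whatever charset
    # Encode::Detect::Detector guesses for the body.
    use strict;
    use warnings;
    use Encode::Detect::Detector;

    for my $file (@ARGV) {
        open my $fh, '<', $file or do { warn "cannot open $file: $!\n"; next };
        my $msg = do { local $/; <$fh> };          # read the whole message
        close $fh;

        my ($body) = $msg =~ /\r?\n\r?\n(.*)\z/s;  # everything after the headers
        $body = $msg unless defined $body;         # no blank line - take it all

        my $charset = Encode::Detect::Detector::detect($body);
        printf "%s: %s\n", $file, defined $charset ? $charset : 'undef';
    }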
Of course, this set is by no means representative, but it illustrates the
possible drawbacks of using the "normalize_charset" option.
Strictly speaking, such a result is to be expected, because the tricks widely
used by spammers (replacing Cyrillic letters with similar-looking Latin ones,
replacing digits with letters that look like digits and vice versa, adding
random letter sequences to poison Bayes, etc.) are bound to affect the
detection result.
And despite that, SA ignores the "charset=" parameter of the "Content-Type:"
header field. So my question is: is this just due to a shortage of developer
time, or are there reasons for avoiding the charset indicated in the header
field as the source charset for normalization?
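Just to make clear what I mean by using the declared charset: something along
the lines of the sketch below (a rough outline only, not a patch against SA;
to_utf8 and the header parsing are made up for the example):

    # Try the charset= value from Content-Type first; fall back to
    # autodetection only if it is missing or does not fit the data.
    use strict;
    use warnings;
    use Encode qw(decode);
    use Encode::Detect::Detector;

    sub to_utf8 {
        my ($body, $content_type) = @_;   # raw body + Content-Type value

        my $declared;
        if (defined $content_type
            and $content_type =~ /charset\s*=\s*"?([\w.-]+)"?/i) {
            $declared = $1;
        }

        if ($declared) {
            my $copy = $body;             # decode() with FB_CROAK may eat its input
            my $text = eval { decode($declared, $copy, Encode::FB_CROAK) };
            return $text if defined $text;    # the declared charset worked
        }

        if (my $guessed = Encode::Detect::Detector::detect($body)) {
            my $text = eval { decode($guessed, $body) };
            return $text if defined $text;    # detection as a fallback only
        }

        return $body;                     # give up, return the raw bytes
    }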