Michael Tokarev <m...@tls.msk.ru> wrote:
> [...]
> More and more email software uses UTF8 encoding nowadays, instead of
> a single-byte encodings like KOI8, WINDOWS1251 and the like above.
> And with UTF8, there's no simple way anymore to detect the language
> actually used.

It is possible to *guess* the language used anyway, e.g. based on "typical
sequences of chars" in the text itself. See
"man Mail::SpamAssassin::Plugin::TextCat": it supports ok_languages lists,
and it should be possible to add a separate score for each guessed language.

> It's worse: for example, thunderbird running with russian as a default
> language will put "charset=koi8-r" even for 100% ascii emails unless
> explicitly told to use ascii charset. "Charset=koi8-r" and 100% ascii
> inside does not contradict with each other since ascii is a subset of
> koi8-r, but obviously does not help to filter those.

Very good point.

-- 
[pl>en: Andrew] Andrzej Adam Filip : a...@onet.eu : a...@xl.wp.pl
Man has made his bedlam; let him lie in it.
  -- Fred Allen
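
To illustrate the "typical sequences of chars" idea, here is a minimal Python
sketch that ranks character n-grams and picks the closest language profile.
It is not TextCat's code: the function names, the two tiny sample texts and
the penalty constant are all made up for illustration, and a real profile
would be built from far more text.

```python
from collections import Counter

TOP = 300              # assumed profile size: keep the 300 most frequent n-grams
MISSING_PENALTY = TOP  # assumed cost for an n-gram absent from a language profile


def ngram_profile(text, n_max=3):
    """Map each frequent character n-gram (length 1..n_max) to its rank."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(TOP)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def distance(doc_profile, lang_profile):
    """Sum of rank differences; n-grams unknown to the language cost the maximum."""
    total = 0
    for gram, rank in doc_profile.items():
        if gram in lang_profile:
            total += abs(rank - lang_profile[gram])
        else:
            total += MISSING_PENALTY
    return total


# Hypothetical training samples, far too small for real use.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is a sample "
          "of the english language used for the test",
    "ru": "это пример текста на русском языке который используется для "
          "проверки и в нем есть типичные сочетания букв",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}


def guess_language(text):
    """Return the language whose profile is closest to the text."""
    doc = ngram_profile(text)
    return min(PROFILES, key=lambda lang: distance(doc, PROFILES[lang]))


print(guess_language("привет, это письмо на русском"))   # ru
print(guess_language("this is the message in english"))  # en
```

Because the guess depends only on the decoded characters, it works the same
whether the message arrived as UTF8 or KOI8, and it does not rely on the
declared charset at all.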