Michael Tokarev <m...@tls.msk.ru> wrote:
> [...]
> More and more email software uses UTF8 encoding nowadays, instead of
> single-byte encodings like KOI8, WINDOWS1251 and the like above.
> And with UTF8, there's no simple way anymore to detect the language
> actually used.

It is still possible to *guess* the language used,
e.g. based on "typical sequences of chars" in the text itself.
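
To illustrate what I mean by "typical sequences of chars", here is a rough
Python sketch that compares character trigrams of a text against small
per-language profiles. The sample sentences, the names SAMPLES, ngrams and
guess_language, and the simple overlap score are all my own assumptions for
illustration; this is not how any particular plugin is implemented, and real
profiles would be built from much larger corpora.

```python
from collections import Counter

# Tiny illustrative training samples (assumption: a real system would
# build its profiles from much larger text collections).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is the "
          "text of the letter that they sent with the other one",
    "ru": "это письмо было отправлено на русском языке и в нем есть "
          "что то про то что не так с этим сообщением",
}


def ngrams(text, n=3):
    """Return a Counter of character n-grams of the normalized text."""
    # Lowercase, collapse whitespace, and pad with one space on each side
    # so that word boundaries contribute n-grams too.
    text = " " + " ".join(text.lower().split()) + " "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


# One trigram profile per language, built once from the samples.
PROFILES = {lang: ngrams(sample) for lang, sample in SAMPLES.items()}


def guess_language(text):
    """Return the language whose profile overlaps most with the text.

    Returns None when no trigram of the text occurs in any profile.
    """
    grams = ngrams(text)
    scores = {
        lang: sum(count for gram, count in grams.items() if gram in profile)
        for lang, profile in PROFILES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Since the input is already decoded text, the guess does not depend on the
declared charset at all, which is the point when everything arrives as UTF8.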

See "man Mail::SpamAssassin::Plugin::TextCat": it supports ok_languages
lists, and it should be possible to add "per specific guessed language" scores.

> It's worse: for example, thunderbird running with russian as a default
> language will put "charset=koi8-r" even for 100% ascii emails unless
> explicitly told to use ascii charset.  "Charset=koi8-r" and 100% ascii
> inside do not contradict each other since ascii is a subset of
> koi8-r, but obviously that does not help to filter those.

Very good point.
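
The mismatch described above can at least be detected mechanically. Below is
a minimal Python sketch, assuming a single-part message; the function name
charset_is_uninformative is hypothetical, and it only uses the standard
library "email" package. It reports messages that declare a non-ASCII charset
while the decoded body is in fact pure ASCII.

```python
from email import message_from_bytes


def charset_is_uninformative(raw_message: bytes) -> bool:
    """True if a non-ASCII charset is declared but the body is pure ASCII.

    Assumption: the message is single-part; multipart messages would need
    a walk over their parts, which this sketch does not attempt.
    """
    msg = message_from_bytes(raw_message)
    # A missing charset parameter is treated as the us-ascii default.
    declared = (msg.get_content_charset() or "us-ascii").lower()
    # decode=True undoes base64/quoted-printable and returns bytes
    # (None for multipart containers, hence the fallback).
    payload = msg.get_payload(decode=True) or b""
    return declared != "us-ascii" and payload.isascii()
```

For such messages the declared charset says nothing about the language, so
only the text itself is left to guess from.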

-- 
[pl>en: Andrew] Andrzej Adam Filip : a...@onet.eu : a...@xl.wp.pl
Man has made his bedlam; let him lie in it.
  -- Fred Allen
