> From: Jason Bertoch
> Sent: Wednesday, May 26, 2010 3:34 PM
> On 2010/05/25 7:02 PM, Karsten Bräckelmann wrote:
> > On Wed, 2010-05-26 at 10:35 +1200, Jason Haar wrote:
> >
> > Not as far as ok_locales and the respective CHARSET_FARAWAY rules are
> > concerned, IIRC. They have been written long ago to trigger on the
> > char-sets used. They don't detect the char-set based on the actual
> > payload.
> >
>
> So where does that leave us? With the need for an update or addition to
> the FARAWAY rules? Also, what's the deal with normalize_charset? Can
> that have any impact on these cases where language/locale isn't
> detected?

Jason, I may be completely wrong, but this is what I get when grepping for
'normalize_charset' in 3.3.1:

  Util/DependencyInfo.pm: desc => 'If you plan to use the normalize_charset config setting to detect
  Conf.pm:=item normalize_charset ( 0 | 1) (default: 0)
  Conf.pm: setting => 'normalize_charset',
  Conf.pm: $self->{parser}->lint_warn("config: normalize_charset requires Perl 5.8.5 or later");
  Conf.pm: $self->{parser}->lint_warn("config: normalize_charset requires HTML::Parser 3.46 or later");
  Conf.pm: $self->{parser}->lint_warn("config: normalize_charset requires Encode::Detect");
  Conf.pm: $self->{parser}->lint_warn("config: normalize_charset requires Encode");
  Conf.pm: $self->{normalize_charset} = 1;

As you can see, {normalize_charset} can be set. But where is it actually
used, then? It may be used in a way I can't catch with grep, though...

Anyway, according to perldoc, normalize_charset would "allow detecting the
character set" used in the text content (which I believe is what you are
looking for) and then convert the text to Unicode.

To me, though, encoding detection is probably not much of an issue here,
because a wrong encoding declared in the content's header would impair the
readability of the spam text for the recipient, which is counterproductive
for spammers. So the encoding actually used is probably almost always the one
declared in the header, and you can use that right now to score mail with
foreign encodings.

I don't believe this is going to make much difference anyway, since nowadays
most legit mail *and* spam are moving toward UTF-8 (which is probably the
encoding used in the sample you supplied). You would end up with a less than
useful rule.
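
To make the "score on the declared charset" idea concrete, here is a rough
sketch in Python. It is not SpamAssassin code and not how the CHARSET_FARAWAY
rules are implemented; it only shows reading the charset each text part
declares. The EXPECTED_CHARSETS set, the function names and the sample
message are made up for illustration.

```python
from email import message_from_bytes

# Hypothetical allow-list of charsets we expect to receive.
EXPECTED_CHARSETS = {"us-ascii", "utf-8", "iso-8859-1", "iso-8859-15"}


def declared_charsets(raw):
    """Return the lowercased charsets declared by the text parts of a message."""
    msg = message_from_bytes(raw)
    found = set()
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        # get_content_charset() returns None when no charset is declared.
        charset = part.get_content_charset()
        if charset:
            found.add(charset)
    return found


def unexpected_charsets(raw):
    """Return the declared charsets that are not in the allow-list."""
    return declared_charsets(raw) - EXPECTED_CHARSETS


# Made-up sample message declaring a Cyrillic charset.
SAMPLE = (
    b"Subject: test\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=\"KOI8-R\"\r\n"
    b"\r\n"
    b"body\r\n"
)

if __name__ == "__main__":
    print(unexpected_charsets(SAMPLE))
```

Note that a message declaring UTF-8 yields nothing unexpected here, whatever
language it is written in, which is exactly why such a check loses value as
mail moves to UTF-8.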

Instead, you may want to guess the *language* used. TextCat is the answer if
that is what you are looking for. But please note that its algorithm takes a
statistical approach to the language detection problem: it often reports a
text as being in more than one language, especially when the sample is too
short and/or too "polluted" with foreign or (intentionally) misspelled words.
Regards,
Giampaolo