[SAtalk] ok_languages addition

Daniel Quinlan Fri, 19 Apr 2002 19:55:07 -0700

Craig,

Do you accept GPL modules?


I'm working on adapting TextCat, a language guesser, for use in SA, but
its one Perl script (which I converted into a module) and its language
definition files are licensed under the GPL by the upstream author.
Here's the upstream source:

  http://odur.let.rug.nl/~vannoord/TextCat/

TextCat recognizes 69 different languages and seems to be reasonably
accurate.  It solves the problem that the character set of an email is
not always set, or not set accurately.  It also solves the equally
annoying problem that a locale is often valid but used for a language
you don't care about (especially the case for ISO 8859-1).

The interface is a new user preference:

  ok_languages  english spanish  # english and spanish are okay
  ok_languages  all              # do not check language

which feeds into this rule:

  full UNDESIRED_LANGUAGE_BODY            eval:check_language()
  describe UNDESIRED_LANGUAGE_BODY        language of body is not ok
  score UNDESIRED_LANGUAGE_BODY           2.0

If TextCat can't figure out the language, the rule does not match.

Sometimes, especially for shorter strings, TextCat can't narrow down the
language to a single match and will return multiple possible languages.
In that case, the rule checks the entire set of possible languages
against each entry in ok_languages.
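
The matching described above can be sketched in a few lines.  This is a
minimal illustration in Python, not the actual Perl module: the function
name language_rule_fires is hypothetical, and the choice that the rule
fires only when none of the guessed languages appears in ok_languages is
an assumption, not something stated above.

```python
def language_rule_fires(guessed, ok_languages):
    """Return True if the message language looks undesired.

    guessed      -- candidate languages from the guesser (may be empty)
    ok_languages -- names taken from the ok_languages preference
    """
    ok = {name.lower() for name in ok_languages}
    # "all" disables the language check entirely.
    if "all" in ok:
        return False
    # The guesser could not decide on anything: the rule is false.
    if not guessed:
        return False
    # Assumption: one acceptable candidate is enough to pass the message.
    return not any(lang.lower() in ok for lang in guessed)
```

Under that assumption, an ambiguous guess such as ["french", "spanish"]
passes when ok_languages is "english spanish", while a lone "french"
guess triggers the rule.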

Does it work?  I tested 500 hand-filtered messages (collected from my
pre-SA days) with ok_locales set to "en" and ok_languages set to
"english".  Here is the number of matches for each rule:

  7 = CHARSET_FARAWAY
  5 = CHARSET_FARAWAY_HEADERS
  0 = CHARSET_FARAWAY_BODY
 12 = total matched by CHARSET_FARAWAY rules
 17 = UNDESIRED_LANGUAGE_BODY
 21 = total matched by UNDESIRED_LANGUAGE_BODY and CHARSET_FARAWAY rules

So, yes, it seems to work much better than the locale testing, but it
doesn't quite replace it.  It does seem like we could dump the
CHARSET_FARAWAY_BODY rule.  UNDESIRED_LANGUAGE_BODY works much better
anyway.
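
The totals above can be cross-checked with inclusion-exclusion.  This is
only a sketch, assuming that the 12, 17, and 21 figures each count
distinct messages; the variable names are illustrative.

```python
charset_hits = 12    # total matched by CHARSET_FARAWAY rules
language_hits = 17   # matched by UNDESIRED_LANGUAGE_BODY
either_hits = 21     # matched by either kind of rule

# Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|
both = charset_hits + language_hits - either_hits
only_charset = charset_hits - both
only_language = language_hits - both

print(both, only_charset, only_language)  # prints "8 4 9"
```

On that reading, 4 messages were caught only by the CHARSET_FARAWAY
rules, which is why the language rule doesn't quite replace them.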

Dan

