Craig,

Do you accept GPL modules?

I'm working on adapting TextCat, a language guesser, for use in SA, but the one Perl script (which I converted into a module) and the language definition files are licensed under the GPL by the upstream author. Here's the upstream source:

    http://odur.let.rug.nl/~vannoord/TextCat/

TextCat recognizes 69 different languages and seems to be reasonably accurate. It solves the problem that the character set of an email is not always set, or not set accurately. It also solves the equally annoying problem that a locale is often valid but used for a language you don't care about (this is especially the case for ISO 8859-1).

The interface is a new user preference:

    ok_languages english spanish   # english and spanish are okay
    ok_languages all               # do not check language

which feeds into this rule:

    full      UNDESIRED_LANGUAGE_BODY  eval:check_language()
    describe  UNDESIRED_LANGUAGE_BODY  language of body is not ok
    score     UNDESIRED_LANGUAGE_BODY  2.0

If TextCat can't figure out the language, the rule is false. Sometimes, especially for shorter strings, TextCat can't narrow the language down to a single match and returns multiple possible languages. In that case, the rule just checks the entire set of possible languages against each ok_language.

Does it work? I tested 500 hand-filtered messages (collected from my pre-SA days) with ok_locales set to "en" and ok_languages set to "english". Here is the number of matches for these rules:

     7 = CHARSET_FARAWAY
     5 = CHARSET_FARAWAY_HEADERS
     0 = CHARSET_FARAWAY_BODY
    12 = total matched by CHARSET_FARAWAY rules

    17 = UNDESIRED_LANGUAGE_BODY

    21 = total matched by UNDESIRED_LANGUAGE_BODY and CHARSET_FARAWAY rules

So, yes, it seems to work much better than the locale testing, but it doesn't quite replace it. It does seem like we could dump the CHARSET_FARAWAY_BODY rule, since it matched nothing here and UNDESIRED_LANGUAGE_BODY works much better anyway.

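
To make the matching behavior described above concrete, here is a minimal sketch of the decision logic. It is written in Python purely for illustration and is not the actual Perl in the module. The function name `undesired_language` and its arguments are hypothetical, and it assumes that the rule fires only when none of TextCat's candidate languages appears in ok_languages.

```python
from typing import Iterable


def undesired_language(candidates: Iterable[str], ok_languages: Iterable[str]) -> bool:
    """Return True when the UNDESIRED_LANGUAGE_BODY rule should fire.

    candidates   -- languages the guesser returned (may be empty or several)
    ok_languages -- the user's ok_languages preference
    """
    ok = {lang.lower() for lang in ok_languages}
    guessed = {lang.lower() for lang in candidates}

    # "ok_languages all" turns the language check off entirely.
    if "all" in ok:
        return False

    # If the guesser could not determine a language, the rule is false.
    if not guessed:
        return False

    # Assumption: one acceptable candidate is enough to clear the message,
    # so the rule fires only when no candidate is an ok language.
    return guessed.isdisjoint(ok)


if __name__ == "__main__":
    print(undesired_language(["french"], ["english", "spanish"]))
    print(undesired_language(["english", "scots"], ["english"]))
```

With this reading, an ambiguous guess such as "english or scots" does not trigger the rule when english is allowed, which keeps short messages from being penalized just because the guess is uncertain.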
Dan