DQ> Do you accept GPL modules?

It's hard, since the GPL is incompatible with the Artistic license.  I think a
lot of people who use SA are presently extending it in ways which are
compatible with the SA license but not with the GPL (they don't want to release
source back, or they want to make their own local rule changes), and some of
those people would get quite uncomfortable with GPL'd bits in the package.
It's also unclear to me what the implications of that would be -- would "GPL
cancer" eat the whole project the minute a piece of it uses GPL code?  Or can
we use a GPL module without having to GPL all of SA, as long as we don't touch
the module's internals and just call out to it?  I'm not a license expert (and
certainly not religious).  Of course, maybe the original author could be
convinced to offer an Artistic license on his work, in which case the problem
would magically go away.

DQ> I'm working on adapting TextCat, a language guesser, for use in SA, but

Sounds very interesting.
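
For anyone on the list who hasn't looked at it: as far as I know, TextCat is
based on Cavnar & Trenkle's n-gram ranking approach -- build a ranked list of
the most frequent character n-grams in the text, compare it against a stored
ranked profile per language, and pick the language whose profile is least "out
of place".  Very roughly, something like this (a sketch only, with made-up
names; this is not TextCat's actual code):

  use strict;
  use warnings;

  # Build a ranked "fingerprint" of the $max most frequent 1..5-character
  # n-grams in the text.
  sub ngram_profile {
      my ($text, $max) = @_;
      my %freq;
      foreach my $word (split /\W+/, lc $text) {
          next unless length $word;
          my $padded = "_${word}_";
          for my $n (1 .. 5) {
              for my $i (0 .. length($padded) - $n) {
                  $freq{ substr($padded, $i, $n) }++;
              }
          }
      }
      my @ranked = sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq;
      $#ranked = $max - 1 if @ranked > $max;
      return \@ranked;
  }

  # Distance = how far "out of place" each message n-gram is relative to a
  # language's stored profile; the language with the lowest total wins.
  sub out_of_place {
      my ($msg, $lang) = @_;
      my %rank;
      $rank{ $lang->[$_] } = $_ for 0 .. $#$lang;
      my $dist = 0;
      for my $i (0 .. $#$msg) {
          $dist += exists $rank{ $msg->[$i] }
                 ? abs($i - $rank{ $msg->[$i] })
                 : scalar @$lang;        # maximum penalty for unseen n-grams
      }
      return $dist;
  }

The per-language profiles are the same kind of fingerprint, pre-computed once
per language from a training corpus.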

DQ> The interface is a new user preference:
DQ>
DQ>   ok_languages  english spanish  # english and spanish are okay
DQ>   ok_languages  all              # do not check language

DQ> which feeds into this rule:
DQ>
DQ>   full UNDESIRED_LANGUAGE_BODY            eval:check_language()
DQ>   describe UNDESIRED_LANGUAGE_BODY        language of body is not ok
DQ>   score UNDESIRED_LANGUAGE_BODY           2.0

Such a module would actually have implications beyond just adding a
language-is-valid rule alongside the locale-is-valid check that currently
exists.  Locale is also currently used to determine which rules with "lang xx"
prefixes are considered by the engine.  We could switch "lang xx" selection to
be based on the language of the actual email rather than on whatever locale
happens to be set on the host SA is running on.
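
For instance, a rule written as (hypothetical rule name, and I'm paraphrasing
the config syntax from memory):

  lang de describe GERMAN_ONLY_TEST  Only relevant for German-language mail
  lang de body     GERMAN_ONLY_TEST  /.../

currently only gets considered when the scanning host's locale is German; with
a language guesser it could instead be applied whenever the message itself
looks German.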

DQ> Sometimes, especially for shorter strings, TextCat can't narrow down the
DQ> language to a single match and will return multiple possible languages.
DQ> In that case, the rule just checks the entire set of possible languages
DQ> against each ok_language.

I imagine this would probably happen more frequently in email than in "normal"
text, since emails tend to use abbreviations, weird characters, shorthand,
slang, etc. more than most document formats.
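
The way I'd picture the pass/fail logic in that case (just a sketch, not
Daniel's actual code -- the function name and arguments here are made up): the
rule should only fire when none of TextCat's candidate languages appears in
ok_languages, and never when ok_languages is set to "all".

  # Sketch: return 1 (rule hits) only when no candidate language is okay.
  sub language_not_ok {
      my ($ok_languages, @candidates) = @_;
      my %ok = map { lc $_ => 1 } split ' ', $ok_languages;
      return 0 if $ok{all};              # "all" disables the check entirely
      foreach my $lang (@candidates) {
          return 0 if $ok{ lc $lang };   # at least one guess is acceptable
      }
      return 1;                          # no acceptable guess: rule fires
  }

That errs on the side of not flagging a message whenever the guess is
ambiguous, which seems like the right default.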

DQ> Does it work?  I tested 500 hand-filtered messages (collected from my
DQ> pre-SA days) with ok_locales set to "en" and ok_languages set to
DQ> "english".  Here are the number of matches for these rules:

DQ>  17 = UNDESIRED_LANGUAGE_BODY

That's not bad.  I think using the language guess for "lang xx" selection of
which rules to apply would also increase the usefulness of those rules.  I have
some questions though:

1. What is the overhead of the language analyzer?  How fast does it run over a
typical message?
2. What is the footprint in disk/memory consumption?  Does it have to load a
dictionary per language in order to be able to ID those languages?  That could
be a heavy load to add for many SA users.
3. The tests you've done would be far more interesting with a more
international set of sample messages, and with ok_locales != "en" and
ok_languages != "english".  Any European volunteers to try this out on their
mailboxes?  I'm guessing there's more multilingual mail there, and probably
less difference between languages.  I'm betting those 17 messages are very much
not English (Korean/Chinese/Russian/Spanish?) and are more easily distinguished
than, say, French from Italian.


DQ> So, yes, it seems to work much better than the locale testing, but it
DQ> doesn't quite replace it.  It does seem like we could dump the
DQ> CHARSET_FARAWAY_BODY rule.  UNDESIRED_LANGUAGE_BODY works much better
DQ> anyway.

Not necessarily dump it.  Leave 'em both in there and let the GA sort them out.

C

