DQ> Do you accept GPL modules?

It's hard, since the GPL is incompatible with the Artistic license, and I
think there are a lot of people who use SA who are presently extending it
in ways which are compatible with the SA license, but not with the GPL
(they don't want to release source back, or want to make their own local
rule changes).  I think some of those people would get quite uncomfortable
with GPL'd bits in the package.

It's also sort of unclear to me what the implications of that might be --
would "GPL cancer" eat the whole project the minute a piece of it uses GPL
code?  Or can we use a GPL module where we don't touch the internals of the
module, but just call out to it from SA, without having to GPL all of SA?
I'm not a license expert (and certainly not religious).

Of course, maybe the original author could be convinced to offer an
Artistic license on his work; then the problem would magically go away.

DQ> I'm working on adapting TextCat, a language guesser, for use in SA, but

Sounds very interesting.

DQ> The interface is a new user preference:
DQ>
DQ> ok_languages english spanish   # english and spanish are okay
DQ> ok_languages all               # do not check language
DQ>
DQ> which feeds into this rule:
DQ>
DQ> full      UNDESIRED_LANGUAGE_BODY  eval:check_language()
DQ> describe  UNDESIRED_LANGUAGE_BODY  language of body is not ok
DQ> score     UNDESIRED_LANGUAGE_BODY  2.0

Such a module would actually have implications beyond just adding a
language-is-valid rule alongside the locale-is-valid check that currently
exists.  Locale is also currently used to determine which rules with
"lang xx" prefixes are considered by the engine.  We could switch
"lang xx" to be based on the language of the actual email rather than
simply whatever locale happens to be set on the host that SA is running on.

DQ> Sometimes, especially for shorter strings, TextCat can't narrow down the
DQ> language to a single match and will return multiple possible languages.
DQ> In that case, the rule just checks the entire set of possible languages
DQ> against each ok_language.

I imagine this would probably happen more frequently in email than in
"normal" text, since emails tend to use abbreviations, weird characters,
shorthand, slang, etc. more than most document formats.  (I've sketched
further down how I read that multiple-guess check, to make sure I follow.)

DQ> Does it work?  I tested 500 hand-filtered messages (collected from my
DQ> pre-SA days) with ok_locales set to "en" and ok_languages set to
DQ> "english".  Here are the number of matches for these rules:
DQ>
DQ> 17 = UNDESIRED_LANGUAGE_BODY

That's not bad.  I think using it for "lang xx" selection of which rules
to apply would also increase the usefulness of those rules.

I have some questions though:

1. What is the overhead of the language analyzer?  How fast does it run
   over a typical message?

2. What is the footprint in disk/memory consumption?  Does it have to
   load a dictionary per language in order to be able to ID those
   languages?  That could be a heavy load to add for many SA users.

3. The tests you've done would be way more interesting with a more
   international set of sample messages, and with ok_locales != "en" and
   ok_languages != "english".  Any European volunteers to try this out on
   their mailboxes?  I'm guessing there's more multilingual mail, and
   probably less difference between languages, there.  I'm betting those
   17 messages are very much not English (Korean/Chinese/Russian/Spanish?)
   and are more easily distinguished than, say, French from Italian.

DQ> So, yes, it seems to work much better than the locale testing, but it
DQ> doesn't quite replace it.
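
To make sure I'm reading the multiple-guess behaviour right, here's a rough
standalone sketch of how I picture check_language() fitting together with
ok_languages.  guess_languages() is just a made-up stand-in for the TextCat
call, so the names and details here are mine and may not match your actual
code:

  #!/usr/bin/perl
  # Standalone sketch only -- not the real SA eval code.
  use strict;
  use warnings;

  # Hypothetical stand-in for the TextCat classifier; a real version would
  # compare n-gram profiles and return every language within the cutoff,
  # not just the single best match.
  sub guess_languages {
      my ($text) = @_;
      return ('english', 'scots');    # example of an ambiguous result
  }

  # Pass if ok_languages is "all", or if *any* guessed language appears in
  # the user's ok_languages list; otherwise the rule should fire.
  sub check_language {
      my ($ok_languages, $body_text) = @_;
      my @ok = split ' ', lc $ok_languages;
      return 0 if grep { $_ eq 'all' } @ok;      # "all" disables the check

      my %ok = map { $_ => 1 } @ok;
      my @guesses = guess_languages($body_text);
      return 0 if grep { $ok{lc $_} } @guesses;  # at least one guess is okay

      return 1;    # no overlap -> UNDESIRED_LANGUAGE_BODY fires
  }

  # "english spanish" against an ambiguous english/scots guess -> prints 0
  print check_language('english spanish', 'some message text'), "\n";

If that's roughly it, then the same guessed-language list is what we'd want
to drive the "lang xx" rule selection off as well.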
DQ> It does seem like we could dump the CHARSET_FARAWAY_BODY rule.
DQ> UNDESIRED_LANGUAGE_BODY works much better anyway.

Not necessarily dump.  Leave 'em both in there and let the GA sort them out.

C