I'm basically finished adapting TextCat, an open source language guesser, for use in SA. Thanks to the upstream author, it is now licensed under the same terms as Perl. At this point, I'm looking for testing help and comments.

- 76 different languages are currently recognized.
- The level of accuracy is good.
- Does the right thing if it can't guess or narrow down to a single guess.

Performance
-----------

The test takes some time, but it's easy to deactivate by setting
ok_languages to "all". For speed testing, I used a small corpus of
24 messages (155k total).

  language test inactive = 35.3 seconds
  language test active   = 43.8 seconds

The speed is independent of language.

My original version (the upstream TextCat simply converted into a
module) required 49.3 seconds. So, the overhead of language guessing
the 24 messages went from 14.0 seconds to 5.5 seconds and you can still
turn it off. Also, the disk usage went down from 312k to 108k for the
language models (when installed).

Settings
--------

I assigned a score of 2.0 to the rule, but that may be too conservative.

To get it working, you basically set "ok_languages" in your user
preferences and that's it. The patch includes appropriate changes to
the SA documentation.

The default setting is "english". Incidentally, I think "all" would be
a better default for both ok_languages and ok_locales, but I decided to
follow the English precedent set by ok_locales. Maybe I'll send another
patch to that effect, but please, let's not debate it again here.

Other changes
-------------

- added an "all" option for ok_locales (easier way to deactivate the
  test than changing the score, faster evaluation too)
- fixed bug in get_my_locales() where an undef could be pushed into
  @locales

Download
--------

I didn't want to send a 140k email, so you'll have to download it. The
patch (textcat.patch) and the language models (lm.tar.gz, unpack it from
the top level of the spamassassin CVS tree) are located at:

  http://www.pathname.com/~quinlan/software/spamassassin/

Possible TODO
-------------

One thing that may eventually need to be added is language models for
any common spam languages and/or character sets that are still missing.

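For anyone testing who hasn't looked at TextCat before, here is a rough
Python sketch of the general technique it is based on: rank the most
frequent character n-grams in a text and compare those ranks against
per-language profiles. This is not the code in the patch (which is
Perl). The function names, the profile size, the 1.05 ratio and the
candidate limit are all illustrative assumptions, and the toy models are
built from single sentences rather than real language model files.

```python
from collections import Counter


def ngram_profile(text, max_n=5, top=400):
    """Return character n-grams (length 1..max_n), most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are stable.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:top]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the model get a fixed penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else penalty
        for r, gram in enumerate(doc_profile)
    )


def guess_languages(text, models, ratio=1.05, max_candidates=5):
    """Return every language scoring within `ratio` of the best match.

    An empty list means "no guess": either there was nothing to score,
    or too many languages were about equally plausible.
    """
    doc = ngram_profile(text)
    if not doc or not models:
        return []
    scores = sorted((out_of_place(doc, prof), name) for name, prof in models.items())
    best = scores[0][0]
    candidates = [name for score, name in scores if score <= best * ratio]
    return candidates if len(candidates) <= max_candidates else []


# Toy models built from one sentence each (hypothetical sample data).
models = {
    "english": ngram_profile("the quick brown fox jumps over the lazy dog"),
    "german": ngram_profile("der schnelle braune fuchs springt über den faulen hund"),
}
print(guess_languages("the dog jumps over the fox", models))
```

Returning a list rather than a single answer is what lets a caller treat
"several close candidates" and "no usable guess" differently from a
confident single match.
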
Dan