I'm basically finished adapting TextCat, an open source language
guesser, for use in SA.  Thanks to the upstream author, it is now
licensed under the same terms as Perl.  At this point, I'm looking for
testing help and comments.

  - 76 different languages are currently recognized.
  - The level of accuracy is good.
  - Does the right thing if it can't guess or narrow down to a single guess.
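
For readers unfamiliar with how this kind of guesser works: TextCat is based on ranked character n-gram profiles (the Cavnar and Trenkle approach). The Python sketch below illustrates the idea only and is not the code in the patch. The function names, the 1.05 closeness ratio, the candidate limit, and the toy two-language models are all assumptions made for the example.

```python
from collections import Counter

def ngram_profile(text, max_n=3, top=300):
    """Return the most frequent character n-grams (1..max_n), best first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:top]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the model get a fixed penalty."""
    rank = {g: i for i, g in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(i - rank[g]) if g in rank else penalty
               for i, g in enumerate(doc_profile))

def guess_languages(text, models, ratio=1.05, max_candidates=3):
    """Return the closest languages, or [] when the text cannot be narrowed down.

    ratio and max_candidates are assumed values for this sketch.
    """
    if not models or not text.strip():
        return []
    doc = ngram_profile(text)
    scores = sorted((out_of_place(doc, model), name) for name, model in models.items())
    best = scores[0][0]
    close = [name for score, name in scores if score <= best * ratio]
    return close if len(close) <= max_candidates else []

# Toy models built from one sentence each (hypothetical; real models are
# trained on much larger samples).
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and then the cat",
    "german": "der schnelle braune fuchs springt über den faulen hund und die katze",
}
MODELS = {name: ngram_profile(sample) for name, sample in SAMPLES.items()}
```

Calling guess_languages(text, MODELS) returns a list of candidate names; an empty list stands for "could not decide", which a caller can treat as no verdict at all.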

Performance
-----------

The test takes some time, but it's easy to deactivate by setting
ok_languages to "all".  For speed testing, I used a small corpus of 24
messages (155k total).

  language test inactive = 35.3 seconds
  language test active   = 43.8 seconds

The speed is independent of language.  My original version (the upstream
TextCat simply converted into a module) required 49.3 seconds.  So, the
overhead of language guessing the 24 messages went from 14.0 seconds to
8.5 seconds (a saving of 5.5 seconds) and you can still turn it off.
Also, the disk usage went down from 312k to 108k for the language models
(when installed).
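
The overhead figures follow directly from the three timings above. A small Python check, using only the numbers quoted in this message (the per-message figure is derived here and was not measured separately):

```python
# Timings in seconds for the 24-message corpus, as quoted above.
BASELINE = 35.3   # language test inactive
ADAPTED = 43.8    # language test active (adapted version)
ORIGINAL = 49.3   # upstream TextCat simply converted into a module
MESSAGES = 24

def overhead(total, base=BASELINE, messages=MESSAGES):
    """Return (total overhead, per-message overhead) in seconds."""
    extra = total - base
    return round(extra, 1), round(extra / messages, 3)

adapted_overhead = overhead(ADAPTED)
original_overhead = overhead(ORIGINAL)
saved = round(ORIGINAL - ADAPTED, 1)

print("original overhead:", original_overhead)
print("adapted overhead: ", adapted_overhead)
print("seconds saved:    ", saved)
```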

Settings
--------

I assigned a score of 2.0 to the rule, but that may be too conservative.

To get it working, you basically set "ok_languages" in your user
preferences and that's it.  The patch includes appropriate changes to
the SA documentation.
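
To make the intended behaviour of the setting concrete, here is a rough Python sketch of the decision logic. It is not the Perl code in the patch. The function names are hypothetical, and treating "can't guess" as "don't fire the rule" is my assumption for the example.

```python
def parse_ok_languages(value):
    """Split a whitespace-separated ok_languages value into a lowercase set."""
    return {token.lower() for token in value.split()}

def language_rule_hits(message_text, ok_languages, guess):
    """Return True when the message looks like it is in an unwanted language.

    guess is any callable returning a list of candidate language names,
    or an empty list when it cannot decide (hypothetical interface).
    """
    ok = parse_ok_languages(ok_languages)
    if "all" in ok:
        return False          # test deactivated: skip guessing entirely
    candidates = guess(message_text)
    if not candidates:
        return False          # assumption: no verdict means no penalty
    # Fire only if none of the candidate languages is acceptable.
    return not any(lang in ok for lang in candidates)

# Usage with a stand-in guesser that always answers "french".
def always_french(text):
    return ["french"]

hit_default = language_rule_hits("bonjour", "english", always_french)
hit_all = language_rule_hits("bonjour", "all", always_french)
```

With "all", the guesser is never called, which is why that setting avoids the timing overhead as well as the score.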

The default setting is "english".  Incidentally, I think "all" would be
a better default for both ok_languages and ok_locales, but I decided to
follow the English precedent set by ok_locales.  Maybe I'll send another
patch to that effect, but please, let's not debate it again here.

Other changes
-------------

 - added an "all" option for ok_locales (easier way to deactivate test
   than changing the score, faster evaluation too)
 - fixed bug in get_my_locales() where an undef could be pushed into
   @locales

Download
--------

I didn't want to send a 140k email, so you'll have to download it.

The patch (textcat.patch) and the language models (lm.tar.gz, unpack it
from the top level of the spamassassin CVS tree) are located at:

  http://www.pathname.com/~quinlan/software/spamassassin/

Possible TODO
-------------

Language models for common spam languages and/or character sets that
are still missing may eventually need to be added.

Dan
