Re: TextCat - Language help

Charles Sprickman Thu, 25 Jun 2015 22:35:56 -0700

Henrik K <h...@hege.li> wrote:

> On Thu, Jun 25, 2015 at 09:37:44AM +0300, Henrik K wrote:
>> On Wed, Jun 24, 2015 at 07:37:28PM -0400, Charles Sprickman wrote:
>>> On Jun 22, 2015, at 5:21 PM, Marc Selig <a29508-spamassas...@sedacon.com> 
>>> wrote:
>>> 
>>>> On Mon, Jun 22, 2015 at 05:09:45PM -0400, Charles Sprickman wrote:
>>>> 
>>>>> Are there any other options for filtering based on language, or any known
>>>>> patches/fixes for TextCat to make it a bit less aggressive when it runs
>>>>> across gibberish that is probably not any particular language?
>>>> 
>>>> You could tinker with textcat_acceptable_score.  Increasing it slightly
>>>> (e.g. back to the old default of 1.05) seems to reduce those wild guesses.
>>> 
>>> I don't quite follow what exactly this does, the explanation seems a bit 
>>> circular:
>>> 
>>> textcat_acceptable_score N (default: 1.05)
>>> "Include any language that scores at least textcat_acceptable_score in the 
>>> returned list of languages"
>>> 
>>> I'm bumping it up to see what happens, I'm also lowering 
>>> "textcat_max_languages" to 3.  How can I get more info about what this 
>>> plugin is doing into the headers?
>> 
>> The scoring is a bit vague yes.. basically 1.02 means that compared to the
>> "best result" (a vague ngram number) we only accept other results within 2%
>> of that.  If score produces more results than textcat_max_languages then
>> everything is ignored.
>> 
>> I'm going to add some header tags to trunk code soon, it will look 
>> something like this:
>> 
>> Jun 25 09:33:12.670 [30140] dbg: check: tagrun - tag TEXTCAT_RESULTS is now
>> ready, value: fi:96985(1.00) ro:112950(1.16) sv:113567(1.17) it:115650(1.19)
>> da:115656(1.19) fr:116506(1.20) af:117089(1.21) sr.us-ascii:117205(1.21)
>> sk.us-ascii:118124(1.22) en:118174(1.22) ms:118208(1.22)
>> hr.us-ascii:118639(1.22) id:119112(1.23) ca:119196(1.23) pt:119960(1.24)
>> hu:119986(1.24) sq:120081(1.24) nl:120105(1.24) es:120199(1.24)
>> no:120804(1.25)
>> 
>> Here you see the "ngram result" and percentile (score), "fi" is a clear
>> winner.  For sane results 1.02-1.05 score is good range.  You can reduce
>> max_languages to 1-2 if you want even more confidence.
> 
> Committed, if anyone wants to debug things, just replace current version
> with this.  Also added some hopefully clarifying things in the
> documentation.
> 
> http://svn.apache.org/repos/asf/spamassassin/trunk/lib/Mail/SpamAssassin/Plugin/TextCat.pm
> 
> You can add_header all Textcat _TEXTCATRESULTS_ or grep it from debug
> output.


Excellent, thanks so much for taking the time to do this.

I’m running the patch on one box and I’ve added the debug output to my own 
userprefs:

X-Spam-Languages: en
X-Spam-Scores: test-scores=DCC_CHECK=1.373,GTUBE=1000,NO_RECEIVED=-0.001,
        NO_RELAYS=-0.001
X-Spam-Textcat: en:104672(1.00) da:119123(1.14) ro:120823(1.15)
        fr:121487(1.16) nl:121492(1.16) af:121497(1.16) de:121606(1.16)
        ca:121909(1.16) sv:122529(1.17) pt:123547(1.18) es:123565(1.18)
        it:123847(1.18) no:125928(1.20) ms:126042(1.20) id:126454(1.21)
        sk.us-ascii:126635(1.21) hu:127719(1.22) sq:128756(1.23)
        cs.iso-8859-2:130009(1.24) fi:130055(1.24)

This should be very helpful in tuning things going forward.
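
For anyone who wants to experiment with thresholds offline, here is a rough Python sketch of the acceptance rule as Henrik describes it above: keep every language whose ngram result is within textcat_acceptable_score of the best one, and ignore everything if more than textcat_max_languages pass. This is not the plugin's code. The function names, the regex, and the "<=" comparison are assumptions made for illustration; only the header format is taken from the output shown above.

```python
import re

# Matches entries such as "en:104672(1.00)" or "sk.us-ascii:126635(1.21)".
# The pattern is an assumption based on the header output quoted above.
ENTRY = re.compile(r"([^\s:]+):(\d+)\((\d+(?:\.\d+)?)\)")


def parse_textcat(value):
    """Return [(language, ngram_result, ratio), ...] in header order."""
    return [(lang, int(raw), float(ratio))
            for lang, raw, ratio in ENTRY.findall(value)]


def accepted_languages(entries, acceptable_score=1.05, max_languages=3):
    """Apply the rule described in the thread (hypothetical re-implementation).

    A language passes if its ngram result is no more than acceptable_score
    times the best (lowest) result. If more than max_languages pass, the
    whole result is ignored and an empty list is returned.
    """
    if not entries:
        return []
    best = min(raw for _, raw, _ in entries)
    passing = [lang for lang, raw, _ in entries
               if raw <= best * acceptable_score]
    return passing if len(passing) <= max_languages else []


if __name__ == "__main__":
    header = ("en:104672(1.00) da:119123(1.14) ro:120823(1.15) "
              "fr:121487(1.16) nl:121492(1.16)")
    entries = parse_textcat(header)
    print(accepted_languages(entries))  # ['en']
    # A looser score lets en, da and ro through; with a limit of 2 that is
    # too many, so everything is ignored.
    print(accepted_languages(entries, acceptable_score=1.16, max_languages=2))  # []
```

Feeding it the X-Spam-Textcat value from a message makes it easy to see how far the runner-up languages are from the winner before changing the real settings.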

Thanks,


Charles
