On Wed, Jun 24, 2015 at 07:37:28PM -0400, Charles Sprickman wrote:
> On Jun 22, 2015, at 5:21 PM, Marc Selig <a29508-spamassas...@sedacon.com> 
> wrote:
> 
> > On Mon, Jun 22, 2015 at 05:09:45PM -0400, Charles Sprickman wrote:
> > 
> >> Are there any other options for filtering based on language, or any known
> >> patches/fixes for TextCat to make it a bit less aggressive when it runs
> >> across gibberish that is probably not any particular language?
> > 
> > You could tinker with textcat_acceptable_score.  Increasing it slightly
> > (e.g. back to the old default of 1.05) seems to reduce those wild guesses.
> 
> I don't quite follow what exactly this does, the explanation seems a bit 
> circular:
> 
> textcat_acceptable_score N (default: 1.05)
> "Include any language that scores at least textcat_acceptable_score in the 
> returned list of languages"
> 
> I'm bumping it up to see what happens, I'm also lowering 
> "textcat_max_languages" to 3.  How can I get more info about what this plugin 
> is doing into the headers?

The scoring is a bit vague, yes. Basically, 1.02 means that we only accept
other results within 2% of the "best result" (a raw ngram number).  If more
languages pass that threshold than textcat_max_languages allows, then
everything is ignored.
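
A rough sketch of that selection logic in Python, just to illustrate the
idea.  This is not the plugin's actual code: the function name
select_languages, the dict input and the numbers are made up, and it assumes
that a lower ngram number means a better match.

```python
def select_languages(ngram_scores, acceptable_score, max_languages):
    """Return accepted language codes, or [] if the guess is too ambiguous.

    Hypothetical helper for illustration only, not part of TextCat.
    """
    if not ngram_scores:
        return []
    # Assumption: the lowest ngram number is the "best result".
    best = min(ngram_scores.values())
    if best <= 0:
        return []
    ranked = sorted(ngram_scores.items(), key=lambda item: item[1])
    # Keep every language whose ratio to the best result is within the limit.
    accepted = [lang for lang, n in ranked if n / best <= acceptable_score]
    # Too many candidates means no clear winner, so ignore all of them.
    if len(accepted) > max_languages:
        return []
    return accepted


# Made-up numbers, not real TextCat output.
scores = {"fi": 1000, "sv": 1015, "da": 1030, "en": 1200}
print(select_languages(scores, acceptable_score=1.02, max_languages=3))  # ['fi', 'sv']
print(select_languages(scores, acceptable_score=1.05, max_languages=2))  # []
```

The second call returns nothing because three languages fall within 5% of
the best result, which is more than the limit of 2.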

I'm going to add some header tags to trunk code soon; the output will look
something like this:

Jun 25 09:33:12.670 [30140] dbg: check: tagrun - tag TEXTCAT_RESULTS is now
ready, value: fi:96985(1.00) ro:112950(1.16) sv:113567(1.17) it:115650(1.19)
da:115656(1.19) fr:116506(1.20) af:117089(1.21) sr.us-ascii:117205(1.21)
sk.us-ascii:118124(1.22) en:118174(1.22) ms:118208(1.22)
hr.us-ascii:118639(1.22) id:119112(1.23) ca:119196(1.23) pt:119960(1.24)
hu:119986(1.24) sq:120081(1.24) nl:120105(1.24) es:120199(1.24)
no:120804(1.25)

Here you see the "ngram result" and, in parentheses, its ratio to the best
result (the score); "fi" is a clear winner.  For sane results, a score of
1.02-1.05 is a good range.  You can reduce textcat_max_languages to 1-2 if
you want even more confidence.
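
If you want to try different thresholds against that debug output, a small
Python sketch along these lines parses the value and redoes the ratio.  The
regex and the function name parse_textcat_results are assumptions based only
on the sample above; the final tag format in trunk may well differ.

```python
import re

# Tag value copied from the debug sample above.
SAMPLE = (
    "fi:96985(1.00) ro:112950(1.16) sv:113567(1.17) it:115650(1.19) "
    "da:115656(1.19) fr:116506(1.20) af:117089(1.21) sr.us-ascii:117205(1.21) "
    "sk.us-ascii:118124(1.22) en:118174(1.22) ms:118208(1.22) "
    "hr.us-ascii:118639(1.22) id:119112(1.23) ca:119196(1.23) pt:119960(1.24) "
    "hu:119986(1.24) sq:120081(1.24) nl:120105(1.24) es:120199(1.24) "
    "no:120804(1.25)"
)

# Assumed entry shape: language:ngram(score)
ENTRY = re.compile(r"([\w.-]+):(\d+)\((\d+\.\d+)\)")


def parse_textcat_results(value):
    """Return a list of (language, ngram_result, score) tuples."""
    return [(lang, int(n), float(score)) for lang, n, score in ENTRY.findall(value)]


results = parse_textcat_results(SAMPLE)
best = min(n for _, n, _ in results)

# Languages whose ngram result is within 5% of the best one.
within = [lang for lang, n, _ in results if n / best <= 1.05]
print(within)  # prints ['fi']
```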
