Re: AW: Best practices for multiple languages?

Paul Libbrecht Wed, 19 Jan 2011 14:21:50 -0800

On 19 Jan 2011, at 20:56, Bill Janssen wrote:

> Paul Libbrecht <p...@hoplahup.net> wrote:
> 
>> So you are only indexing "analyzed" and querying "analyzed". Is that correct?
> 
> Yes, that's correct.  I fall back to StandardAnalyzer if no
> language-specific analyzer is available.


> 
>> Wouldn't it be better to prefer precise matches (a field that is
>> analyzed with StandardAnalyzer, for example) but also allow matches
>> that are stemmed?
> 
> StandardAnalyzer isn't quite precise, is it?  StandardFilter does some
> kind of English-centric alterations to things.

From this page:
http://lucene.apache.org/java/2_9_1/api/core/org/apache/lucene/analysis/standard/StandardTokenizer.html

I can only conclude that it handles the variety of characters correctly but 
does no stemming.
The default constructor of StandardAnalyzer comes with a set of stop words, 
but these are easy to disable.


I think it's quite precise, and certainly a lot more precise than removing the 
"aux" of "chevaux"!

> Perhaps the approach you suggest would be slightly better, but I'd have
> to see numbers on that from some reasonable corpus to be convinced it
> would be worth it.

I am not sure I have such numbers.
I made several changes of this sort, and the precision and recall measures 
improved, particularly when the language indication was wrong, which happened 
to be very common in our authoring environment.
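
To illustrate why an exact field helps when the language indication is wrong, 
here is a minimal sketch in plain Python. It is not Lucene code: the suffix 
lists are toy stand-ins for real language-specific stemmers, and the names 
index_doc and score, as well as the boosts 2.0 and 1.0, are made up for the 
example. In Lucene terms it would roughly correspond to indexing each document 
into two fields (one unstemmed, one stemmed) and querying both with a higher 
boost on the unstemmed one.

```python
# Toy suffix strippers standing in for real stemmers (hypothetical, illustration only).
SUFFIXES = {"fr": ("aux", "es", "s"), "en": ("ing", "es", "s")}


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for an unstemmed analyzer."""
    return text.lower().split()


def stem(token, lang):
    """Strip the first matching suffix for the given language, if any."""
    for suffix in SUFFIXES.get(lang, ()):
        if token.endswith(suffix) and len(token) > len(suffix) + 1:
            return token[: -len(suffix)]
    return token


def index_doc(text, lang):
    """Index the same text twice: once as-is, once stemmed for the declared language."""
    tokens = tokenize(text)
    return {"exact": set(tokens), "stemmed": {stem(t, lang) for t in tokens}}


def score(query, query_lang, doc, exact_boost=2.0, stemmed_boost=1.0):
    """Sum the boosts of every query token that hits the exact or stemmed field."""
    total = 0.0
    for token in tokenize(query):
        if token in doc["exact"]:
            total += exact_boost
        if stem(token, query_lang) in doc["stemmed"]:
            total += stemmed_boost
    return total


# The same French text, once tagged correctly and once mis-tagged as English.
doc_ok = index_doc("les chevaux blancs", "fr")
doc_mistagged = index_doc("les chevaux blancs", "en")

print(score("chevaux", "fr", doc_ok))                          # 3.0: both fields match
print(score("chevaux", "fr", doc_mistagged))                   # 2.0: exact field still matches
print(score("chevaux", "fr", doc_mistagged, exact_boost=0.0))  # 0.0: stemmed-only search misses it
```

With the stemmed field alone, the mis-tagged document is lost, because the 
index-time and query-time stems disagree; the exact field keeps it findable and 
ranks precise matches first.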

paul