Le 19 janv. 2011 à 20:56, Bill Janssen a écrit : > Paul Libbrecht <p...@hoplahup.net> wrote: > >> So you are only indexing "analyzed" and querying "analyzed". Is that correct? > > Yes, that's correct. I fall back to StandardAnalyzer if no > language-specific analyzer is available.
> >> Wouldn't it be better to prefer precise matches (a field that is >> analyzed with StandardAnalyzer for example) but also allow matches are >> stemmed. > > StandardAnalyzer isn't quite precise, is it? StandardFilter does some > kind of English-centric alterations to things. from here: http://lucene.apache.org/java/2_9_1/api/core/org/apache/lucene/analysis/standard/StandardTokenizer.html I can only conclude that it handles correctly the characters variety but does not stemming. The default constructor of StandardAnalyzer comes with a bunch of stop-words but they are easily deactivatable. I think it's quite precise, and certainly a lot more precise than removing the aux of chevaux! > Perhaps the approach you suggest would be slightly better, but I'd have > to see numbers on that from some reasonable corpus to be convinced it > would be worth it. I am not sure I have these. I did several changes of this sort and the precision and recall measures went better in particular in presence of language-indication failure which happened to be very common in our authoring environment. paul --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org