This is a well-known problem: wildcard terms cannot be analyzed by the query parser, because analysis would destroy the wildcard characters; stemming only part of a term will never work either. For Solr there is a workaround (the MultiTermAware component), but it is very limited: it only works when all analysis components are MultiTermAware, which stemmers are not.
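
Both points can be illustrated without Lucene. The following is a minimal sketch, not Lucene code: toyStem and toyAnalyze are hypothetical stand-ins for a stemmer and an analyzer (the toy stemmer only lowercases and strips "er"/"en" and is not the GermanStemmer). It shows that running a wildcard term through analysis drops the "*", and that a partial term such as "Herse" does not stem to the same form as the full word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WildcardAnalysisSketch {

    // Hypothetical toy stemmer (NOT Lucene's GermanStemmer): lowercases the
    // token and repeatedly strips "er"/"en" while more than 3 letters remain.
    static String toyStem(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        boolean stripped = true;
        while (stripped) {
            stripped = false;
            for (String suffix : new String[] {"er", "en"}) {
                if (t.endsWith(suffix) && t.length() > suffix.length() + 3) {
                    t = t.substring(0, t.length() - suffix.length());
                    stripped = true;
                }
            }
        }
        return t;
    }

    // Hypothetical toy analyzer: split on anything that is not a letter,
    // then stem each token. The '*' is not a letter, so it is discarded.
    static List<String> toyAnalyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}]+")) {
            if (!raw.isEmpty()) {
                terms.add(toyStem(raw));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(toyAnalyze("Hersener")); // [hers]
        System.out.println(toyAnalyze("Hersen*"));  // [hers]      (wildcard lost)
        System.out.println(toyAnalyze("Hers*en"));  // [hers, en]  (term split in two)
        System.out.println(toyStem("Herse"));       // herse       (partial term, not "hers")
    }
}
```

With this toy stemmer the indexed term for "Hersener" is "hers", so a prefix of "herse" can never match it, although the original word starts with those letters.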
Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Bayer Dennis [mailto:dennis.ba...@cursor.de]
> Sent: Tuesday, December 11, 2012 10:50 AM
> To: java-user@lucene.apache.org
> Subject: Stemming and Wildcard - or fire and water
>
> Hello there,
> my colleague and I ran into an example which didn't return the result size
> which we were expecting. We discovered that there is a mismatch in
> handling terms while indexing and searching. This issue has already been
> discussed several times on the internet, as we found out later on, but in our
> point of view it's buggy behavior, at least when using a German stemmer.
>
> Tl;dr: a JUnit test case is available (http://pastebin.com/AdeFdW1k)
>
> Setup:
> * Lucene 4.0.0
> * Use the GermanAnalyzer which internally uses a GermanStemmer
>
> Issue:
> * Create an index for "Hersener" which has a common ending in German ->
>   the string is shortened to "hers"
> * Search for "Hers" -> a result is found
> * Search for "Hersen" -> a result is found because the input token is also
>   stemmed to "hers"
> * Search for "Hers*" -> a result is found
> * Search for "Hersen*" -> nothing is found because the analyzer does not
>   run
>
> Similar examples can be constructed easily if umlauts are involved.
>
> Conclusion:
> The search query which contains a wildcard should also be run through the
> analyzer, because there are a lot of queries which would return nothing. The
> Lucene FAQ already has a topic related to this issue:
> http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F
>
> The example with "dog" and "dogs" works as long as only one character is
> stemmed - which could be true in English for the majority. But if more
> characters are involved, Lucene does not return anything instead of returning
> a few additional items. Just consider "families" which is stemmed to "famili".
> Searching for "familie*" would not return any item.
>
> To find an ending for this initial post ;) :
> Could this behavior be made configurable in the standard? If not:
> a) Why are the stemmers used by default if they can lead to wrong results?
> b) What can be done manually to stem queries containing wildcards, e.g.
> overriding some parser?
>
> Best regards
> Dennis
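
The four searches from the Issue list, and the idea behind question b), can be reproduced in a small Lucene-free sketch. Everything here is an assumption made for illustration: toyStem is a hypothetical stemmer that only strips "er"/"en" (not the GermanStemmer), the index is a plain set of stemmed terms, and the prefix search lowercases its input so that "Hers*" matches as the post reports. stemmedPrefixSearch shows what stemming the prefix before matching would do; it is not an existing Lucene option.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Set;

public class StemmedPrefixSketch {

    // Hypothetical toy stemmer (NOT Lucene's GermanStemmer): lowercases the
    // token and repeatedly strips "er"/"en" while more than 3 letters remain.
    static String toyStem(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        boolean stripped = true;
        while (stripped) {
            stripped = false;
            for (String suffix : new String[] {"er", "en"}) {
                if (t.endsWith(suffix) && t.length() > suffix.length() + 3) {
                    t = t.substring(0, t.length() - suffix.length());
                    stripped = true;
                }
            }
        }
        return t;
    }

    // Plain term search: the query text is stemmed like the indexed text.
    static boolean termSearch(Set<String> index, String text) {
        return index.contains(toyStem(text));
    }

    // Wildcard search as described in the post: the prefix is only
    // lowercased, never stemmed, and compared against the stemmed terms.
    static boolean prefixSearch(Set<String> index, String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        for (String term : index) {
            if (term.startsWith(p)) {
                return true;
            }
        }
        return false;
    }

    // Sketch of the requested behavior: stem the prefix first, then match.
    static boolean stemmedPrefixSearch(Set<String> index, String prefix) {
        return prefixSearch(index, toyStem(prefix));
    }

    public static void main(String[] args) {
        // Indexing "Hersener" stores only its stem, "hers".
        Set<String> index = Collections.singleton(toyStem("Hersener"));

        System.out.println(termSearch(index, "Hers"));            // true
        System.out.println(termSearch(index, "Hersen"));          // true
        System.out.println(prefixSearch(index, "Hers"));          // true
        System.out.println(prefixSearch(index, "Hersen"));        // false
        System.out.println(stemmedPrefixSearch(index, "Hersen")); // true
        System.out.println(stemmedPrefixSearch(index, "Herse"));  // false
    }
}
```

Stemming the prefix rescues "Hersen*" in this toy setup, but "Herse*" still finds nothing, because a partial word does not reduce to the same stem as the full word. Any real workaround along these lines would have to accept that kind of gap.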