A possible workaround is to stem search terms that contain wildcards manually and build a new search string from the result. A search for hersen* would then be rewritten to hers* and return what you expect. The downside, of course, is that you match more than you asked for.
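A rough sketch of that rewrite step, in plain Java without any Lucene dependency. The class name WildcardStemRewriter and the toyStem method are hypothetical and only for illustration; toyStem is a crude suffix stripper standing in for whatever stemmer your analyzer really applies, and the rewrite only handles simple whitespace-separated terms with a single trailing *.

```java
import java.util.Locale;
import java.util.StringJoiner;
import java.util.function.UnaryOperator;

// Hypothetical helper, not part of Lucene: rewrites "prefix*" terms so that
// the prefix is stemmed the same way the indexed tokens were.
final class WildcardStemRewriter {

    // Toy stand-in for a real stemmer (assumption): lower-cases the term and
    // strips one of a few German-looking suffixes. Swap in the stemmer that
    // your analyzer actually uses.
    static String toyStem(String term) {
        String t = term.toLowerCase(Locale.ROOT);
        for (String suffix : new String[] {"ener", "en", "er", "e"}) {
            if (t.length() > suffix.length() + 2 && t.endsWith(suffix)) {
                return t.substring(0, t.length() - suffix.length());
            }
        }
        return t;
    }

    // Stems the prefix of every token that ends in a single trailing '*'.
    // All other tokens are passed through unchanged, so the query parser and
    // analyzer can treat them as usual.
    static String rewrite(String query, UnaryOperator<String> stemmer) {
        StringJoiner out = new StringJoiner(" ");
        for (String token : query.trim().split("\\s+")) {
            out.add(isTrailingWildcard(token)
                    ? stemmer.apply(token.substring(0, token.length() - 1)) + "*"
                    : token);
        }
        return out.toString();
    }

    // True for tokens like "hersen*"; false for a bare "*" or for tokens
    // that contain further wildcards inside the prefix.
    private static boolean isTrailingWildcard(String token) {
        if (token.length() < 2 || !token.endsWith("*")) {
            return false;
        }
        String prefix = token.substring(0, token.length() - 1);
        return prefix.indexOf('*') < 0 && prefix.indexOf('?') < 0;
    }

    public static void main(String[] args) {
        // With the toy stemmer, "hersen*" becomes "hers*".
        System.out.println(rewrite("hersen*", WildcardStemRewriter::toyStem));
    }
}
```

The rewritten string is then handed to the query parser as before. Because the stemmed prefix is shorter than what the user typed, hers* will also match terms that hersen* would not, which is the trade-off mentioned above.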
Lars-Erik

> -----Original Message-----
> From: Bayer Dennis [mailto:dennis.ba...@cursor.de]
> Sent: Tuesday, December 11, 2012 10:50 AM
> To: java-user@lucene.apache.org
> Subject: Stemming and Wildcard - or fire and water
>
> Hello there,
> my colleague and I ran into an example which didn't return the result
> size we were expecting. We discovered that there is a mismatch in how
> terms are handled while indexing and while searching. This issue has
> already been discussed several times on the internet, as we found out
> later, but from our point of view it is buggy behavior, at least when
> using a German stemmer.
>
> Tl;dr: a JUnit test case is available (http://pastebin.com/AdeFdW1k)
>
> Setup:
> * Lucene 4.0.0
> * Use the GermanAnalyzer, which internally uses a GermanStemmer
>
> Issue:
> * Create an index for "Hersener", which has a common ending in German
>   -> the string is shortened to "hers"
> * Search for "Hers" -> a result is found
> * Search for "Hersen" -> a result is found because the input token is
>   also stemmed to "hers"
> * Search for "Hers*" -> a result is found
> * Search for "Hersen*" -> nothing is found because the analyzer does
>   not run
>
> Similar examples can be constructed easily if umlauts are involved.
>
> Conclusion:
> A search query which contains a wildcard should also be run through
> the analyzer, because otherwise there are a lot of queries which
> return nothing. The Lucene FAQ already has a topic related to this issue:
> http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F
>
> The example with "dog" and "dogs" works as long as only one character
> is stemmed, which may be true for the majority of cases in English. But
> if more characters are involved, Lucene returns nothing instead of
> returning a few additional items. Just consider "families", which is
> stemmed to "famili".
> Searching for "familie*" wouldn't return any item.
>
> To find an ending for this initial post ;) :
> Could this behavior be made configurable in the standard? If not:
> a) Why are the stemmers used by default if they can lead to wrong results?
> b) What can be done manually to stem queries containing wildcards, e.g.
>    overriding some parser?
>
> Best regards
> Dennis