Hello there,
my colleague and I ran into an example which did not return the number of 
results we were expecting. We discovered that there is a mismatch in how terms 
are handled while indexing and searching. As we found out later, this issue has 
already been discussed several times on the internet, but from our point of 
view it is buggy behavior, at least when a German stemmer is used.

TL;DR: a JUnit test case is available (http://pastebin.com/AdeFdW1k)

Setup:
* Lucene 4.0.0
* Use the GermanAnalyzer which internally uses a GermanStemmer

Issue:
* Index "Hersener", which has a common German ending -> the string is 
shortened to "hers"
* Search for "Hers" -> a result is found
* Search for "Hersen" -> a result is found because the input token is also 
stemmed to "hers"
* Search for "Hers*" -> a result is found
* Search for "Hersen*" -> nothing is found because the analyzer is not applied 
to the wildcard query
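
To make the mechanism visible without any Lucene dependency, here is a minimal 
plain-Java sketch of the four searches above. The class WildcardStemDemo and 
its toyStem method are hypothetical names invented for this illustration. 
toyStem is NOT Lucene's GermanStemmer; its suffix rule is chosen only so that 
"Hersener" and "Hersen" both reduce to "hers". The sketch also assumes, as our 
test suggests, that term queries are analyzed while prefix queries are only 
lowercased.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WildcardStemDemo {
    // Terms as they end up in the toy index: already stemmed.
    private final List<String> indexedTerms = new ArrayList<>();

    // Hypothetical stand-in for a stemmer, NOT Lucene's GermanStemmer:
    // lowercases, then strips a trailing "er"/"en" while more than
    // four characters remain.
    public static String toyStem(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        while (t.length() > 4 && (t.endsWith("er") || t.endsWith("en"))) {
            t = t.substring(0, t.length() - 2);
        }
        return t;
    }

    // Index time: the token is stemmed before it is stored.
    public void index(String token) {
        indexedTerms.add(toyStem(token));
    }

    // Term query: the query text is stemmed exactly like the indexed text.
    public boolean termQuery(String text) {
        return indexedTerms.contains(toyStem(text));
    }

    // Prefix query ("text*"): assumed to be lowercased only, never stemmed.
    public boolean prefixQuery(String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        for (String term : indexedTerms) {
            if (term.startsWith(p)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        WildcardStemDemo demo = new WildcardStemDemo();
        demo.index("Hersener");                          // stored as "hers"
        System.out.println(demo.termQuery("Hers"));      // true
        System.out.println(demo.termQuery("Hersen"));    // true
        System.out.println(demo.prefixQuery("Hers"));    // true
        System.out.println(demo.prefixQuery("Hersen"));  // false
    }
}
```

The last call fails because the unstemmed prefix "hersen" is longer than the 
stored term "hers", so no indexed term can start with it.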

Similar examples can be constructed easily if umlauts are involved.

Conclusion:
A search query which contains a wildcard should also be run through the 
analyzer, because otherwise many queries return nothing. The Lucene FAQ 
already has a topic related to this issue: 
http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F
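
As a rough sketch of what we have in mind, the following plain-Java snippet 
compares a prefix match that only lowercases the prefix with one that runs the 
prefix through the same stemming as the indexed text. Everything here is an 
assumption made for illustration: AnalyzedPrefixDemo, toyStem, rawPrefixMatch 
and analyzedPrefixMatch are hypothetical names, and toyStem is a toy rule, not 
Lucene's GermanStemmer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class AnalyzedPrefixDemo {
    // Hypothetical toy stemmer, NOT Lucene's GermanStemmer: lowercases, then
    // strips a trailing "er"/"en" while more than four characters remain.
    public static String toyStem(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        while (t.length() > 4 && (t.endsWith("er") || t.endsWith("en"))) {
            t = t.substring(0, t.length() - 2);
        }
        return t;
    }

    // Behavior as we observed it: the prefix is lowercased but not stemmed.
    public static boolean rawPrefixMatch(List<String> indexedTerms, String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        return indexedTerms.stream().anyMatch(term -> term.startsWith(p));
    }

    // Behavior we are asking for: the prefix is stemmed like indexed text
    // before it is compared against the stored terms.
    public static boolean analyzedPrefixMatch(List<String> indexedTerms, String prefix) {
        String p = toyStem(prefix);
        return indexedTerms.stream().anyMatch(term -> term.startsWith(p));
    }

    public static void main(String[] args) {
        // "Hersener" is stored in its stemmed form "hers".
        List<String> indexedTerms = Arrays.asList(toyStem("Hersener"));

        System.out.println(rawPrefixMatch(indexedTerms, "Hersen"));      // false
        System.out.println(analyzedPrefixMatch(indexedTerms, "Hersen")); // true
        System.out.println(analyzedPrefixMatch(indexedTerms, "Hers"));   // true
    }
}
```

We are aware that this would not catch every case: a partial word does not 
always stem to the same root as the full word. In this toy model the prefix 
"Hersene" stays "hersene" and still matches nothing.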

The example with "dog" and "dogs" works as long as only one character is 
stemmed, which may be true for the majority of English words. But if more 
characters are involved, Lucene returns nothing at all instead of returning a 
few additional items. Just consider "families", which is stemmed to "famili": 
searching for "familie*" would return no items.

To find an ending for this initial post ;) :
Could this behavior be made configurable in the standard? If not:
a) Why are the stemmers used by default if they can lead to wrong results?
b) What can be done manually to stem queries containing wildcards, e.g. by 
overriding some parser?

Best regards
Dennis



