It looks like WDF strips the 's (STEM_ENGLISH_POSSESSIVE flag) and does not include it in the end offset.

I'm not sure this is a bug, in that it seems OK to highlight the token minus its attached English possessive? It could be that it was originally by design? E.g. you can see it here: http://jirasearch.mikemccandless.com/search.py?text=Python ... scroll down a bit and you'll see a Python's occurrence, with only Python highlighted.

But then, if you use the dedicated EnglishPossessiveFilter, it would leave the offsets as you want (including the 's); so that's different behavior.

Maybe open an issue for discussion about what the approach should be?

Mike McCandless

http://blog.mikemccandless.com

On Tue, Dec 6, 2016 at 6:27 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:

> Hello - I noticed something peculiar running Lucene/Solr 6.3.0.
>
> The plural vaccinatieprogramma's should have a startOffset of 0 and an
> endOffset of 21 when passed through WordDelimiterFilter and/or stemmers, but
> it doesn't, slightly messing up highlighted terms.
>
> wdf = new WordDelimiterFilter(new CannedTokenStream(new
>     Token("vaccinatieprogramma's", 0, 21)), DEFAULT_WORD_DELIM_TABLE, flags,
>     null);
> assertTokenStreamContents(wdf,
>     new String[] { "vaccinatieprogramma"},
>     new int[] { 0 },
>     new int[] { 21 });
>
> [junit4] Suite: org.apache.lucene.analysis.miscellaneous.TestWordDelimiterFilter
> [junit4]   2> NOTE: reproduce with: ant test -Dtestcase=TestWordDelimiterFilter -Dtests.method=testOffsets -Dtests.seed=21AB10650E10CEB9 -Dtests.slow=true -Dtests.locale=bg-BG -Dtests.timezone=Etc/GMT+10 -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
> [junit4] FAILURE 0.06s | TestWordDelimiterFilter.testOffsets <<<
> [junit4]    > Throwable #1: java.lang.AssertionError: endOffset 0 expected:<21> but was:<19>
>
> I would expect the same behaviour as with stemmers, where the offsets
> always cover the full length of the original term. So if a user queries for
> a singular term, the whole plural (original) is highlighted.
>
> Am I missing something? Bug?
>
> Thanks,
> Markus
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
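
To make the two offset behaviours discussed in this thread concrete, here is a minimal plain-Java sketch. It is a toy model and not Lucene code: the class PossessiveOffsetSketch, the Tok type and the methods stripAndShrink, stripAndKeep and highlight are hypothetical names invented for illustration. It only assumes what the thread reports, namely that WDF ends the token at offset 19, while a stemmer-like filter would keep the original end offset of 21.

```java
public class PossessiveOffsetSketch {

    /** Hypothetical token: term text plus the [start, end) span it claims in the input. */
    static final class Tok {
        final String term;
        final int start;
        final int end;

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Strips a trailing 's and shrinks the end offset to match (the end offset of 19 seen in the failing test). */
    static Tok stripAndShrink(Tok in) {
        if (!in.term.endsWith("'s")) {
            return in;
        }
        String stem = in.term.substring(0, in.term.length() - 2);
        return new Tok(stem, in.start, in.start + stem.length());
    }

    /** Strips a trailing 's but keeps the original offsets (the stemmer-like behaviour the test expected). */
    static Tok stripAndKeep(Tok in) {
        if (!in.term.endsWith("'s")) {
            return in;
        }
        String stem = in.term.substring(0, in.term.length() - 2);
        return new Tok(stem, in.start, in.end);
    }

    /** Marks the token's span in the original text with square brackets, like a highlighter would. */
    static String highlight(String text, Tok t) {
        return text.substring(0, t.start)
                + "[" + text.substring(t.start, t.end) + "]"
                + text.substring(t.end);
    }

    public static void main(String[] args) {
        String text = "vaccinatieprogramma's";
        Tok original = new Tok(text, 0, text.length());

        // End offset shrinks with the term: the 's falls outside the highlight.
        System.out.println(highlight(text, stripAndShrink(original)));
        // End offset stays at the original length: the whole plural is highlighted.
        System.out.println(highlight(text, stripAndKeep(original)));
    }
}
```

Both variants index the same term, vaccinatieprogramma; they differ only in which span of the original text a highlighter would mark, which is the design question raised above.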