[ https://issues.apache.org/jira/browse/LUCENE-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14768911#comment-14768911 ]
Shawn Heisey commented on LUCENE-6689:
--------------------------------------
I chose the latter workaround: removing PatternReplaceFilterFactory (PRFF) from every analysis chain that also uses WordDelimiterFilter (WDF).
> Odd analysis problem with WDF, appears to be triggered by preceding analysis
> components
> ---------------------------------------------------------------------------------------
>
> Key: LUCENE-6689
> URL: https://issues.apache.org/jira/browse/LUCENE-6689
> Project: Lucene - Core
> Issue Type: Bug
> Affects Versions: 4.8
> Reporter: Shawn Heisey
>
> This problem shows up for me in Solr, but I believe the issue is down at the
> Lucene level, so I've opened the issue in the LUCENE project. We can move it
> if necessary.
> I've boiled the problem down to this minimum Solr fieldType:
> {noformat}
> <fieldType name="testType" class="solr.TextField"
>     sortMissingLast="true" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer
>         class="org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory"
>         rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
>     <filter class="solr.PatternReplaceFilterFactory"
>         pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
>         replacement="$2"
>     />
>     <filter class="solr.WordDelimiterFilterFactory"
>         splitOnCaseChange="1"
>         splitOnNumerics="1"
>         stemEnglishPossessive="1"
>         generateWordParts="1"
>         generateNumberParts="1"
>         catenateWords="1"
>         catenateNumbers="1"
>         catenateAll="0"
>         preserveOriginal="1"
>     />
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer
>         class="org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory"
>         rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
>     <filter class="solr.PatternReplaceFilterFactory"
>         pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
>         replacement="$2"
>     />
>     <filter class="solr.WordDelimiterFilterFactory"
>         splitOnCaseChange="1"
>         splitOnNumerics="1"
>         stemEnglishPossessive="1"
>         generateWordParts="1"
>         generateNumberParts="1"
>         catenateWords="0"
>         catenateNumbers="0"
>         catenateAll="0"
>         preserveOriginal="0"
>     />
>   </analyzer>
> </fieldType>
> {noformat}
> On Solr 4.7, if this type is given the input "aaa-bbb: ccc" then index
> analysis puts aaa at term position 1 and bbb at term position 2. This seems
> perfectly reasonable to me. In Solr 4.9, both terms end up at position 2.
> This causes phrase queries which used to work to return zero hits. The exact
> text of the phrase query is in the original documents that match on 4.7.
> If the custom rbbi (which is included unmodified from the lucene icu analysis
> source code) is not used, then the problem doesn't happen, because the
> punctuation doesn't make it to the PRF. If the PatternReplaceFilterFactory
> is not present, then the problem doesn't happen.
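
For readers unfamiliar with the pattern in the fieldType, here is a rough Python sketch of what the PatternReplaceFilterFactory regex does to a single token. It is an illustration only, not Lucene code. Assumptions: the tokenizer is approximated by a plain whitespace split (going by the rule file name), and Java's \p{Punct} is approximated with Python's string.punctuation, since Python's re module has no \p{...} classes. The function names are made up for this example.

```python
import re
import string

# Stand-in for Java's \p{Punct} (ASCII punctuation); this is an assumption,
# because Python's re module does not support \p{...} character classes.
PUNCT = "[" + re.escape(string.punctuation) + "]"

# Same shape as the pattern in the fieldType:
#   ^(\p{Punct}*)(.*?)(\p{Punct}*)$   with replacement $2
PATTERN = re.compile(r"^(" + PUNCT + r"*)(.*?)(" + PUNCT + r"*)$")


def strip_outer_punct(token):
    """Keep only group 2: the token minus leading and trailing punctuation."""
    return PATTERN.sub(r"\2", token)


def analyze(text):
    """Hypothetical helper: whitespace split, then the pattern replacement."""
    return [strip_outer_punct(tok) for tok in text.split()]


if __name__ == "__main__":
    print(analyze("aaa-bbb: ccc"))
```

Under these assumptions the trailing colon is removed but the inner hyphen survives, so the token that reaches WordDelimiterFilterFactory is "aaa-bbb".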
> I can work around the problem by setting luceneMatchVersion to 4.7, but I
> think the behavior is a bug, and I would rather not continue to use 4.7
> analysis when I upgrade to 5.x, which I hope to do soon.
> Whether luceneMatchVersion is LUCENE_47 or LUCENE_4_9, query analysis puts
> aaa at term position 1 and bbb at term position 2.