Gunnlaugur Thor Briem created SOLR-4851:
-------------------------------------------
Summary: Highlighter duplicates numeric token in snippet when term vectors/positions/offsets on
Key: SOLR-4851
URL: https://issues.apache.org/jira/browse/SOLR-4851
Project: Solr
Issue Type: Bug
Components: highlighter
Affects Versions: 3.6.2
Reporter: Gunnlaugur Thor Briem
With original text {{Population 5.000 - 9.999}} indexed with termVectors,
termPositions and termOffsets, the Highlighter produces snippets like
{{Population 5<em class="match">5.000</em> - 9.999}} for a query of {{5000}}.
Note the duplicated {{5}} before the {{<em}}; that's the bug.
This does not happen when {{useFastVectorHighlighter=true}}.
It also does not happen in a field without termVectors, termPositions and
termOffsets.
To reproduce, use these field definitions:
{code:xml}
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"
splitOnCaseChange="1" splitOnNumerics="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"
splitOnCaseChange="1" splitOnNumerics="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
...
<field name="name" type="text" indexed="true" stored="true" />
<field name="descr" type="text" indexed="true" stored="true"
termVectors="true" termOffsets="true" termPositions="true" />
{code}
All configured and explicit parameters, from {{echoParams=all}}:
{code:javascript}
{
defType: "edismax",
echoParams: "all",
facet.mincount: "1",
fl: "id",
hl.fl: "id name tag cat descr dim dimvalue provider source_source text",
hl.fragsize: "200",
hl.mergeContiguous: "true",
hl.simple.post: "</em>",
hl.simple.pre: "<em class="match">",
hl.snippets: "4",
hl.usePhraseHighlighter: "true",
hl: "true",
q.alt: "*:*",
q: "5000",
qf: " id_a^10.0 name^6 granularity_a^5 tag^4 cat^3 descr^3 dim^2 dimvalue^2
provider^2 source_source^2 text^2 ",
qt: "dismax",
rows: "10",
sort: "score desc"
}
{code}
Then index a piece of text containing numbers with thousands separators, e.g.
“Demographics and income: Income distribution: Number of HHs earning >
US$5,000 p.a. (constant 2005 prices) by country”
The highlighting response I get:
{code:javascript}
{
name: [
"Demographics and income: Income distribution: Number of HHs earning > US$<em class="match">5,000</em> p.a. (constant 2005 prices) by country"
],
descr: [
"Number of households with disposable income of more than US$5<em class="match">5,000</em> per annum at constant 2005 prices"
]
}
{code}
Note that the {{5}} gets duplicated only in the {{descr}} field snippet, not in
the {{name}} field snippet. The only difference between these fields is
termVectors, termPositions and termOffsets, so those settings are presumably
relevant.
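
To spot affected snippets in bulk, the symptom can be checked mechanically. The following is only a rough sketch, not part of Solr: {{find_duplicated_prefixes}} is a hypothetical helper name, and it assumes the {{hl.simple.pre}} / {{hl.simple.post}} values shown above. It flags a highlight whose leading characters also appear immediately before the opening tag. This is a heuristic and could misfire on text that legitimately repeats a character there.

```python
import re


def find_duplicated_prefixes(snippet, pre='<em class="match">', post='</em>'):
    """Return (duplicated_text, highlighted_text) pairs for one snippet.

    Hypothetical helper for illustration: a pair is reported when the text
    right before the opening tag ends with a prefix of the highlighted text,
    which is the symptom described in this issue.
    """
    hits = []
    # Non-greedy match of everything between the pre and post tags.
    pattern = re.compile(re.escape(pre) + r"(.*?)" + re.escape(post), re.DOTALL)
    for match in pattern.finditer(snippet):
        highlighted = match.group(1)
        before = snippet[:match.start()]
        # Try the longest prefix of the highlighted text first.
        for n in range(len(highlighted), 0, -1):
            if before.endswith(highlighted[:n]):
                hits.append((highlighted[:n], highlighted))
                break
    return hits


if __name__ == "__main__":
    buggy = 'Population 5<em class="match">5.000</em> - 9.999'
    print(find_duplicated_prefixes(buggy))
```

Run against the two snippets above, this should report the duplicated {{5}} for the {{descr}} snippet and nothing for the {{name}} snippet.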