Gunnlaugur Thor Briem created SOLR-4851:
-------------------------------------------

             Summary: Highlighter duplicates numeric token in snippet when term vectors/positions/offsets on
                 Key: SOLR-4851
                 URL: https://issues.apache.org/jira/browse/SOLR-4851
             Project: Solr
          Issue Type: Bug
          Components: highlighter
    Affects Versions: 3.6.2
            Reporter: Gunnlaugur Thor Briem


With original text {{Population 5.000 - 9.999}} indexed with termVectors, 
termPositions and termOffsets, the Highlighter produces snippets like 
{{Population 5<em class="match">5.000</em> - 9.999}} for a query of {{5000}}. 
Note the duplicated {{5}} before the {{<em}}; that's the bug.
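
One way to confirm that a snippet is corrupted, rather than merely highlighted, is to strip the highlight markers and compare what remains with the stored text. The sketch below is only an illustration: the helper names {{strip_markers}} and {{is_corrupted}} are hypothetical, and the marker strings are the ones visible in the snippet above.

```python
# Highlight markers as they appear in the snippet from this report.
PRE = '<em class="match">'
POST = '</em>'


def strip_markers(snippet, pre=PRE, post=POST):
    """Remove the highlight markers, leaving the text the highlighter emitted."""
    return snippet.replace(pre, "").replace(post, "")


def is_corrupted(original, snippet):
    """True if the de-highlighted snippet is not a substring of the original."""
    return strip_markers(snippet) not in original


original = "Population 5.000 - 9.999"
bad = 'Population 5<em class="match">5.000</em> - 9.999'
good = 'Population <em class="match">5.000</em> - 9.999'

print(is_corrupted(original, bad))   # True: stripped text is "Population 55.000 - 9.999"
print(is_corrupted(original, good))  # False
```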

This does not happen when {{hl.useFastVectorHighlighter=true}}.

It also does not happen in a field without termVectors, termPositions and 
termOffsets.

To reproduce, use these field definitions:

{code:xml}
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                generateNumberParts="1" catenateWords="1" catenateNumbers="1"
                catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
                protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                generateNumberParts="1" catenateWords="0" catenateNumbers="0"
                catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
                protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

    ...

    <field name="name" type="text" indexed="true" stored="true" />
    <field name="descr" type="text" indexed="true" stored="true"
           termVectors="true" termOffsets="true" termPositions="true" />
{code}

All configured and explicit parameters, from {{echoParams=all}}:

{code:javascript}
{
  "defType": "edismax",
  "echoParams": "all",
  "facet.mincount": "1",
  "fl": "id",
  "hl.fl": "id name tag cat descr dim dimvalue provider source_source text",
  "hl.fragsize": "200",
  "hl.mergeContiguous": "true",
  "hl.simple.post": "</em>",
  "hl.simple.pre": "<em class=\"match\">",
  "hl.snippets": "4",
  "hl.usePhraseHighlighter": "true",
  "hl": "true",
  "q.alt": "*:*",
  "q": "5000",
  "qf": " id_a^10.0 name^6 granularity_a^5 tag^4 cat^3 descr^3 dim^2 dimvalue^2 provider^2 source_source^2 text^2 ",
  "qt": "dismax",
  "rows": "10",
  "sort": "score desc"
}
{code}

Then index a piece of text containing numbers with thousands separators, e.g. 
“Demographics and income: Income distribution: Number of HHs earning &gt; 
US$5,000 p.a. (constant 2005 prices) by country”
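
For anyone trying to reproduce this, a request with the highlighting parameters above could be assembled roughly as follows. This is a sketch under assumptions: the base URL is a placeholder that does not come from this report, {{build_url}} is a hypothetical helper, {{qf}} and {{hl.fl}} are trimmed to the two fields compared here, and the FastVectorHighlighter switch is sent as {{hl.useFastVectorHighlighter}} so both highlighters can be compared.

```python
from urllib.parse import urlencode, parse_qs

# Assumed endpoint: host, port and path are placeholders, not from the report.
BASE_URL = "http://localhost:8983/solr/select"


def build_url(use_fvh=False):
    """Build a query URL with the highlighting parameters from the report."""
    params = {
        "defType": "edismax",
        "q": "5000",
        "qf": "name^6 descr^3",   # trimmed: only the two fields compared here
        "fl": "id",
        "hl": "true",
        "hl.fl": "name descr",    # trimmed likewise
        "hl.fragsize": "200",
        "hl.snippets": "4",
        "hl.mergeContiguous": "true",
        "hl.usePhraseHighlighter": "true",
        "hl.simple.pre": '<em class="match">',
        "hl.simple.post": "</em>",
        "wt": "json",             # added for convenience, not in the report
    }
    if use_fvh:
        # Per the report, the duplication does not occur with this switch on.
        params["hl.useFastVectorHighlighter"] = "true"
    return BASE_URL + "?" + urlencode(params)


print(build_url())
print(build_url(use_fvh=True))
```

Fetching both URLs against an index that holds the sample text and comparing the {{descr}} snippets should show whether the duplication is specific to the default Highlighter.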

The highlighting response I get:

{code:javascript}
{
  "name": [
    "Demographics and income: Income distribution: Number of HHs earning &gt; US$<em class=\"match\">5,000</em> p.a. (constant 2005 prices) by country"
  ],
  "descr": [
    "Number of households with disposable income of more than US$5<em class=\"match\">5,000</em> per annum at constant 2005 prices"
  ]
}
{code}

Note that the {{5}} gets duplicated only in the {{descr}} field snippet, not in 
the {{name}} field snippet. The only difference between these fields is 
termVectors, termPositions and termOffsets, so those settings are presumably 
relevant.
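
When the stored text is not at hand, a rough heuristic for spotting affected fields in a highlighting response is to look for a character that appears both immediately before the opening marker and as the first highlighted character. The sketch below assumes the marker string used in this report; {{suspicious_fields}} is a hypothetical helper, and the check can produce false positives on text that legitimately repeats a character at that position.

```python
import re

PRE = '<em class="match">'
# A character, then the opening marker, then the same character again.
DUP = re.compile(r"(.)" + re.escape(PRE) + r"\1")


def suspicious_fields(highlighting):
    """Return the fields with a snippet repeating a character across the marker."""
    return [
        field
        for field, snippets in highlighting.items()
        if any(DUP.search(s) for s in snippets)
    ]


# The two snippets from the highlighting response in this report.
response = {
    "name": [
        'Demographics and income: Income distribution: Number of HHs earning '
        '&gt; US$<em class="match">5,000</em> p.a. (constant 2005 prices) by country'
    ],
    "descr": [
        'Number of households with disposable income of more than '
        'US$5<em class="match">5,000</em> per annum at constant 2005 prices'
    ],
}

print(suspicious_fields(response))  # ['descr']
```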

