hithighlighter bug

Jason Tue, 09 Jan 2007 18:35:33 -0800

Hi all,

I have come across what I think is a curious but insidious bug in the Java Lucene hit highlighter. I first found this problem in Lucene v1.4, so I updated to the latest versions of Lucene and the highlighter; unfortunately it's still there in v2.0.0.

I am indexing XML documents and am also using the hit highlighter for search results. This works perfectly in almost every case except for one.


In my code I have this:

public class LuceneSearch implements org.apache.lucene.search.highlight.Formatter

{
...
        public String highlightTerm(String originalText , TokenGroup group)
        {
                if(group.getTotalScore()<=0)
                {
                        return originalText;
                }
                return "<em>" + originalText + "</em>";
        }

When I search for -> Acquisition Plan <-
in my search results I get:
<summary>(ancillary stuff deleted)....
attached to the <em>Acquisition</em>
< em>Plan</em>and signed</summary>

Notice the space between the < and the e in the second < em>.

This only occurs for these search terms and for this document (as far as I know), but because it's part of a much larger XML document it breaks the whole thing.

The original XML is unremarkable, with no strange characters surrounding these terms. Here is a snippet from the relevant paragraph from which these highlighted terms come:


-> attached to the Acquisition Plan and signed off<-

Has anyone seen anything like this before? Is this a genuine new bug or something of which the Lucene folk (or at least whoever wrote the highlighter) are aware? Can anyone think of a way to fix this without scanning every element in my result text for rogue spaces?
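
For reference, the brute-force scan I would like to avoid might look roughly like the sketch below. It is only an illustration: the class name EmTagRepair and the method repair are made up for this example and are not part of Lucene, and it assumes the only damage is stray whitespace inside an em tag.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: normalises em tags that have
// picked up stray whitespace, e.g. "< em>" or "< /em>".
public class EmTagRepair {

    // "<", optional whitespace, optional "/", optional whitespace, "em",
    // optional whitespace, ">". Well-formed tags also match and are
    // rewritten to themselves.
    private static final Pattern BROKEN_EM =
            Pattern.compile("<\\s*(/?)\\s*em\\s*>");

    public static String repair(String fragment) {
        // $1 is either "/" (closing tag) or empty (opening tag).
        return BROKEN_EM.matcher(fragment).replaceAll("<$1em>");
    }

    public static void main(String[] args) {
        String broken = "attached to the <em>Acquisition</em>\n"
                + "< em>Plan</em>and signed";
        System.out.println(repair(broken));
    }
}
```

This only papers over the symptom after the fact; it does not explain where the extra space comes from in the first place.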


Thanks in advance
Jason.





