Hi all,
I have come across what I think is a curious but insidious bug in the Java Lucene hit highlighter. I first found the problem in Lucene v1.4, so I updated to the latest version of Lucene and the highlighter; unfortunately it is still there in v2.0.0.

I am indexing XML documents and am also using the hit highlighter for search results. This works perfectly in almost every case except for one.

In my code I have this:

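public class LuceneSearch implements org.apache.lucene.search.highlight.Formatter
{
...
        // Called by the highlighter for each token group in the fragment.
        public String highlightTerm(String originalText, TokenGroup group)
        {
                // Groups that did not match the query are returned untouched.
                if (group.getTotalScore() <= 0)
                {
                        return originalText;
                }
                return "<em>" + originalText + "</em>";
        }
}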

When I search for -> Acquisition Plan <-
in my search results I get:
<summary>(ancillary stuff deleted)....
attached to the <em>Acquisition</em>
< em>Plan</em>and signed</summary>

Notice the space between the < and the e in the second < em>.
This only occurs for these search terms and for this document (as far as I know), but because it's part of a much larger XML document it breaks the whole thing.

The original XML is unremarkable, with no strange characters surrounding these terms. A snippet of the relevant paragraph from which the highlighted terms come:

-> attached to the Acquisition Plan and signed off<-

Has anyone seen anything like this before? Is this a genuine new bug, or something of which the Lucene folk (or at least whoever wrote the highlighter) are aware? Can anyone think of a way to fix this without scanning every element in my result text for rogue spaces?
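
For reference, the kind of post-processing I would like to avoid looks roughly like the sketch below. It is only a stopgap and does nothing about whatever introduces the space. The class name EmTagRepair and the method repair are made up for illustration, it assumes <em> is the only tag my formatter emits, and it uses only java.util.regex rather than any Lucene API.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of Lucene): repairs <em> tags that have
// picked up stray whitespace, e.g. "< em>" or "</ em>".
class EmTagRepair {
    // "<", optional whitespace, optional "/", optional whitespace, "em",
    // optional whitespace, ">". Well-formed tags also match and are
    // rewritten to themselves, so they pass through unchanged.
    private static final Pattern BROKEN_EM =
            Pattern.compile("<\\s*(/?)\\s*em\\s*>");

    static String repair(String fragment) {
        // $1 is either "" (opening tag) or "/" (closing tag).
        return BROKEN_EM.matcher(fragment).replaceAll("<$1em>");
    }

    public static void main(String[] args) {
        String broken =
                "attached to the <em>Acquisition</em>\n< em>Plan</em>and signed";
        System.out.println(repair(broken));
    }
}
```

This would have to run over every highlighted fragment before it is written into the result XML, which is exactly the scanning I was hoping not to need.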

Thanks in advance
Jason.