Hi all,
I have come across what I think is a curious but insidious bug in the
Java Lucene hit highlighter. I first found this problem in Lucene v1.4,
so I updated to the latest version of Lucene and the highlighter;
unfortunately it's still there in v2.0.0.
I am indexing XML documents and am also using the hit highlighter for
search results. This works perfectly in every case except one.
In my code I have this:
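import org.apache.lucene.search.highlight.TokenGroup;

public class LuceneSearch implements
        org.apache.lucene.search.highlight.Formatter
{
    ...
    // Wrap matching terms in <em> tags; leave non-matching text untouched.
    public String highlightTerm(String originalText, TokenGroup group)
    {
        if (group.getTotalScore() <= 0)
        {
            return originalText;
        }
        return "<em>" + originalText + "</em>";
    }
}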
When I search for -> Acquisition Plan <-
my search results contain:
<summary>(ancilliary stuff deleted)....
attached to the <em>Acquisition</em>
< em>Plan</em>and signed</summary>
Notice the space between the < and the e in the second < em>.
As far as I know, this only occurs for these search terms and this
document, but because it's part of a much larger XML document it breaks
the whole thing.
The original XML is unremarkable, with no strange characters surrounding
these terms. Here is a snippet from the paragraph the highlighted terms
come from:
-> attached to the Acquisition Plan and signed off<-
Has anyone seen anything like this before? Is this a genuine new bug, or
something the Lucene folk (or at least whoever wrote the highlighter)
are already aware of? Can anyone think of a way to fix this without
scanning every element in my result text for rogue spaces?
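For reference, the stopgap I would like to avoid looks roughly like the
sketch below. It is only an illustration: the class name EmTagRepair and
the method repair are made up for this post, and it assumes the only
damage is stray whitespace inside the <em> and </em> tags that my
formatter emits. It does nothing about whatever inserts the space in the
first place.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of Lucene): strips stray whitespace
// from inside <em> and </em> tags in an already-highlighted fragment.
final class EmTagRepair {
    // "<", optional whitespace, optional "/", optional whitespace,
    // "em", optional whitespace, ">".
    private static final Pattern ROGUE_EM =
            Pattern.compile("<\\s*(/?)\\s*em\\s*>");

    static String repair(String fragment) {
        // $1 is either "" (opening tag) or "/" (closing tag).
        return ROGUE_EM.matcher(fragment).replaceAll("<$1em>");
    }

    public static void main(String[] args) {
        String broken = "attached to the <em>Acquisition</em>\n"
                + "< em>Plan</em>and signed";
        System.out.println(repair(broken));
    }
}
```

This only touches the highlight tags themselves, so it should be safe as
long as literal < characters in the text are already escaped, but it
still means an extra pass over every fragment.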
Thanks in advance
Jason.