On Dec 8, 2005, at 10:51 AM, Sonja Löhr wrote:
Thank you both, I found it
(I really asked a bit too early, sorry)
The highlighter works correctly if I use my custom Analyzer during
indexing (and for QueryParser), BUT when preparing the TokenStream to
feed the highlighter, I must NOT use it.
TokenStream tStream = new GermanAnalyzer().tokenStream("body", new StringReader(bodyText));
System.out.println(highlighter.getBestFragments(tStream, bodyText, 4, "
..... "));
works, whereas
TokenStream tStream = new GermanHtmlAnalyzer().tokenStream("body", new StringReader(bodyText));
System.out.println(highlighter.getBestFragments(tStream, bodyText, 4, "
..... "));
gives rubbish highlighting.
GermanHtmlAnalyzer feeds a normal GermanAnalyzer with a shortened
String (entities replaced by native characters) if the input contains
decimal or HTML entities, but then I'm totally confused why there is a
problem with PDF text and not with HTML text...
The likely reason is that the token offsets fed to the highlighter
don't jibe with the character positions in the text you're
highlighting. You're generating token offsets from strings in which
the entities have been replaced (so they are likely a different
length), but highlighting the original text with the entities left
intact.
Maybe??
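To illustrate the suspected mismatch, here is a minimal sketch in plain
Java with no Lucene classes involved. The names decodeEntities,
findTokenOffsets and highlight are hypothetical stand-ins made up for
this example; they are not the GermanHtmlAnalyzer or Highlighter code.
The sketch only assumes that offsets are computed on the entity-decoded
string and then used to cut substrings out of some text, which is how
token start/end offsets are applied during highlighting.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OffsetMismatchDemo {

    // Hypothetical stand-in for the entity replacement done before analysis.
    // It decodes only the single entity used in this demo.
    static String decodeEntities(String text) {
        return text.replace("&uuml;", "\u00fc");
    }

    // Returns {start, end} of the first letter-only token equal to term
    // (ignoring case), measured in the text that was actually tokenized.
    static int[] findTokenOffsets(String analyzedText, String term) {
        Matcher m = Pattern.compile("\\p{L}+").matcher(analyzedText);
        while (m.find()) {
            if (m.group().equalsIgnoreCase(term)) {
                return new int[] { m.start(), m.end() };
            }
        }
        return null;
    }

    // Wraps the character range given by offsets in <B> tags.
    static String highlight(String text, int[] offsets) {
        return text.substring(0, offsets[0])
                + "<B>" + text.substring(offsets[0], offsets[1]) + "</B>"
                + text.substring(offsets[1]);
    }

    public static void main(String[] args) {
        String original = "F&uuml;r die Suche";      // text with the entity intact
        String decoded = decodeEntities(original);   // what the tokenizer sees

        // Offsets are valid for the decoded string only.
        int[] offsets = findTokenOffsets(decoded, "Suche");

        // Offsets applied to the original text: the wrong span is marked.
        System.out.println(highlight(original, offsets)); // F&uuml;r<B> die </B>Suche

        // Offsets applied to the text they were computed from: correct span.
        System.out.println(highlight(decoded, offsets));  // Für die <B>Suche</B>
    }
}
```

If this is what is happening, either pass the highlighter the same
decoded text the tokens were produced from, or build the TokenStream
with an analyzer that leaves the text length unchanged, as in the
GermanAnalyzer variant above that works.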
Erik