Sonja,
Do you have an example, or at least some relevant code, that would
help the community resolve this?
Erik
On Dec 8, 2005, at 4:24 AM, Sonja Löhr wrote:
Hi, all!
I have a question concerning analysis and highlighting. I'm indexing
multiple document formats (so far, only HTML and PDF have occurred)
and use the highlighter from the Lucene sandbox.
The documents' text is extracted via JTidy and PDFBox, respectively,
then analysed in both indexing and search with a custom subclass of
GermanAnalyzer, which replaces character references and HTML entities.
Now the funny thing is that, even though I store the body text, the
highlighter uses wrong positions for Lucene documents stemming from
PDF files, whereas HTML is highlighted correctly. I really don't have
an explanation for this behaviour, since doc.get("body") is simply
text in either case, and stop words were also removed in ALL kinds of
documents (through an instance of the same analyzer that is passed to
QueryParser).
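One way to narrow this down might be to check whether the character
offsets computed during analysis still select the matching terms in
the stored text. The sketch below is plain Java, not Lucene code: the
OffsetCheck class, its Span holder and the whitespace tokenizer are
hypothetical stand-ins for illustration only, and the sample strings
are made up. It only illustrates one possible cause (offsets computed
on text that differs from what is stored), which may or may not be
what happens with the PDF documents.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper; none of these names come from Lucene.
class OffsetCheck {

    /** One token's text plus its start/end character offsets. */
    static final class Span {
        final String term;
        final int start;
        final int end;

        Span(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Splits on whitespace and records offsets into the given text. */
    static List<Span> tokenize(String text) {
        List<Span> spans = new ArrayList<>();
        int i = 0;
        int n = text.length();
        while (i < n) {
            while (i < n && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            while (i < n && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                spans.add(new Span(text.substring(start, i), start, i));
            }
        }
        return spans;
    }

    /** Counts spans whose offsets do not select their own term in 'stored'. */
    static int countMismatches(List<Span> spans, String stored) {
        int bad = 0;
        for (Span s : spans) {
            boolean inRange = s.start >= 0 && s.start <= s.end
                    && s.end <= stored.length();
            if (!inRange || !stored.substring(s.start, s.end).equals(s.term)) {
                bad++;
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // Assumed example: text as the analyzer sees it after entity replacement
        String analyzed = "Die L\u00f6sung ist einfach";
        // Assumed example: text as stored, with the entity still in place
        String stored = "Die L&ouml;sung ist einfach";

        List<Span> spans = tokenize(analyzed);
        System.out.println("mismatches vs analyzed text: "
                + countMismatches(spans, analyzed));
        System.out.println("mismatches vs stored text:   "
                + countMismatches(spans, stored));
    }
}
```

If a check like this reports mismatches for the PDF documents only, the
stored body and the text the offsets were computed on probably differ
for that format. With real Lucene tokens, the spans would be filled
from each Token's startOffset() and endOffset() instead of the
whitespace tokenizer used here.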
Are there any hints as to where I could find my error - or did anybody
else encounter the same problem?
Thanks in advance!
sonja
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]