pdf and highlighting

Sonja Löhr Thu, 08 Dec 2005 01:25:01 -0800

Hi, all!

I have a question concerning analysis and highlighting. I'm indexing
multiple document formats (up to now, only html and pdf have occurred) and use
the highlighter from the Lucene sandbox.
The documents' text is extracted via JTidy and PDFBox, respectively, then in
both indexing and search analysed with a custom subclass of GermanAnalyzer,
which replaces character references and html entities.


Now the funny thing is that, even though I store the body text, the highlighter
uses wrong positions with Lucene docs stemming from pdf documents, whereas html
is highlighted correctly.  I really don't have an explanation for this
behaviour, for doc.get("body") is simply text in either case, and stop
words were also removed in ALL kinds of documents (and through an instance
of the same analyzer passed to QueryParser).
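
A check along the following lines might help narrow this down. It is only a
sketch in plain Java without Lucene: the class OffsetCheck, the analyze()
stand-in and the "&uuml;" replacement are hypothetical and are not the actual
analyzer or Highlighter code. It illustrates one possible cause, namely that
token offsets computed on text after a replacement step no longer match the
text that was stored.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetCheck {

    /** A term together with the character offsets reported for it. */
    public static final class Token {
        public final String term;
        public final int start;
        public final int end;

        public Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Hypothetical stand-in for an analyzer: replaces one html entity,
     * then splits on non-letters. Offsets refer to the REPLACED text.
     */
    public static List<Token> analyze(String text) {
        String replaced = text.replace("&uuml;", "\u00fc");
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < replaced.length()) {
            // skip separators
            while (i < replaced.length() && !Character.isLetter(replaced.charAt(i))) {
                i++;
            }
            int start = i;
            // consume one run of letters
            while (i < replaced.length() && Character.isLetter(replaced.charAt(i))) {
                i++;
            }
            if (i > start) {
                tokens.add(new Token(replaced.substring(start, i), start, i));
            }
        }
        return tokens;
    }

    /**
     * Returns the index of the first token whose offsets do not select
     * its own term in the stored text, or -1 if all offsets line up.
     */
    public static int firstMismatch(String stored, List<Token> tokens) {
        for (int k = 0; k < tokens.size(); k++) {
            Token t = tokens.get(k);
            if (t.end > stored.length()
                    || !stored.substring(t.start, t.end).equals(t.term)) {
                return k;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String clean = "Die T\u00fcr ist offen";   // stored text equals analysed text
        String raw = "Die T&uuml;r ist offen";     // stored text still contains the entity
        System.out.println(firstMismatch(clean, analyze(clean))); // -1
        System.out.println(firstMismatch(raw, analyze(raw)));     // 1
    }
}
```

If the same kind of check, run with the real analyzer's token offsets against
doc.get("body"), fails only for the pdf documents, the stored text and the
analysed text differ for them in some way.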

Are there any hints to where I could find my error - or did anybody else
encounter the same problem?

Thanks in advance!

sonja



