Re: pdf and highlighting

Erik Hatcher Thu, 08 Dec 2005 01:59:24 -0800

Sonja,

Do you have an example, or at least some relevant code, that would help the community resolve this?


        Erik

On Dec 8, 2005, at 4:24 AM, Sonja Löhr wrote:

Hi, all!

I have a question concerning analysis and highlighting. I'm indexing
multiple document formats (up to now, only HTML and PDF have occurred) and
use the highlighter from the Lucene sandbox.
The documents' text is extracted via JTidy and PDFBox, respectively, then
in both indexing and search analysed with a custom subclass of
GermanAnalyzer, which replaces character references and HTML entities.
Now the funny thing is that, even if I store the body text, the highlighter
uses wrong positions with Lucene Docs stemming from PDF documents, whereas
HTML is highlighted correctly. I really don't have an explanation for this
behaviour, since doc.get("body") is simply text in either case, and
stopwords were also removed in ALL kinds of documents (and through an
instance of the same analyzer passed to QueryParser).
Are there any hints as to where I could find my error, or did anybody else
encounter the same problem?
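
One way to narrow down a symptom like this is to check whether the offsets
produced during analysis still point at the right characters in the stored
text. The sketch below is not the code from this thread and does not use
Lucene. OffsetCheck, Tok, tokenize and countMismatches are hypothetical names,
and the letter-run tokenizer is only a stand-in for the custom GermanAnalyzer
subclass. It assumes one possible cause: the text is changed (here, entities
are replaced) before tokenizing, while the highlighter is given the unchanged
stored string. Whether that is what happens with the PDF documents is not
known from the message.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetCheck {

    /** A term plus the character offsets the tokenizer reported for it. */
    static final class Tok {
        final String term;
        final int start;
        final int end;

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Stand-in tokenizer (an assumption): emits runs of letters with their offsets. */
    static List<Tok> tokenize(String text) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Skip separators, then consume one run of letters.
            while (i < text.length() && !Character.isLetter(text.charAt(i))) i++;
            int start = i;
            while (i < text.length() && Character.isLetter(text.charAt(i))) i++;
            if (i > start) out.add(new Tok(text.substring(start, i), start, i));
        }
        return out;
    }

    /** Counts tokens whose offsets do not select their own term in the stored text. */
    static int countMismatches(List<Tok> tokens, String stored) {
        int bad = 0;
        for (Tok t : tokens) {
            boolean inRange = t.end <= stored.length();
            if (!inRange || !stored.substring(t.start, t.end).equals(t.term)) bad++;
        }
        return bad;
    }

    public static void main(String[] args) {
        // Text as it would be stored and later handed to the highlighter.
        String stored = "Gr&uuml;ne B&auml;ume";
        // Text after entity replacement, i.e. what the tokenizer actually sees.
        String cleaned = stored.replace("&uuml;", "\u00fc").replace("&auml;", "\u00e4");

        // Same string tokenized and checked: offsets line up.
        System.out.println(countMismatches(tokenize(stored), stored));   // 0
        // Tokenized after replacement, checked against the original: offsets drift.
        System.out.println(countMismatches(tokenize(cleaned), stored));  // 2
    }
}
```

Running the same kind of check on one HTML body and one PDF body, with
offsets taken from the real analyzer instead of the stand-in, should show
whether the PDF offsets are already wrong at analysis time or only go wrong
later in the highlighter.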

Thanks in advance!

sonja




---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

