Hi, all! I have a question concerning analysis and highlighting. I'm indexing multiple document formats (up to now, only html and pdf occured, and use the highlighter from the Lucene sandbox. The documents text is extracted via JTidy and PDFBox, respectively, then in both indexing and search analysed with a custom subclass of GermanAnalyzer, which replaces character references and html entities.
Now the funny thing is that, even if I store the body text, highlighter uses wrong positions with lucene Docs stemming from pdf documents, whereas html is hightlighted correctly. I really don't have an explanation for this behaviour - for doc.get("body") is simply text, in either case, and stop words were also removed in ALL kinds of documents (and through an instance of the same analyzer passed to QueryParser. Are there any hints to where I could find my error - or did anybody else encounter the same problem? Thanks in advance! sonja --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]