Thank you both, I found it (I really asked a bit too early, sorry)
The highlighter works correctly if I use my custom Analyzer during indexing (and for the QueryParser), BUT when preparing the TokenStream to feed the highlighter, I must NOT use it.

    TokenStream tStream = new GermanAnalyzer().tokenStream("body",
        new StringReader(bodyText));
    System.out.println(
        highlighter.getBestFragments(tStream, bodyText, 4, " ..... "));

works, whereas

    TokenStream tStream = new GermanHtmlAnalyzer().tokenStream("body",
        new StringReader(bodyText));
    System.out.println(
        highlighter.getBestFragments(tStream, bodyText, 4, " ..... "));

gives rubbish highlighting.

GermanHtmlAnalyzer feeds a normal GermanAnalyzer with a shortened String (native characters) if the input contains decimal or HTML entities, but then I'm totally confused why there is a problem with PDF text and not with HTML text...

Here is the Analyzer's tokenStream method again; resolveEntities does just String replacement:

    public TokenStream tokenStream(String fieldName, Reader reader) {
        try {
            return new GermanAnalyzer().tokenStream(fieldName,
                resolveEntities(reader));
        }
        catch (IOException ioe) {
            return null;
        }
    }

Greetings!
sonja

> -----Original Message-----
> From: Erik Hatcher [mailto:[EMAIL PROTECTED]
> Sent: Thursday, 8 December 2005 14:02
> To: java-user@lucene.apache.org
> Subject: Re: pdf and highlighting
>
> Wow, those were some great details. But, as I hope you've seen with
> some other recent issues, things become so much clearer when you can
> isolate the issues. This is one reason that test-driven development
> with unit tests is so amazingly helpful. If you could isolate a
> single PDF going through the processes individually, such as parsing,
> tokenization, indexing, and then highlighting, you will likely find
> the problem. It is difficult, at best, to follow along the various
> layers you have created without really digging with a lot of time.
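[The "shortened String" mentioned above is very likely the culprit: entity replacement changes the text length, so the character offsets carried by tokens from GermanHtmlAnalyzer no longer match the original bodyText handed to getBestFragments(). A minimal plain-Java sketch of the offset shift, with hypothetical strings and no Lucene involved:]

```java
public class OffsetShiftDemo {
    public static void main(String[] args) {
        // Hypothetical field text as stored (entities still encoded)
        String raw = "M&uuml;nchen ist sch&ouml;n";
        // What a resolveEntities()-style step would hand to the analyzer:
        // each entity collapses to a single character, shortening the text
        String resolved = raw.replace("&uuml;", "ü").replace("&ouml;", "ö");

        // Token offsets as the analyzer sees them (in the resolved text)
        int start = resolved.indexOf("schön");   // 12
        int end = start + "schön".length();      // 17

        // Applying those offsets to the original text marks the wrong span
        System.out.println("[" + resolved.substring(start, end) + "]"); // prints "[schön]"
        System.out.println("[" + raw.substring(start, end) + "]");      // prints "[ ist ]"
    }
}
```

[The same mechanism would explain correct highlighting whenever the input contains no entities, since then resolveEntities leaves the length unchanged.]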
> If you have a unit test that still fails your expectations, then a
> JUnit test and the offending PDF file would likely yield immediate
> solutions from this list.
>
> Erik
>
> On Dec 8, 2005, at 6:04 AM, Sonja Löhr wrote:
>
> > Hi, Erik and the other experts!
> >
> > I'll try to collect some code fragments. Many things are
> > configurable and I wrote a Crawler for indexing, but the rest is
> > very close to the examples in "Lucene in Action". I hope I chose
> > the appropriate snippets.
> >
> > The analyzer I use is created once and stored in a Config object
> > made available to almost every class, along with other configurable
> > data.
> >
> > INDEXING:
> >
> > In JTidyHtmlHandler (extends CrawlDocumentHandler):
> >
> >     // getBody() extracts the textual content under <body>
> >     String body = getBody(rawDoc);
> >     if (body == null) {
> >         return null;
> >     }
> >     setMainField(doc, body);
> >
> > =============================================================
> > In PdfBoxPDFHandler (extends CrawlDocumentHandler):
> >
> >     PDFTextStripper stripper = new PDFTextStripper();
> >     pddoc = new PDDocument(cosDoc);
> >     docText = stripper.getText(pddoc);
> >     [...]
> >     if (docText != null) {
> >         setMainField(doc, docText);
> >     }
> >
> > =================================================================
> > In CrawlDocumentHandler implements DocumentHandler (as found in
> > Erik's book):
> >
> >     public void setMainField(Document doc, String txt) {
> >         if (txt == null || txt.equals("")) return;
> >         if (conf.storeMainField()) {
> >             doc.add(Field.Text(conf.mainFieldName, txt));
> >         }
> >         else doc.add(Field.UnStored(conf.mainFieldName, txt));
> >     }
> >
> > ===================================================================
> > In CrawlIndexer:
> >
> >     while (crawler.hasNext()) {
> >         CrawlDocumentHandler handler = getHandler(assoc, suffix, mime);
> >         ...
> >         doc = handler.getDocument(onlineDoc.getIn());
> >         if (doc != null) {
> >             doc.add(Field.Keyword("url", onlineDoc.getUrl()));
> >             Iterator writers =
> >                 config.getWritersForUrl(onlineDoc.getUrl()).iterator();
> >             while (writers.hasNext()) {
> >                 ((IndexWriter) writers.next()).addDocument(doc);
> >             }
> >         }
> >     }
> > ====================================================================
> >
> > I have a Set of Index objects, each storing its writer, which is
> > initialised like this (the analyzer again comes from Config):
> >
> >     this.writer = new IndexWriter(dir, analyzer, true);
> >
> > =====================================================================
> >
> > Ok, now the index is made up with the stored body text of the
> > documents, each analyzed with my extension of GermanAnalyzer.
> >
> > GermanHtmlAnalyzer extends Analyzer:
> >
> >     public TokenStream tokenStream(String fieldName, Reader reader) {
> >         try {
> >             return new GermanAnalyzer().tokenStream(fieldName,
> >                 resolveEntities(reader));
> >         }
> >         catch (IOException ioe) {
> >             return null;
> >         }
> >     }
> >
> > (resolveEntities returns a StringReader in which, for example,
> > &uuml; or &#252; are replaced by 'ü'.)
> >
> > ======================================================================
> >
> > SEARCH:
> >
> > Here are some snippets of the code that provides the JavaBeans to
> > be passed to some JSP page:
> >
> >     // By now the only implementation is HtmlFragmentDisplay
> >     FragmentDisplay fragDisp =
> >         (FragmentDisplay) Class.forName(displayClassName).newInstance();
> >     IndexSearcher searcher = new IndexSearcher(dir);
> >     Query q = MultiFieldQueryParser.parse(query,
> >         new String[]{"body", "title"}, conf.getAnalyzer());
> >     Hits hits = searcher.search(q);
> >     for ( [hits to be shown to the user] ) {
> >         ...
> >         if (conf.storeMainField()) {
> >             result.setFragment(
> >                 fragDisp.getDisplayText(doc.get("body"), q));
> >         }
> >         else result.setFragment(fragDisp.getDisplayText(
> >             new URL(doc.get("url")), q));
> >         ...
> >         results.add(result);
> >     }
> >
> > ===========================================================================
> >
> > In HtmlFragmentDisplay:
> >
> >     public String getDisplayText(String bodyText, Query query) {
> >         QueryScorer scorer = new QueryScorer(query);
> >         SimpleHTMLFormatter formatter = new SimpleHTMLFormatter(
> >             "<span class=\"highlighted\">", "</span>");
> >         Highlighter highlighter = new Highlighter(formatter, scorer);
> >         Fragmenter fragmenter = new SimpleFragmenter(60);
> >         highlighter.setTextFragmenter(fragmenter);
> >         Analyzer analyzer = conf.getAnalyzer();
> >         TokenStream tStream = analyzer.tokenStream("body",
> >             new StringReader(bodyText));
> >         return highlighter.getBestFragments(tStream, bodyText, 4,
> >             " ..... ");
> >     }
> >
> > (getDisplayText(URL url, Query query) fetches the document by its
> > URL, again uses the DocumentHandlers and finally calls the above
> > method. I switched from not storing the body text to storing it,
> > but that didn't affect the highlighting problem.)
> >
> > ===========================================================================
> >
> > So...... Result.getFragment() is what the user sees on the JSP
> > page. If it happens to be taken from a JTidy-indexed Lucene
> > document, everything is well; if it comes from PDFBox, the wrong
> > text is highlighted.
> > I also tried with QueryParser.parse() instead of
> > MultiFieldQueryParser, but the output didn't change.
> >
> > Many many thanks if you read until here!
> >
> > And even more if you have an idea where the error is likely to be
> > found.
> >
> > sonja
> >
> >> -----Original Message-----
> >> From: Erik Hatcher [mailto:[EMAIL PROTECTED]
> >> Sent: Thursday, 8 December 2005 10:59
> >> To: java-user@lucene.apache.org
> >> Subject: Re: pdf and highlighting
> >>
> >> Sonja,
> >>
> >> Do you have an example, or at least some relevant code, that would
> >> help the community in helping resolve this?
> >>
> >> Erik
> >>
> >> On Dec 8, 2005, at 4:24 AM, Sonja Löhr wrote:
> >>
> >>> Hi, all!
> >>>
> >>> I have a question concerning analysis and highlighting. I'm
> >>> indexing multiple document formats (up to now, only HTML and PDF
> >>> have occurred) and use the highlighter from the Lucene sandbox.
> >>> The documents' text is extracted via JTidy and PDFBox,
> >>> respectively, then in both indexing and search analysed with a
> >>> custom subclass of GermanAnalyzer, which replaces character
> >>> references and HTML entities.
> >>>
> >>> Now the funny thing is that, even if I store the body text, the
> >>> highlighter uses wrong positions with Lucene Docs stemming from
> >>> PDF documents, whereas HTML is highlighted correctly. I really
> >>> don't have an explanation for this behaviour, for doc.get("body")
> >>> is simply text in either case, and stop words were also removed
> >>> in ALL kinds of documents (and through an instance of the same
> >>> analyzer passed to QueryParser).
> >>>
> >>> Are there any hints as to where I could find my error, or did
> >>> anybody else encounter the same problem?
> >>>
> >>> Thanks in advance!
> >>>
> >>> sonja

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
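[The mismatch described in this thread can be reproduced without Lucene at all. The sketch below is not Sonja's code: a naive whitespace tokenizer stands in for the analyzer, and a hand-rolled tag inserter stands in for Highlighter/SimpleHTMLFormatter. It shows that token offsets only mark the right span when they come from exactly the same text that is being highlighted:]

```java
import java.util.ArrayList;
import java.util.List;

public class HighlightMismatchDemo {
    // Minimal stand-in for a token with character offsets
    static class Tok {
        final String term; final int start, end;
        Tok(String term, int start, int end) {
            this.term = term; this.start = start; this.end = end;
        }
    }

    // Naive whitespace tokenizer recording offsets into the text it was given
    static List<Tok> tokenize(String text) {
        List<Tok> toks = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && text.charAt(i) == ' ') i++;
            int start = i;
            while (i < text.length() && text.charAt(i) != ' ') i++;
            if (i > start) toks.add(new Tok(text.substring(start, i), start, i));
        }
        return toks;
    }

    // Wrap every token equal to `term` in <b>..</b>, trusting the token offsets
    static String highlight(String original, List<Tok> toks, String term) {
        StringBuilder sb = new StringBuilder(original);
        // walk backwards so earlier offsets stay valid after each insertion
        for (int i = toks.size() - 1; i >= 0; i--) {
            Tok t = toks.get(i);
            if (t.term.equals(term)) {
                sb.insert(t.end, "</b>");
                sb.insert(t.start, "<b>");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String body = "der Text ist sch&ouml;n";        // what is stored and displayed
        String resolved = body.replace("&ouml;", "ö");  // what the analyzer saw

        // Offsets taken from the SAME text: the tag lands on the right word
        System.out.println(highlight(body, tokenize(body), "sch&ouml;n"));
        // prints: der Text ist <b>sch&ouml;n</b>

        // Offsets from the shortened text applied to the original: tag lands mid-entity
        System.out.println(highlight(body, tokenize(resolved), "schön"));
        // prints: der Text ist <b>sch&o</b>uml;n
    }
}
```

[In Lucene terms, this is why the TokenStream fed to getBestFragments has to be built from the unmodified bodyText rather than from the entity-resolving GermanHtmlAnalyzer.]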