Wow, those were some great details. But, as I hope you've seen with some other recent issues, things become much clearer when you can isolate the problem. This is one reason test-driven development with unit tests is so amazingly helpful. If you can isolate a single PDF and run it through each step individually (parsing, tokenization, indexing, and then highlighting), you will likely find the problem. It is difficult, at best, to follow the various layers you have created without a lot of digging and time.

If you end up with a unit test that still fails your expectations, posting that JUnit test and the offending PDF file would likely yield immediate solutions from this list.
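For example, a bare-bones test along these lines would exercise one PDF end to end, with no index and no crawler in the way. The class, file, and query names are made up, and I'm assuming the PDFBox and highlighter calls from your code below:

    import java.io.StringReader;
    import junit.framework.TestCase;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.highlight.Highlighter;
    import org.apache.lucene.search.highlight.QueryScorer;
    import org.pdfbox.pdmodel.PDDocument;
    import org.pdfbox.util.PDFTextStripper;

    public class PdfHighlightTest extends TestCase {

        public void testHighlightSinglePdf() throws Exception {
            // 1. Extract the text of one known PDF, nothing else.
            PDDocument pddoc = PDDocument.load("sample.pdf");
            String body = new PDFTextStripper().getText(pddoc);
            pddoc.close();

            // 2. Build the same analyzer and query used at search time.
            Analyzer analyzer = new GermanHtmlAnalyzer();
            Query q = QueryParser.parse("someWordInThePdf", "body", analyzer);

            // 3. Highlight directly against the extracted text, no index involved.
            //    The default formatter marks hits with <B>...</B>.
            Highlighter highlighter = new Highlighter(new QueryScorer(q));
            TokenStream ts = analyzer.tokenStream("body", new StringReader(body));
            String fragment = highlighter.getBestFragment(ts, body);

            // 4. The fragment should contain the query term, correctly marked up.
            assertNotNull(fragment);
            assertTrue(fragment.indexOf("<B>") != -1);
        }
    }

If this fails, you know the problem lives in extraction/analysis/highlighting and not in your crawler or index plumbing.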

        Erik

On Dec 8, 2005, at 6:04 AM, Sonja Löhr wrote:


Hi, Erik and the other experts!

I'll try to collect some code fragments. Many things are configurable, and I wrote a Crawler for indexing, but the rest is very close to the examples in "Lucene in Action". I hope I chose the appropriate snippets.

The analyzer I use is created once and stored in a Config object made
available to almost every class, along with other configurable data.

INDEXING:

in JTidyHtmlHandler (extends CrawlDocumentHandler):

    // getBody() extracts the textual content under <body>
    String body = getBody(rawDoc);
    if (body == null) {
        return null;
    }
    setMainField(doc, body);

=============================================================
in PdfBoxPDFHandler (extends CrawlDocumentHandler):

    PDFTextStripper stripper = new PDFTextStripper();
    pddoc = new PDDocument(cosDoc);
    docText = stripper.getText(pddoc);
    [...]
    if (docText != null) {
        setMainField(doc, docText);
    }

=================================================================
in CrawlDocumentHandler implements DocumentHandler (as found in Erik's book):

    public void setMainField(Document doc, String txt) {
        if (txt == null || txt.equals("")) return;
        if (conf.storeMainField()) {
            // stored and tokenized
            doc.add(Field.Text(conf.mainFieldName, txt));
        } else {
            // tokenized only, not stored
            doc.add(Field.UnStored(conf.mainFieldName, txt));
        }
    }
===================================================================

In CrawlIndexer:

    while (crawler.hasNext()) {
        CrawlDocumentHandler handler = getHandler(assoc, suffix, mime);
        ...
        doc = handler.getDocument(onlineDoc.getIn());
        if (doc != null) {
            doc.add(Field.Keyword("url", onlineDoc.getUrl()));
            Iterator writers = config.getWritersForUrl(onlineDoc.getUrl()).iterator();
            while (writers.hasNext()) {
                ((IndexWriter) writers.next()).addDocument(doc);
            }
        }
    }
====================================================================

(I have a Set of Index objects, each storing its writer, which is initialised like this; the analyzer again comes from Config:)

    this.writer = new IndexWriter(dir, analyzer, true);  // true = create a new index, overwriting any existing one

=====================================================================

OK, now the index is built with the stored body text of the documents, each analyzed with my extension of GermanAnalyzer:


GermanHtmlAnalyzer extends Analyzer:

    public TokenStream tokenStream(String fieldName, Reader reader) {
        try {
            return new GermanAnalyzer().tokenStream(fieldName, resolveEntities(reader));
        } catch (IOException ioe) {
            return null;
        }
    }

(resolveEntities returns a StringReader in which, for example, &#252; or &uuml; are replaced by 'ü')
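In outline it works roughly like this (a simplified sketch; the real method covers more entities):

    // Sketch of resolveEntities (simplified).  Note that replacing the
    // six characters of "&uuml;" with the single character 'ü' shortens
    // the text, so any token offsets produced afterwards refer to the
    // resolved text, not to the raw input.
    private Reader resolveEntities(Reader reader) throws IOException {
        StringBuffer sb = new StringBuffer();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        String text = sb.toString()
                        .replaceAll("&#252;", "ü")
                        .replaceAll("&uuml;", "ü");  // ... and so on
        return new StringReader(text);
    }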
======================================================================


SEARCH:

// Here are some snippets of the code that builds the JavaBeans passed to a JSP page.

// By now the only implementation is HtmlFragmentDisplay
FragmentDisplay fragDisp = (FragmentDisplay) Class.forName(displayClassName).newInstance();
IndexSearcher searcher = new IndexSearcher(dir);
Query q = MultiFieldQueryParser.parse(query, new String[] {"body", "title"}, conf.getAnalyzer());
Hits hits = searcher.search(q);
for ( [hits to be shown to the user] ) {
    ...
    if (conf.storeMainField()) {
        result.setFragment(fragDisp.getDisplayText(doc.get("body"), q));
    } else {
        result.setFragment(fragDisp.getDisplayText(new URL(doc.get("url")), q));
    }
    ...
    results.add(result);
}

======================================================================

In HtmlFragmentDisplay:

public String getDisplayText(String bodyText, Query query) throws IOException {

    QueryScorer scorer = new QueryScorer(query);
    SimpleHTMLFormatter formatter =
            new SimpleHTMLFormatter("<span class=\"highlighted\">", "</span>");
    Highlighter highlighter = new Highlighter(formatter, scorer);
    Fragmenter fragmenter = new SimpleFragmenter(60);
    highlighter.setTextFragmenter(fragmenter);
    Analyzer analyzer = conf.getAnalyzer();
    TokenStream tStream = analyzer.tokenStream("body", new StringReader(bodyText));
    return highlighter.getBestFragments(tStream, bodyText, 4, " ..... ");
}

(getDisplayText(URL url, Query query) fetches the document by its URL, again uses the DocumentHandlers, and finally calls the method above. I switched from not storing the body text to storing it, but that didn't affect the highlighting problem.)
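To see where the offsets drift, a quick check like this (using the same TokenStream API as above) could print each token next to the substring of the stored text at the token's reported offsets; the German stemmer alters termText(), but the substring should still contain the token's surface form:

    // Offset check: print each token beside the stored-text substring
    // at its reported offsets.  On the PDF-derived documents the two
    // columns should start to diverge where the misalignment begins.
    TokenStream ts = conf.getAnalyzer().tokenStream("body", new StringReader(bodyText));
    Token token;
    while ((token = ts.next()) != null) {
        String atOffsets = bodyText.substring(token.startOffset(), token.endOffset());
        System.out.println(token.termText() + " <-> " + atOffsets);
    }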

====================================================================== =====

So...... Result.getFragment() is what the user sees on the JSP page.
If it happens to come from a JTidy-indexed Lucene document, everything is fine; if it comes from PDFBox, the wrong text is highlighted.
I also tried QueryParser.parse() instead of MultiFieldQueryParser, but the output didn't change.

Many, many thanks if you read this far!

And even more if you have an idea where the error is likely to be found.

sonja




-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Thursday, 8 December 2005 10:59
To: java-user@lucene.apache.org
Subject: Re: pdf and highlighting

Sonja,

Do you have an example, or at least some relevant code, that would help the community resolve this?

        Erik

On Dec 8, 2005, at 4:24 AM, Sonja Löhr wrote:


Hi, all!

I have a question concerning analysis and highlighting. I'm indexing multiple document formats (up to now, only HTML and PDF have occurred) and use the highlighter from the Lucene sandbox. The documents' text is extracted via JTidy and PDFBox, respectively, then analysed in both indexing and search with a custom subclass of GermanAnalyzer, which replaces character references and HTML entities.

Now the funny thing is that, even if I store the body text, the highlighter uses wrong positions for Lucene Docs stemming from PDF documents, whereas HTML is highlighted correctly. I really don't have an explanation for this behaviour, for doc.get("body") is simply text in either case, and stop words were removed in ALL kinds of documents (and through an instance of the same analyzer passed to QueryParser).

Are there any hints as to where I could find my error, or did anybody else encounter the same problem?

Thanks in advance!

sonja





---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]