Indexing PDF document

jim shirreffs Wed, 06 Jun 2007 15:16:37 -0700

Well, I got nowhere trying to index OpenOffice documents, so I thought I'd try indexing PDF documents. PDFBox seemed like a good bet: it claims to offer Lucene support and is on the Lucene recommended list. But after numerous failed attempts I decided to try the IndexFiles.java that comes with PDFBox, and I get the same error my modified Lucene demo code gets.

C:\PDFBox-0.7.3\classes>java org.pdfbox.searchengine.lucene.IndexFiles -create -index c:\index c:\test
root=c:\test

Skipping c:\test\HTMLParser.java
Skipping c:\test\SearchFiles.java
Indexing PDF document: c:\test\doc.pdf

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.lucene.document.Document.add(Lorg/apache/lucene/document/Field;)V
        at org.pdfbox.searchengine.lucene.LucenePDFDocument.addUnindexedField(LucenePDFDocument.java:224)
        at org.pdfbox.searchengine.lucene.LucenePDFDocument.convertDocument(LucenePDFDocument.java:265)
        at org.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument(LucenePDFDocument.java:377)
        at org.pdfbox.searchengine.lucene.IndexFiles.addDocument(IndexFiles.java:295)
        at org.pdfbox.searchengine.lucene.IndexFiles.indexDocs(IndexFiles.java:269)
        at org.pdfbox.searchengine.lucene.IndexFiles.indexDocs(IndexFiles.java:236)
        at org.pdfbox.searchengine.lucene.IndexFiles.indexDocs(IndexFiles.java:223)
        at org.pdfbox.searchengine.lucene.IndexFiles.index(IndexFiles.java:165)
        at org.pdfbox.searchengine.lucene.IndexFiles.main(IndexFiles.java:140)

This is quite curious, since my code to index text documents does this successfully:


 /*
  * Add title
  */

document.add(new Field("title", title, Field.Store.YES, Field.Index.UN_TOKENIZED));


And looking at the failing PDFBox code, it is doing the EXACT SAME THING:

 document.add( new Field( name, value, Field.Store.YES, Field.Index.NO ) );


Very strange, since the exception is NoSuchMethodError for Document.add(Field).

And my custom code doing a doc.add(Field) works, but PDFBox's code doing a doc.add(Field) does not.
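One way to investigate this kind of mismatch is to list the method signatures that the class loaded at run time actually offers and compare them with the descriptor in the error. In JVM notation, (Lorg/apache/lucene/document/Field;)V means a method taking a single org.apache.lucene.document.Field argument and returning void. A NoSuchMethodError is raised when the calling class was compiled against a signature that the loaded class does not have, so one possible explanation (not confirmed anywhere in this post) is that the PDFBox classes were compiled against a different Lucene version than the one on the classpath. The sketch below is only an illustration: the class name MethodLister and its helper are hypothetical, and it uses nothing beyond standard Java reflection.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene or PDFBox.
final class MethodLister {

    /** Returns "returnType name(paramTypes)" for every public method with the given name. */
    static List<String> signatures(String className, String methodName)
            throws ClassNotFoundException {
        // Load without initializing; we only want to inspect the class.
        Class<?> cls = Class.forName(className, false, MethodLister.class.getClassLoader());
        List<String> result = new ArrayList<>();
        for (Method m : cls.getMethods()) {
            if (!m.getName().equals(methodName)) {
                continue;
            }
            StringBuilder sb = new StringBuilder();
            sb.append(m.getReturnType().getName()).append(' ').append(methodName).append('(');
            Class<?>[] params = m.getParameterTypes();
            for (int i = 0; i < params.length; i++) {
                if (i > 0) {
                    sb.append(", ");
                }
                sb.append(params[i].getName());
            }
            result.add(sb.append(')').toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Defaults assume the Lucene jar in question is on the classpath.
        String className = args.length > 0 ? args[0] : "org.apache.lucene.document.Document";
        String methodName = args.length > 1 ? args[1] : "add";
        try {
            for (String s : signatures(className, methodName)) {
                System.out.println(s);
            }
        } catch (ClassNotFoundException e) {
            System.out.println("Class not on classpath: " + className);
        }
    }
}
```

Run with the same classpath as the failing program. If no listed signature takes org.apache.lucene.document.Field and returns void, the caller and the loaded Lucene jar disagree about the method.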


As a check for a classpath problem, I tried this:

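import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.pdfbox.searchengine.lucene.LucenePDFDocument;

public class IndexMain
{
    public void indexDoc(String filename, String title, String objectId, String nodeId) throws Exception
    {
        File INDEX_DIR = new File("index");
        KcmiDocument kcmiDoc = null;    // my own class; it does the doc.add(Field)
        Document pdfDocument = null;
        LucenePDFDocument lpdf = new LucenePDFDocument();

        IndexWriter writer = new IndexWriter(INDEX_DIR, new StandardAnalyzer());

        File file = new File(filename);

        if (filename.endsWith("pdf"))
            pdfDocument = lpdf.getDocument(file);           // PDFBox path: exception is thrown here
        else
            kcmiDoc = new KcmiDocument(objectId, title);    // text path: works
    }
}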

Here KcmiDocument does the doc.add(Field), and lpdf.getDocument also does the doc.add(Field).

When I send in a .txt file all is well; when I send in a .pdf file the exception is thrown.
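A complementary classpath check is to print where each class is actually loaded from, since two copies of a library on the classpath can behave differently depending on which one the class loader finds first. The following is a minimal sketch using only the standard ProtectionDomain and CodeSource API; the class name ClassOrigin is hypothetical, and the default class names in main are simply the ones that appear in the stack trace above.

```java
import java.security.CodeSource;

// Hypothetical helper, not part of Lucene or PDFBox.
final class ClassOrigin {

    /** Returns the jar or directory a class was loaded from, or a placeholder if unknown. */
    static String locate(String className) throws ClassNotFoundException {
        Class<?> cls = Class.forName(className, false, ClassOrigin.class.getClassLoader());
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        // JDK core classes typically report no code source.
        if (src == null || src.getLocation() == null) {
            return "(bootstrap or unknown)";
        }
        return src.getLocation().toString();
    }

    public static void main(String[] args) {
        String[] names = args.length > 0 ? args : new String[] {
            "org.apache.lucene.document.Document",
            "org.pdfbox.searchengine.lucene.LucenePDFDocument"
        };
        for (String name : names) {
            try {
                System.out.println(name + " <- " + locate(name));
            } catch (ClassNotFoundException e) {
                System.out.println(name + " <- not found on classpath");
            }
        }
    }
}
```

If Document turns out to come from a different Lucene jar than the one PDFBox shipped with, that would be consistent with the text path working while the PDF path fails.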

If anyone knows what I am doing wrong, or knows of another easy method to extract text from a PDF file, I would certainly like to know. I can live without OpenOffice (for a while), but not being able to index PDF would be a Lucene show stopper.



thanks
jim s














