Well, I got nowhere trying to index OpenOffice documents, so I thought I'd try indexing PDF documents. PDFBox seemed like a good bet: it claims to offer Lucene support and is on the Lucene recommended list. But after numerous failed attempts I decided to try the IndexFiles.java that comes with PDFBox, and I get the same error that my modified Lucene demo code gets.

C:\PDFBox-0.7.3\classes>java org.pdfbox.searchengine.lucene.IndexFiles -create -index c:\index c:\test root=c:\test
Skipping c:\test\HTMLParser.java
Skipping c:\test\SearchFiles.java
Indexing PDF document: c:\test\doc.pdf
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.lucene.document.Document.add(Lorg/apache/lucene/document/Field;)V
        at org.pdfbox.searchengine.lucene.LucenePDFDocument.addUnindexedField(LucenePDFDocument.java:224)
        at org.pdfbox.searchengine.lucene.LucenePDFDocument.convertDocument(LucenePDFDocument.java:265)
        at org.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument(LucenePDFDocument.java:377)
        at org.pdfbox.searchengine.lucene.IndexFiles.addDocument(IndexFiles.java:295)
        at org.pdfbox.searchengine.lucene.IndexFiles.indexDocs(IndexFiles.java:269)
        at org.pdfbox.searchengine.lucene.IndexFiles.indexDocs(IndexFiles.java:236)
        at org.pdfbox.searchengine.lucene.IndexFiles.indexDocs(IndexFiles.java:223)
        at org.pdfbox.searchengine.lucene.IndexFiles.index(IndexFiles.java:165)
        at org.pdfbox.searchengine.lucene.IndexFiles.main(IndexFiles.java:140)


This is quite curious, since my code to index text documents does this successfully:

 /*
  * Add title
  */
document.add(new Field("title", title, Field.Store.YES, Field.Index.UN_TOKENIZED));

 And looking at the failing PDFBox code it is doing the EXACT SAME THING

 document.add( new Field( name, value, Field.Store.YES, Field.Index.NO ) );


Very strange, since the exception is a NoSuchMethodError for Document.add(Field).

And my custom code doing a doc.add(Field) works but PDFBox's code doing a doc.add(Field) does not.

To check for a classpath problem I tried this:

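import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.pdfbox.searchengine.lucene.LucenePDFDocument;

public class IndexMain
{
    public void indexDoc(String filename, String title, String objectId, String nodeId) throws Exception
    {
        File INDEX_DIR = new File("index");
        KcmiDocument kcmiDoc = null;      // my own class; it does the doc.add(Field)
        Document pdfDocument = null;
        LucenePDFDocument lpdf = new LucenePDFDocument();

        IndexWriter writer = new IndexWriter(INDEX_DIR, new StandardAnalyzer());

        File file = new File(filename);

        if (filename.endsWith("pdf"))
            pdfDocument = lpdf.getDocument(file);   // the exception is thrown from here
        else
            kcmiDoc = new KcmiDocument(objectId, title);
    }
}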

Here KcmiDocument does the doc.add(Field) and lpdf.getDocument does the doc.add(Field).

When I send in a .txt file all is well; when I send in a .pdf file the exception is thrown.
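
A NoSuchMethodError on a call that compiles fine from source generally means the calling class was compiled against a different version of a library than the one loaded at run time. A further check, offered only as a minimal sketch, is to ask the JVM where it loaded Document from and which add(...) signatures that copy actually has. The class name SignatureCheck and its two helper methods are made up for this illustration; only standard reflection calls are used, so it needs nothing beyond the classpath of the failing run.

```java
import java.lang.reflect.Method;
import java.security.CodeSource;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene or PDFBox.
class SignatureCheck {

    /** Location (jar or directory) the class was loaded from, if the JVM reports one. */
    public static String loadedFrom(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            return "(bootstrap or unknown)";
        }
        return src.getLocation().toString();
    }

    /** One entry per public method with the given name, showing its parameter types. */
    public static List<String> signatures(Class<?> cls, String methodName) {
        List<String> result = new ArrayList<String>();
        for (Method m : cls.getMethods()) {
            if (m.getName().equals(methodName)) {
                result.add(methodName + Arrays.toString(m.getParameterTypes()));
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Defaults match the class and method named in the exception above.
        String className = args.length > 0 ? args[0] : "org.apache.lucene.document.Document";
        String methodName = args.length > 1 ? args[1] : "add";

        Class<?> cls = Class.forName(className);
        System.out.println(className + " loaded from " + loadedFrom(cls));
        for (String signature : signatures(cls, methodName)) {
            System.out.println("  " + signature);
        }
    }
}
```

Run it with exactly the classpath used for the failing command. If no add(...) line shows a parameter of type org.apache.lucene.document.Field, or the reported location is not the Lucene jar you expected, then the PDFBox classes were probably built against a different Lucene release than the one being loaded.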

If anyone knows what I am doing wrong, or knows of another easy method to extract text from a PDF file, I would certainly like to know. I can live without OpenOffice (for a while), but not being able to index PDF would be a Lucene show stopper.


thanks
jim s














