Re: Indexing is hung or doesn't complete

2010-10-13 Thread Bill Janssen
Ching wrote:
> I use PDFBox version 1.1.0; I did find a workaround now. Just wondering
> which tools do you use to extract text from pdf? Thanks.
Ching, in UpLib I use a patched version of xpdf which reports the bounding box and font information for each word (as well as the Unicode characters o

Re: Indexing is hung or doesn't complete

2010-10-13 Thread Ching
I use PDFBox version 1.1.0; I did find a workaround now. Just wondering which tools do you use to extract text from pdf? Thanks.
On Wed, Oct 13, 2010 at 11:36 AM, Fabiano Nunes wrote:
> What version of PDFBox are you running?
> PDFBox 0.72 does not work properly with some pdf documents. See more

Re: Indexing is hung or doesn't complete

2010-10-13 Thread Fabiano Nunes
What version of PDFBox are you running? PDFBox 0.72 does not work properly with some PDF documents. See more at https://issues.apache.org/jira/browse/PDFBOX-361. So, I wrote an extractor (a copy of the original, in fact) based on the trunk version (1.2.1, actually). Furthermore, this version is faster e

Re: Indexing is hung or doesn't complete

2010-10-13 Thread Ching
Hi, Thank you for your suggestions. I found the reason: PDFBox seems to have a problem parsing large documents (20MB). I have a few of them among those 2000 docs, and those are the ones throwing OutOfMemory errors. The app does exit, and the JVM died. I am running on a 32-bit machine. -- Ching On

Re: Indexing is hung or doesn't complete

2010-10-13 Thread Senthil
Hi Ching, I do not think the issue is with Lucene for 2000 documents. As Anshum mentioned, give more details about your environment. Also check the CPU usage and the timestamp of the index .fdt file while it hangs. Using logs would help to identify the real cause. I used to work with Lucene 2.4 and recently 3.0.2. No sim

Re: Indexing is hung or doesn't complete

2010-10-12 Thread Anshum
Hi Ching, Does the app exit, or does it hang and stay there? That is, does the JVM stay alive and idle? Also, can you make sure that it's not PDFBox? For example, try commenting out the IndexWriter part and just read the PDFs; does that work fine? Can you also post info on your environment? Index Size? Lucene Version
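
Anshum's suggestion (run the PDF reading on its own, with no IndexWriter involved) could be sketched roughly as below. This is only an illustration, not code from anyone in the thread: `ExtractionCheck`, `TextExtractor`, `findProblemFiles` and the `maxBytes` cutoff are all hypothetical names, and the extractor is a placeholder for whatever PDF library call is actually in use. The size cutoff reflects Ching's report that the roughly 20MB documents were the ones running out of memory.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Runs text extraction alone (no indexing) to find the files that cause trouble. */
class ExtractionCheck {

    /** Hypothetical hook: wrap the real PDF library call behind this interface. */
    interface TextExtractor {
        String extract(File pdf) throws Exception;
    }

    /**
     * Returns the files that were skipped for being larger than maxBytes
     * or that failed during extraction. Everything else extracted cleanly.
     */
    static List<File> findProblemFiles(List<File> pdfs, long maxBytes, TextExtractor extractor) {
        List<File> problems = new ArrayList<>();
        for (File pdf : pdfs) {
            // Assumption: very large files are set aside rather than parsed.
            if (pdf.length() > maxBytes) {
                System.out.printf("%s: skipped, %d bytes%n", pdf.getName(), pdf.length());
                problems.add(pdf);
                continue;
            }
            long start = System.currentTimeMillis();
            try {
                String text = extractor.extract(pdf);
                long elapsed = System.currentTimeMillis() - start;
                System.out.printf("%s: %d chars in %d ms%n", pdf.getName(), text.length(), elapsed);
            } catch (Exception | OutOfMemoryError e) {
                // Catching OutOfMemoryError is best effort only; the JVM may
                // already be in a bad state, but it lets the run name the file.
                System.out.printf("%s: failed with %s%n", pdf.getName(), e);
                problems.add(pdf);
            }
        }
        return problems;
    }
}
```

If the files reported here match the ones that break the full indexing run, the problem is in extraction rather than in Lucene. On a 32-bit JVM the usable heap is limited, so raising `-Xmx` may help only up to a point.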