Ching wrote:
> I use PDFBox version 1.1.0; I did find a workaround now. Just wondering
> which tools do you use to extract text from pdf? Thanks.
Ching, in UpLib I use a patched version of xpdf which reports the
bounding box and font information for each word (as well as the Unicode
characters o
On Wed, Oct 13, 2010 at 11:36 AM, Fabiano Nunes wrote:
What version of PDFBox are you running?
PDFBox 0.72 does not work properly with some pdf documents. See more in
https://issues.apache.org/jira/browse/PDFBOX-361.
So, I wrote an extractor (a copy of the original, in fact) based on the trunk
version (1.2.1, actually). Furthermore, this version is faster e
Hi,
Thank you for your suggestions. I found the cause: PDFBox seems to have
problems parsing large documents (around 20MB). I have a few of them among
those 2000 docs, and those are the ones throwing OutOfMemory errors. The
app does exit, and the JVM dies. I am running on a 32-bit machine.
-- Ching
On
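Given the finding above that a few PDFs of around 20MB are the ones running out of memory, one option is to set the large files aside before extraction and handle them separately (for example, in a run with a bigger heap). A minimal sketch, assuming an arbitrary 10MB cutoff; the class and method names are hypothetical, and nothing here calls PDFBox:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

class LargePdfGuard {
    // Assumption: 10MB is an arbitrary cutoff, tune it for your heap size.
    static final long DEFAULT_MAX_BYTES = 10L * 1024 * 1024;

    /** True when a file of sizeBytes is over the cutoff. */
    static boolean exceeds(long sizeBytes, long maxBytes) {
        return sizeBytes > maxBytes;
    }

    /** Returns the files that are over the cutoff, in input order. */
    static List<File> oversized(List<File> files, long maxBytes) {
        List<File> result = new ArrayList<File>();
        for (File f : files) {
            if (exceeds(f.length(), maxBytes)) {
                result.add(f);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Shows how much heap the JVM may use, which is limited on a 32-bit machine.
        System.out.println("Max heap (bytes): " + Runtime.getRuntime().maxMemory());
        if (args.length == 0) {
            System.out.println("usage: java LargePdfGuard <directory>");
            return;
        }
        File[] listed = new File(args[0]).listFiles();
        List<File> files = new ArrayList<File>();
        if (listed != null) {
            for (File f : listed) {
                files.add(f);
            }
        }
        for (File f : oversized(files, DEFAULT_MAX_BYTES)) {
            System.out.println("Large file: " + f + " (" + f.length() + " bytes)");
        }
    }
}
```

Running it against the document directory lists the files over the cutoff, so they can be skipped or indexed in a separate pass.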
Hi Ching
I do not think this is an issue with Lucene at 2000 documents. As Anshum
mentioned, please give more details about your environment.
Also check the CPU usage and the timestamp of the index .fdt file while it
hangs. Logs would also help to identify the real cause. I used to work with
Lucene 2.4 and recently 3.0.2. No sim
Hi Ching,
Does the app exit, or does it hang and stay there? That is, does the JVM
stay alive and idle?
Also, can you make sure that it's not PDFBox? For example, try commenting out
the IndexWriter part and just read the PDFs; does that work fine?
Can you also post info on your environment?
Index Size? Lucene Version
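The isolation test suggested above (read the PDFs with the IndexWriter part commented out) could be sketched roughly as follows. This is only a diagnostic harness under assumptions: the `Extractor` interface is a hypothetical hook where the real PDF text extraction would be plugged in, and no Lucene or PDFBox calls appear in it.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

class ExtractOnlyCheck {
    /** Hypothetical hook: plug the real PDF text extraction in here. */
    interface Extractor {
        String extract(File pdf) throws Exception;
    }

    /** Runs the extractor over every file, with no indexing, and returns the files that failed. */
    static List<File> findFailures(List<File> pdfs, Extractor extractor) {
        List<File> failed = new ArrayList<File>();
        for (File pdf : pdfs) {
            try {
                extractor.extract(pdf);
            } catch (OutOfMemoryError e) {
                // Catching OutOfMemoryError is only reasonable in a throwaway diagnostic like this.
                System.err.println("OutOfMemoryError on " + pdf + " (" + pdf.length() + " bytes)");
                failed.add(pdf);
            } catch (Exception e) {
                System.err.println("Failed on " + pdf + ": " + e);
                failed.add(pdf);
            }
        }
        return failed;
    }
}
```

If the same files fail here with no indexing involved, the problem is on the extraction side and not in the index.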