use
> PDFBox to extract text. I prefer poppler tools.
>
> On Wed, Oct 13, 2010 at 2:22 PM, Ching wrote:
>
> > Hi,
> >
> > Thank you for your suggestions. I found the reason: PDFBox
> > seems to have problems parsing large documents (20MB). I have a
Hi,
Thank you for your suggestions. I found the reason: PDFBox seems to have
problems parsing large documents (20MB). I have a few of them among those
2000 docs, and those are the ones throwing OutOfMemory errors. The app does
exit, and the JVM dies. I am running on a 32-bit machine.
-- Ching
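
A minimal sketch of how the indexing loop described in this thread could be guarded so that a few oversized PDFs do not end the whole run. It is plain Java with no PDFBox or Lucene calls: `GuardedIndexer`, `FileHandler`, and `maxBytes` are hypothetical names, and the handler stands in for whatever extraction and indexing step the application performs. Catching `OutOfMemoryError` is a best-effort measure, not a guarantee of recovery; raising the heap limit with the JVM's `-Xmx` option is the other usual lever, though a 32-bit JVM caps how far that can go.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

class GuardedIndexer {

    /** Hypothetical hook: stands in for "extract text, then add it to the index". */
    interface FileHandler {
        void handle(File file) throws Exception;
    }

    /**
     * Runs the handler on each file and returns the files that were skipped,
     * either because they exceed maxBytes or because the handler failed.
     */
    static List<File> indexAll(List<File> files, long maxBytes, FileHandler handler) {
        List<File> skipped = new ArrayList<>();
        for (File file : files) {
            // Size guard: do not even try documents above the threshold.
            if (file.length() > maxBytes) {
                skipped.add(file);
                continue;
            }
            try {
                handler.handle(file);
            } catch (Exception | OutOfMemoryError e) {
                // Best effort: record the failure and move on to the next file.
                System.err.println("Skipping " + file + ": " + e);
                skipped.add(file);
            }
        }
        return skipped;
    }
}
```

The returned list makes the failures visible, so a run that stops short can be traced to specific documents instead of ending without any error message.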
Hi All,
Can anyone help with this issue? I have about 2000 PDF files; I use
PDFBox to extract their text, then index them in a for loop. The indexing
stopped after the .fdt file reached 7,061 KB in size. There is no error;
the indexing simply stopped. Thanks in advance for any help.
Ching
More problems with NumericRangeQuery when combining it with BooleanQuery. Here
is the problem, please help.
1. I have a Date field that is indexed as a long.
2. In the search, I need to exclude some time periods. I used a
BooleanQuery to combine those excluded time periods, like below:
BooleanQuer
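
The message is cut off before the query code, so the actual failure is not known. As a hedged illustration of the setup in points 1 and 2, one way to exclude time periods is to convert them into the intervals that remain allowed, so that each allowed interval can become an ordinary long range clause. The sketch below is plain Java with no Lucene calls; `ExcludedPeriods` and `allowedRanges` are hypothetical names, all bounds are treated as inclusive, and each excluded period is assumed to satisfy start <= end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ExcludedPeriods {

    /**
     * Returns the inclusive [start, end] intervals inside [min, max] that are
     * left over once every excluded [start, end] period has been removed.
     */
    static List<long[]> allowedRanges(long min, long max, List<long[]> excluded) {
        List<long[]> sorted = new ArrayList<>(excluded);
        sorted.sort(Comparator.comparingLong((long[] r) -> r[0]));

        List<long[]> allowed = new ArrayList<>();
        long cursor = min; // first timestamp not yet classified
        for (long[] ex : sorted) {
            if (ex[1] < cursor) continue;   // already behind the cursor
            if (ex[0] > max) break;         // starts after the searched span
            if (ex[0] > cursor) {
                allowed.add(new long[] {cursor, ex[0] - 1});
            }
            if (ex[1] >= max) return allowed; // nothing left after this period
            cursor = ex[1] + 1;
        }
        if (cursor <= max) {
            allowed.add(new long[] {cursor, max});
        }
        return allowed;
    }

    public static void main(String[] args) {
        List<long[]> excluded = Arrays.asList(new long[] {10, 20}, new long[] {50, 60});
        for (long[] r : allowedRanges(0, 100, excluded)) {
            System.out.println(r[0] + " .. " + r[1]);
        }
    }
}
```

With the sample input in `main`, the allowed intervals are 0 to 9, 21 to 49, and 61 to 100. Whether this is preferable to negative clauses depends on how many periods are excluded and on the part of the query that is missing above.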
Hi,
I have about 50 PDF documents, each around 10MB in size. I am using
PDFBox for parsing, and am wondering how I can index bookmarks with their
corresponding page information?
I use PDDocumentOutline to get each bookmark's title, but I only have
PDNamedDestination, which offers no page number info. C
at the indexing time? Or, is there any technology we
need to integrate, like those for data warehousing? Any comments or
pointers will be greatly appreciated.
Thanks
Ching-pei