Also, could you kill your process with -QUIT (on Linux; maybe there is
something analogous on Windows?) when you see the threads hanging?
That will give a stack dump for every thread.
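
For reference, the same kind of dump can also be produced from inside
the JVM. On Windows, pressing Ctrl+Break in the console window that runs
the JVM is the usual counterpart of kill -QUIT. The sketch below is only
an illustration: ThreadDumper and dumpAllThreads are hypothetical names,
not part of Lucene or of the original program. It relies on the standard
Thread.getAllStackTraces() call.

```java
import java.util.Map;

// Hypothetical helper (not part of Lucene): collects a stack dump for
// every live thread, similar in spirit to what kill -QUIT prints.
final class ThreadDumper {
    static String dumpAllThreads() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            // Header line: thread name and its current state.
            out.append('"').append(t.getName()).append('"')
               .append(" state=").append(t.getState()).append('\n');
            // One line per stack frame, innermost frame first.
            for (StackTraceElement frame : e.getValue()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }
}
```

Calling System.err.println(ThreadDumper.dumpAllThreads()) from a timer
or a monitoring thread when the workers appear stuck would show where
each one is blocked.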
Mike
Grant Ingersoll wrote:
Can you describe your process a bit more? Are you measuring just
the Lucene part or the whole ingestion part as well? If it's the
latter, how do you know the issue is in Lucene? PDF extraction is
annoying at best and highly problematic at its worst. Not saying it
isn't Lucene, but I've seen PDFBox and other extractors fail a lot
more than I've seen Lucene fail.
Are there any exceptions that you are seeing anywhere in your log
files?
If you do have extraction as part of the process, what happens if
you separate out extraction from indexing? Does it fail when you
just index raw text in this manner?
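
One way to separate the two stages is to run extraction as its own pass
that writes plain-text files, and only afterwards index those files.
The sketch below covers the first pass only and makes several
assumptions: TextExtractor is a hypothetical stand-in for whatever
extractor is in use (PDFBox in this thread), ExtractionPhase and
extractAll are made-up names, and the code targets a current JDK
(Java 8 or later) rather than the JDK 1.6 mentioned below.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the real extractor (for example a PDFBox wrapper).
interface TextExtractor {
    String extract(Path pdf) throws Exception;
}

final class ExtractionPhase {
    /**
     * Pass 1: extract each PDF to "<name>.txt" in outDir and return the
     * inputs whose extraction threw. Pass 2 (not shown) would index only
     * the .txt files, with no extractor involved.
     */
    static List<Path> extractAll(List<Path> pdfs, Path outDir, TextExtractor extractor) {
        try {
            Files.createDirectories(outDir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<Path> failed = new ArrayList<>();
        for (Path pdf : pdfs) {
            try {
                String text = extractor.extract(pdf);
                Path out = outDir.resolve(pdf.getFileName().toString() + ".txt");
                Files.write(out, text.getBytes(StandardCharsets.UTF_8));
            } catch (Exception e) {
                // Record the failure and keep going, so one bad PDF
                // does not stop the whole run.
                failed.add(pdf);
            }
        }
        return failed;
    }
}
```

If the second pass over the .txt files completes without hanging, the
problem is more likely in extraction than in Lucene.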
Cheers,
Grant
On Oct 23, 2008, at 12:16 PM, Sudarsan, Sithu D. wrote:
Hi,
We are trying to index a large collection of PDF documents, with sizes
varying from a few KB to a few GB. We use Lucene 2.3.2 with JDK 1.6.0_01
(with PDFBox for text extraction) on Windows as well as CentOS Linux.
We used the java -Xms and -Xmx options, both at 1080m, even though we
have 4 GB on Windows and 32 GB on Linux with sufficient swap space.
With just one thread, the indexing completes, though it takes time. To
speed it up, we tried a multi-threaded approach with one IndexWriter for
each thread. After all the threads finish their indexing, the indexes
are merged. With about 100 sample files and 10 threads, the program
works pretty well and it does speed up. But when we run it on a document
collection of about 25 GB, a couple of threads just hang, while the rest
complete their indexing. The program never exits gracefully, and the
threads that seem to have died prevent the final index merge from
taking place. The program needs to be manually terminated.
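
A possible way to find out which worker hangs, and to let the program
report it instead of waiting forever, is to submit each worker to an
ExecutorService and wait on its Future with a timeout. This is a sketch
under assumptions, not the original program: WorkerWatchdog and
runAndReportStuck are hypothetical names, each Callable stands for one
indexing worker, and the code assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs named workers and reports the ones that
// fail or do not finish in time.
final class WorkerWatchdog {
    static List<String> runAndReportStuck(Map<String, Callable<Void>> tasks,
                                          int poolSize, long timeoutMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        Map<String, Future<Void>> futures = new LinkedHashMap<>();
        for (Map.Entry<String, Callable<Void>> e : tasks.entrySet()) {
            futures.put(e.getKey(), pool.submit(e.getValue()));
        }
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, Future<Void>> e : futures.entrySet()) {
            try {
                // Wait at most timeoutMillis for this particular worker.
                e.getValue().get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException ex) {
                // cancel(true) interrupts the worker; code that ignores
                // interrupts will keep running regardless.
                e.getValue().cancel(true);
                problems.add(e.getKey() + ": timed out");
            } catch (ExecutionException ex) {
                problems.add(e.getKey() + ": failed with " + ex.getCause());
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
                problems.add(e.getKey() + ": interrupted while waiting");
                break;
            }
        }
        pool.shutdownNow();
        return problems;
    }
}
```

The final merge would then run only when the returned list is empty.
Otherwise the listed workers point at the input files worth examining.
A worker that ignores interruption can still keep the JVM alive, so this
helps with diagnosis more than with a guaranteed clean exit.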
We tried both SimpleAnalyzer and StandardAnalyzer, with similar
results.
Any useful tips / solutions welcome.
Thanks in advance,
Sithu Sudarsan
Graduate Research Assistant, UALR
& Visiting Researcher, CDRH/OSEL
[EMAIL PROTECTED]
[EMAIL PROTECTED]
--------------------------
Grant Ingersoll
Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans.
http://www.lucenebootcamp.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]