Also, could you kill your process with -QUIT (on Linux; maybe there is something analogous on Windows?) when you see the threads hanging? That will give a stack dump for every thread.
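
For reference, on Linux the signal is sent with `kill -QUIT <pid>`, and the JVM writes the dump to its own standard output. On Windows, pressing Ctrl+Break in the console that runs the JVM produces the same kind of dump. If neither is convenient, a similar dump can be produced from inside the program. The following is only a minimal sketch using the standard `Thread.getAllStackTraces()` call; the class and method names (`ThreadDump`, `dumpAllThreads`) are made up for illustration and are not part of Lucene or of the original program.

```java
import java.util.Map;

// Hypothetical helper, not part of Lucene: prints what every live thread is doing.
class ThreadDump {

    /** Builds a text dump with each live thread's name, state, and stack frames. */
    static String dumpAllThreads() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry
                : Thread.getAllStackTraces().entrySet()) {
            Thread t = entry.getKey();
            sb.append('"').append(t.getName()).append('"')
              .append(" state=").append(t.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpAllThreads());
    }
}
```

Calling `dumpAllThreads()` from a watchdog thread once the indexing threads appear stuck should show whether they are blocked inside text extraction or inside Lucene.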

Mike

Grant Ingersoll wrote:

Can you describe your process a bit more? Are you measuring just the Lucene part or the whole ingestion part as well? If it's the latter, how do you know the issue is in Lucene? PDF extraction is annoying at best and highly problematic at its worst. Not saying it isn't Lucene, but I've seen PDFBox and other extractors fail a lot more than I've seen Lucene fail.

Are there any exceptions that you are seeing anywhere in your log files?

If you do have extraction as part of the process, what happens if you separate out extraction from indexing? Does it fail when you just index raw text in this manner?

Cheers,
Grant


On Oct 23, 2008, at 12:16 PM, Sudarsan, Sithu D. wrote:


Hi,

We are trying to index a large collection of PDF documents, with sizes varying from a few KB to a few GB. We use Lucene 2.3.2 with JDK 1.6.0_01 (and PDFBox for text extraction) on Windows as well as CentOS Linux. We set the java -Xms and -Xmx options, both to 1080m, even though we have 4 GB on Windows and 32 GB on Linux with sufficient swap space.

With just one thread, indexing completes, though it takes time. To speed it up, we tried a multi-threaded approach with one IndexWriter per thread. After all the threads finish their indexing, the resulting indexes are merged. With about 100 sample files and 10 threads, the program works pretty well and it does speed up. But when we run it on a document collection of about 25 GB, a couple of threads just hang, while the rest complete their indexing. The program never exits gracefully, and the threads that seem to have died ensure that the final index merge does not take place. The program needs to be terminated manually.
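
One way to keep a stuck worker from blocking the final merge is to run the per-thread work through an `ExecutorService` and wait for each result with a timeout. The sketch below only illustrates that idea and is not what the program described above does. `BoundedWorkers` and `runAll` are hypothetical names, and each `Callable` stands in for "extract and index one batch of files". The timeout is coarse: it is counted from the moment the result is awaited, not from when the task starts, so a task still waiting in the queue can also be reported as timed out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs tasks on a fixed pool and reports the ones
// that failed or did not finish in time, instead of waiting forever.
class BoundedWorkers {

    /**
     * Submits every task, then waits up to timeoutMillis for each result in turn.
     * Returns the indices of tasks that threw, timed out, or were cut short
     * because the calling thread was interrupted.
     */
    static List<Integer> runAll(List<Callable<Void>> tasks, int threads, long timeoutMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Void>> futures = new ArrayList<Future<Void>>();
        for (Callable<Void> task : tasks) {
            futures.add(pool.submit(task));
        }

        List<Integer> failed = new ArrayList<Integer>();
        for (int i = 0; i < futures.size(); i++) {
            try {
                futures.get(i).get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // cancel(true) only interrupts the worker; code that ignores
                // interrupts will keep running in the background.
                futures.get(i).cancel(true);
                failed.add(i);
            } catch (ExecutionException e) {
                failed.add(i); // the task itself threw an exception
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the caller's interrupt
                failed.add(i);
            }
        }
        pool.shutdownNow(); // interrupt anything still running
        return failed;
    }
}
```

The caller can then merge the indexes of the tasks that completed and log the failed indices for a later retry. Because interruption is cooperative, a thread stuck inside extraction code may not actually stop, so a thread dump is still needed to find out where it hangs.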

We tried both SimpleAnalyzer and StandardAnalyzer, with similar results.

Any useful tips / solutions welcome.

Thanks in advance,
Sithu Sudarsan
Graduate Research Assistant, UALR
& Visiting Researcher, CDRH/OSEL

[EMAIL PROTECTED]
[EMAIL PROTECTED]


--------------------------
Grant Ingersoll
Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans.
http://www.lucenebootcamp.com


Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]