Thank you! But here's what I have.
Today I looked at the indexer in VisualVM, and I can now say definitely that the problem is memory: the resources (mostly Document fields) just don't go away. I tried different GCs (Parallel, CMS, the default one), and the behaviour is the same every time. Since I pass my Documents into the indexWriter and then forget about them (all my references are local-scope), I think the resources are stuck somewhere inside the writer.

What I would like to find out now is:
- how many threads does the indexWriter use?
- when does it flush segments to disk?
- can I tell when the indexWriter is done with my Document -- is addDocument() synchronous in that sense?
- do I need to call commit() frequently? (I also need to keep the segment size constant and use no merging.)
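To make the questions concrete, here is roughly the writer setup I have in mind. This is only a simplified sketch against a Lucene 4.x-era API; the Version constant, the analyzer and the paths are placeholders, not my real ones:

import java.io.File;
import java.io.PrintStream;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.NoMergePolicy;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class WriterSetup {
    public static IndexWriter openWriter(File indexDir) throws Exception {
        IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_45,
                new StandardAnalyzer(Version.LUCENE_45));

        // Flush a new segment once ~64 MB of documents are buffered, instead of
        // letting the in-memory buffers grow with the heap.
        cfg.setRAMBufferSizeMB(64.0);

        // No background merges at all, so segments keep the size they were flushed at.
        cfg.setMergePolicy(NoMergePolicy.COMPOUND_FILES);

        // The infoStream logs every flush and the per-thread (DWPT) activity,
        // which should answer the "how many threads / when does it flush" questions.
        cfg.setInfoStream(new PrintStream(new File(indexDir, "infostream.log"), "UTF-8"));

        return new IndexWriter(FSDirectory.open(indexDir), cfg);
    }
}

My current understanding is that it is the flush (RAM buffer size or maxBufferedDocs), not commit(), that releases the buffered documents, and that commit() is only about durability and visibility to readers -- please correct me if that is wrong.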
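And this is, in outline, what each worker does. Again a simplified sketch: parse() stands in for my real (much heavier) parser, the real documents have many more fields, and I ignore the Futures here only for brevity:

import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

public class IndexingWorkers {

    // Stand-in for the real parsing step, which produces a very large number of Strings.
    static String parse(String raw) {
        return raw.toLowerCase();
    }

    static void indexAll(final IndexWriter writer, List<String> rawDocs, int threads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (final String raw : rawDocs) {
            pool.submit(new Callable<Void>() {
                @Override
                public Void call() throws Exception {
                    Document doc = new Document();
                    doc.add(new TextField("body", parse(raw), Field.Store.NO));
                    // One document per call, no synchronization on the writer;
                    // after this returns, nothing in my code references doc any more.
                    writer.addDocument(doc);
                    return null;
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }
}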
--
Igor

23.11.2013, 20:29, "Daniel Penning" <dpenn...@gamona.de>:
> G1 and CMS are both tuned primarily for low pauses, which is typically
> preferred for searching an index. In this case I guess that indexing
> throughput is preferred, in which case ParallelGC might be the better
> choice.
>
> On 23.11.2013 17:15, Uwe Schindler wrote:
>> Hi,
>>
>> Maybe your heap size is just too big, so your JVM spends too much time
>> in GC? The setup you described in your last e-mail is the "officially
>> supported" setup :-) Lucene has no problem with that setup and can
>> index. Be sure:
>> - Don't give too much heap to your indexing app. Larger heaps create
>> much more GC load.
>> - Use a suitable garbage collector (e.g. the Java 7 G1 collector or the
>> Java 6 CMS collector). Other garbage collectors may do GCs in a single
>> thread ("stop-the-world").
>>
>> Uwe
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>
>>> -----Original Message-----
>>> From: Igor Shalyminov [mailto:ishalymi...@yandex-team.ru]
>>> Sent: Saturday, November 23, 2013 4:46 PM
>>> To: java-user@lucene.apache.org
>>> Subject: Re: Lucene multithreaded indexing problems
>>>
>>> So we return to the initially described setup: multiple parallel
>>> workers, each doing "parse + indexWriter.addDocument()" for single
>>> documents, with no synchronization on my side. This setup was also
>>> bad on memory consumption and thread blocking, as I reported.
>>>
>>> Or did I misunderstand you?
>>>
>>> --
>>> Igor
>>>
>>> 22.11.2013, 23:34, "Uwe Schindler" <u...@thetaphi.de>:
>>>> Hi,
>>>>
>>>> Don't use addDocuments(). That method is made for so-called block
>>>> indexing (where all the documents need to be in one block, for
>>>> block joins). Call addDocument() for each document, possibly from
>>>> many threads. That way Lucene can handle the multithreading better
>>>> and free memory early. There is really no need to use bulk adds;
>>>> they are solely for block joins, where the docs need to be
>>>> sequential and without gaps.
>>>>
>>>> Uwe
>>>>
>>>> Igor Shalyminov <ishalymi...@yandex-team.ru> wrote:
>>>>> - uwe@
>>>>>
>>>>> Thanks Uwe!
>>>>>
>>>>> I changed the logic so that my workers only parse input docs into
>>>>> Documents, and the indexWriter does addDocuments() by itself for
>>>>> chunks of 100 Documents.
>>>>> Unfortunately, the behaviour reproduces: memory usage keeps
>>>>> growing with the number of processed documents, and at some point
>>>>> the program runs very slowly, seemingly with only a single thread
>>>>> active. It happens after lots of parse/index cycles.
>>>>>
>>>>> The current instance is now in the "single-thread" phase with
>>>>> ~100% CPU and 8397M RES memory (the limit for the VM is -Xmx8G).
>>>>>
>>>>> My question is: when does addDocuments() release all the resources
>>>>> passed in (the Documents themselves)? Are the resources released
>>>>> when the call returns, or do I have to do indexWriter.commit()
>>>>> after, say, each chunk?
>>>>>
>>>>> --
>>>>> Igor
>>>>>
>>>>> 21.11.2013, 19:59, "Uwe Schindler" <u...@thetaphi.de>:
>>>>>> Hi,
>>>>>>
>>>>>> why are you doing this? Lucene's IndexWriter can handle
>>>>>> addDocument calls from multiple threads. And, since Lucene 4, it
>>>>>> will process them almost completely in parallel!
>>>>>> If you do the adds single-threaded, you are adding an extra
>>>>>> bottleneck to your application. If you synchronize on the
>>>>>> IndexWriter (which I hope you do not), things will go wrong, too.
>>>>>>
>>>>>> Uwe
>>>>>>
>>>>>> -----
>>>>>> Uwe Schindler
>>>>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>>>>> http://www.thetaphi.de
>>>>>> eMail: u...@thetaphi.de
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Igor Shalyminov [mailto:ishalymi...@yandex-team.ru]
>>>>>>> Sent: Thursday, November 21, 2013 4:45 PM
>>>>>>> To: java-user@lucene.apache.org
>>>>>>> Subject: Lucene multithreaded indexing problems
>>>>>>>
>>>>>>> Hello!
>>>>>>>
>>>>>>> I tried to perform indexing with multiple threads, using a
>>>>>>> FixedThreadPool of Callable workers.
>>>>>>> The main operation - parsing a single document and calling
>>>>>>> addDocument() on the index - is done by a single worker.
>>>>>>> Parsing a document produces a lot (really a lot) of Strings, and
>>>>>>> at the end of the worker's call() all of them go into the
>>>>>>> indexWriter.
>>>>>>> I use no merging; the resources are flushed to disk when the
>>>>>>> segment size limit is reached.
>>>>>>>
>>>>>>> The problem is that after a little while (when most of the heap
>>>>>>> memory is used) the indexer makes no progress, and the CPU load
>>>>>>> stays at a constant 100% (no difference whether there are 2
>>>>>>> threads or 32). So I think at some point garbage collection
>>>>>>> drags the whole indexing process down.
>>>>>>>
>>>>>>> Could you please give some advice on proper concurrent indexing
>>>>>>> with Lucene?
>>>>>>> Can there be "memory leaks" somewhere in the indexWriter? Maybe
>>>>>>> I have to perform some operation on the writer from time to time
>>>>>>> to release unused resources?
>>>>>>>
>>>>>>> --
>>>>>>> Best Regards,
>>>>>>> Igor
>>>>
>>>> --
>>>> Uwe Schindler
>>>> H.-H.-Meier-Allee 63, 28213 Bremen
>>>> http://www.thetaphi.de

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org