Thank you! But here's what I have.
Today I looked at the indexer in VisualVM, and I can now say definitely that the problem is memory: the resources (mostly Document fields) just don't go away. I tried different GCs (Parallel, CMS, the default one), and the behaviour is the same every time. Since I pass my Documents into the indexWriter and then forget about them (all my references are local-scope), I think the resources are stuck somewhere inside the writer.

What I would like to find out now is:
- how many threads does the indexWriter use?
- when does it flush segments to disk?
- can I tell when the indexWriter is done with my Document -- is addDocument() synchronous in that sense?
- do I need to call commit() frequently? (I also need to keep the segment size constant and use no merging.)
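To make the questions concrete, here is roughly the writer setup I have in mind. This is only a simplified sketch against a Lucene 4.x-era API; the Version constant, the analyzer and the paths are placeholders, not my real ones:

import java.io.File;
import java.io.PrintStream;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.NoMergePolicy;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class WriterSetup {
    public static IndexWriter openWriter(File indexDir) throws Exception {
        IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_45,
                new StandardAnalyzer(Version.LUCENE_45));

        // Flush a new segment once ~64 MB of documents are buffered, instead of
        // letting the in-memory buffers grow with the heap.
        cfg.setRAMBufferSizeMB(64.0);

        // No background merges at all, so segments keep the size they were flushed at.
        cfg.setMergePolicy(NoMergePolicy.COMPOUND_FILES);

        // The infoStream logs every flush and the per-thread (DWPT) activity,
        // which should answer the "how many threads / when does it flush" questions.
        cfg.setInfoStream(new PrintStream(new File(indexDir, "infostream.log"), "UTF-8"));

        return new IndexWriter(FSDirectory.open(indexDir), cfg);
    }
}

My current understanding is that it is the flush (RAM buffer size or maxBufferedDocs), not commit(), that releases the buffered documents, and that commit() is only about durability and visibility to readers -- please correct me if that is wrong.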
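And this is, in outline, what each worker does. Again a simplified sketch: parse() stands in for my real (much heavier) parser, the real documents have many more fields, and I ignore the Futures here only for brevity:

import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

public class IndexingWorkers {

    // Stand-in for the real parsing step, which produces a very large number of Strings.
    static String parse(String raw) {
        return raw.toLowerCase();
    }

    static void indexAll(final IndexWriter writer, List<String> rawDocs, int threads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (final String raw : rawDocs) {
            pool.submit(new Callable<Void>() {
                @Override
                public Void call() throws Exception {
                    Document doc = new Document();
                    doc.add(new TextField("body", parse(raw), Field.Store.NO));
                    // One document per call, no synchronization on the writer;
                    // after this returns, nothing in my code references doc any more.
                    writer.addDocument(doc);
                    return null;
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }
}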
--
Igor

23.11.2013, 20:29, "Daniel Penning" <dpenn...@gamona.de>:
> G1 and CMS are both tuned primarily for low pauses, which is typically
> preferred for searching an index. In this case I guess that indexing
> throughput is preferred, in which case ParallelGC might be the better
> choice.
>
> On 23.11.2013 17:15, Uwe Schindler wrote:
>> Hi,
>>
>> Maybe your heap size is just too big, so your JVM spends too much time
>> in GC? The setup you described in your last e-mail is the "officially
>> supported" setup :-) Lucene has no problem with that setup and can
>> index. Be sure:
>> - Don't give too much heap to your indexing app. Larger heaps create
>> much more GC load.
>> - Use a suitable garbage collector (e.g. the Java 7 G1 collector or the
>> Java 6 CMS collector). Other garbage collectors may do GCs in a single
>> thread ("stop-the-world").
>>
>> Uwe
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>
>>> -----Original Message-----
>>> From: Igor Shalyminov [mailto:ishalymi...@yandex-team.ru]
>>> Sent: Saturday, November 23, 2013 4:46 PM
>>> To: java-user@lucene.apache.org
>>> Subject: Re: Lucene multithreaded indexing problems
>>>
>>> So we return to the initially described setup: multiple parallel
>>> workers, each doing "parse + indexWriter.addDocument()" for single
>>> documents, with no synchronization on my side. This setup was also
>>> bad on memory consumption and thread blocking, as I reported.
>>>
>>> Or did I misunderstand you?
>>>
>>> --
>>> Igor
>>>
>>> 22.11.2013, 23:34, "Uwe Schindler" <u...@thetaphi.de>:
>>>> Hi,
>>>>
>>>> Don't use addDocuments(). That method is made for so-called block
>>>> indexing (where all the documents need to be in one block, for
>>>> block joins). Call addDocument() for each document, possibly from
>>>> many threads. That way Lucene can handle the multithreading better
>>>> and free memory early. There is really no need to use bulk adds;
>>>> they are solely for block joins, where the docs need to be
>>>> sequential and without gaps.
>>>>
>>>> Uwe
>>>>
>>>> Igor Shalyminov <ishalymi...@yandex-team.ru> wrote:
>>>>> - uwe@
>>>>>
>>>>> Thanks Uwe!
>>>>>
>>>>> I changed the logic so that my workers only parse input docs into
>>>>> Documents, and the indexWriter does addDocuments() by itself for
>>>>> chunks of 100 Documents.
>>>>> Unfortunately, the behaviour reproduces: memory usage keeps
>>>>> growing with the number of processed documents, and at some point
>>>>> the program runs very slowly, seemingly with only a single thread
>>>>> active. It happens after lots of parse/index cycles.
>>>>>
>>>>> The current instance is now in the "single-thread" phase with
>>>>> ~100% CPU and 8397M RES memory (the limit for the VM is -Xmx8G).
>>>>>
>>>>> My question is: when does addDocuments() release all the resources
>>>>> passed in (the Documents themselves)? Are the resources released
>>>>> when the call returns, or do I have to do indexWriter.commit()
>>>>> after, say, each chunk?
>>>>>
>>>>> --
>>>>> Igor
>>>>>
>>>>> 21.11.2013, 19:59, "Uwe Schindler" <u...@thetaphi.de>:
>>>>>> Hi,
>>>>>>
>>>>>> why are you doing this? Lucene's IndexWriter can handle
>>>>>> addDocument calls from multiple threads. And, since Lucene 4, it
>>>>>> will process them almost completely in parallel!
>>>>>> If you do the adds single-threaded, you are adding an extra
>>>>>> bottleneck to your application. If you synchronize on the
>>>>>> IndexWriter (which I hope you do not), things will go wrong, too.
>>>>>>
>>>>>> Uwe
>>>>>>
>>>>>> -----
>>>>>> Uwe Schindler
>>>>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>>>>> http://www.thetaphi.de
>>>>>> eMail: u...@thetaphi.de
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Igor Shalyminov [mailto:ishalymi...@yandex-team.ru]
>>>>>>> Sent: Thursday, November 21, 2013 4:45 PM
>>>>>>> To: java-user@lucene.apache.org
>>>>>>> Subject: Lucene multithreaded indexing problems
>>>>>>>
>>>>>>> Hello!
>>>>>>>
>>>>>>> I tried to perform indexing with multiple threads, using a
>>>>>>> FixedThreadPool of Callable workers.
>>>>>>> The main operation - parsing a single document and calling
>>>>>>> addDocument() on the index - is done by a single worker.
>>>>>>> Parsing a document produces a lot (really a lot) of Strings, and
>>>>>>> at the end of the worker's call() all of them go into the
>>>>>>> indexWriter.
>>>>>>> I use no merging; the resources are flushed to disk when the
>>>>>>> segment size limit is reached.
>>>>>>>
>>>>>>> The problem is that after a little while (when most of the heap
>>>>>>> memory is used) the indexer makes no progress, and the CPU load
>>>>>>> stays at a constant 100% (no difference whether there are 2
>>>>>>> threads or 32). So I think at some point garbage collection
>>>>>>> drags the whole indexing process down.
>>>>>>>
>>>>>>> Could you please give some advice on proper concurrent indexing
>>>>>>> with Lucene?
>>>>>>> Can there be "memory leaks" somewhere in the indexWriter? Maybe
>>>>>>> I have to perform some operation on the writer from time to time
>>>>>>> to release unused resources?
>>>>>>>
>>>>>>> --
>>>>>>> Best Regards,
>>>>>>> Igor
>>>>
>>>> --
>>>> Uwe Schindler
>>>> H.-H.-Meier-Allee 63, 28213 Bremen
>>>> http://www.thetaphi.de

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org