Yes, with the default of 10 it performs very much better. I didn't take into account that DIH uses updateDocument for adding new documents, but after thinking about the "why" I assume this is because it cannot know whether a document already exists in the index. Conclusion: using DIH together with a high segmentsPerTier value is a killer.
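For the record, a minimal sketch of what I understand the difference to be at the IndexWriter level (against the Lucene 4.10 API; the class name and index path are made up for illustration):

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class AddVsUpdate {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig iwc = new IndexWriterConfig(
            Version.LUCENE_4_10_4, new StandardAnalyzer(Version.LUCENE_4_10_4));
        IndexWriter writer = new IndexWriter(
            FSDirectory.open(new File("/tmp/testindex")), iwc);

        Document doc = new Document();
        doc.add(new StringField("id", "ftoxfordilej:ar.1770.x.x.13.x.x.u1",
            Field.Store.YES));

        // Pure append: no delete is buffered, so applyDeletes has
        // nothing to do at flush time.
        writer.addDocument(doc);

        // Add-or-replace: buffers a delete-by-term for the id even if
        // the id is brand new, because IndexWriter cannot know that
        // in advance.
        writer.updateDocument(
            new Term("id", "ftoxfordilej:ar.1770.x.x.13.x.x.u1"), doc);

        writer.close();
    }
}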
One question still remains about the messages in the infoStream. I have lines saying:

BD 0 [...] push deletes 24345 deleted terms (unique count=24345) 24345 deleted queries bytesUsed=2313024 delGen=2265 packetCount=69 totBytesUsed=262526720
...
BD 0 [...] seg=_xt(4.10.4):C50486 segGen=2370 segDeletes=[ 97145 deleted terms (unique count=0) 97142 deleted queries bytesUsed=3108576]; coalesced deletes=[CoalescedUpdates(termSets=1,queries=75721,numericDVUpdates=0,binaryDVUpdates=0)] newDelCount=0

Do you know what these deleted terms and deleted queries are?
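My working guess, as a minimal sketch against the plain IndexWriter API (hedged; I have not traced Solr's exact call path, and the id value is made up):

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;

class BufferedDeleteKinds {
    // "deleted terms" count buffered delete-by-term entries: one is
    // buffered for every updateDocument (add-or-replace) call.
    // "deleted queries" count buffered delete-by-query entries: one is
    // buffered for every deleteDocuments(Query) call.
    static void demo(IndexWriter writer, Document doc) throws IOException {
        writer.updateDocument(new Term("id", "some-id"), doc);            // -> deleted term
        writer.deleteDocuments(new TermQuery(new Term("id", "some-id"))); // -> deleted query
    }
}

Either kind sits in RAM until applyDeletes resolves it against all existing segments, which is the phase that was eating the indexing time here.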
Best regards,
Bernd

On 28.07.2016 17:34, Michael McCandless wrote:
> Hmm, your merge policy changes are dangerous: they will cause too many
> segments in the index, which makes applying deletes take longer.
>
> Can you revert that and re-test?
>
> I'm not sure why DIH is using updateDocument instead of addDocument ...
> maybe ask on the solr-user list?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Thu, Jul 28, 2016 at 10:07 AM, Bernd Fehling <
> bernd.fehl...@uni-bielefeld.de> wrote:
>
>> Currently I use concurrent DIH but will write some SolrJ for testing,
>> or even as a replacement for DIH.
>> I don't know what's behind DIH if only documents are added.
>>
>> I haven't tried any newer release yet, but after reading LUCENE-6161
>> I really should, at least a version > 5.1, maybe before writing some
>> SolrJ.
>>
>> Yes, IndexWriterConfig is changed from the defaults:
>> <indexConfig>
>>   <maxIndexingThreads>8</maxIndexingThreads>
>>   <ramBufferSizeMB>1024</ramBufferSizeMB>
>>   <maxBufferedDocs>-1</maxBufferedDocs>
>>   <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>>     <int name="maxMergeAtOnce">8</int>
>>     <int name="segmentsPerTier">100</int>
>>     <int name="maxMergedSegmentMB">512</int>
>>   </mergePolicy>
>>   <mergeFactor>8</mergeFactor>
>>   <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
>>   <lockType>${solr.lock.type:native}</lockType>
>>   ...
>> </indexConfig>
>>
>> A unique id as example: "ftoxfordilej:ar.1770.x.x.13.x.x.u1"
>> Somewhere between 20 and 50 characters in length.
>>
>> Thanks for your help,
>> Bernd
>>
>> On 28.07.2016 15:35, Michael McCandless wrote:
>>> Hmm, not good.
>>>
>>> If you are really only adding documents, you should be using
>>> IndexWriter.addDocument, which won't buffer any deleted terms, and
>>> that method call should be a no-op. It also makes flushes more
>>> efficient, since all of your indexing buffer goes to the added
>>> documents, not buffered delete terms. Are you using updateDocument?
>>>
>>> Can you reproduce this slowness on a newer release? There have been
>>> performance issues in this method fixed in newer releases, e.g.
>>> https://issues.apache.org/jira/browse/LUCENE-6161
>>>
>>> Have you changed any IndexWriterConfig settings from the defaults?
>>>
>>> What are your unique id fields like? How many bytes in length?
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>> On Thu, Jul 28, 2016 at 5:01 AM, Bernd Fehling <
>>> bernd.fehl...@uni-bielefeld.de> wrote:
>>>
>>>> While trying to get higher indexing performance it turned out that
>>>> BufferedUpdateStreams is breaking indexing performance:
>>>> public synchronized ApplyDeletesResult applyDeletesAndUpdates(...)
>>>>
>>>> In IndexWriterConfig I have setRAMBufferSizeMB=1024, and the Lucene
>>>> 4.10.4 API states:
>>>> "Determines the amount of RAM that may be used for buffering added
>>>> documents and deletions before they are flushed to the Directory.
>>>> Generally for faster indexing performance it's best to flush by RAM
>>>> usage instead of document count and use as large a RAM buffer as
>>>> you can."
>>>>
>>>> Also setMaxBufferedDocs=-1 and setMaxBufferedDeleteTerms=-1.
>>>>
>>>> BD 0 [Wed Jul 27 13:42:03 GMT+01:00 2016; Thread-27890]: applyDeletes:
>>>> infos=...
>>>> BD 0 [Wed Jul 27 14:38:55 GMT+01:00 2016; Thread-27890]: applyDeletes
>>>> took 3411845 msec
>>>>
>>>> About 56 minutes of no indexing, only applying deletes.
>>>> What is it deleting?
>>>>
>>>> The bigger the index gets, the longer this takes; currently 2.5
>>>> hours of waiting.
>>>> I'm adding 96 million docs with unique ids, no duplicates, only
>>>> adds, no deletes.
>>>>
>>>> Any suggestions for a config that _really_ goes for high-performance
>>>> indexing?
>>>>
>>>> Best regards,
>>>> Bernd

--
*************************************************************
Bernd Fehling                    Bielefeld University Library
Dipl.-Inform. (FH)                LibTec - Library Technology
Universitätsstr. 25                  and Knowledge Management
33615 Bielefeld
Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de

BASE - Bielefeld Academic Search Engine - www.base-search.net
*************************************************************

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org