After updating to version 5.5.3 it looks good now. Thanks a lot for your help and advice.
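For anyone finding this thread later: the core distinction discussed below — addDocument appends only, while updateDocument is an atomic delete-then-add that buffers a delete term per call — can be sketched roughly like this. This is an untested sketch against the Lucene 5.x API; the class name and the use of RAMDirectory/StandardAnalyzer are illustrative choices, not from the thread.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.RAMDirectory;

public class AddVsUpdate {
    public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(new RAMDirectory(),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("id", "ftoxfordilej:ar.1770.x.x.13.x.x.u1",
                    Field.Store.YES));

            // Append-only: buffers no delete terms; applyDeletes stays a no-op.
            writer.addDocument(doc);

            // Upsert: atomic "delete by term, then add". Each call buffers a
            // delete term that applyDeletes must later resolve against every
            // segment -- this is what DIH effectively does under the hood.
            writer.updateDocument(
                    new Term("id", "ftoxfordilej:ar.1770.x.x.13.x.x.u1"), doc);
        }
    }
}
```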
Best regards,
Bernd

On 29.07.2016 at 15:04, Michael McCandless wrote:
> The deleted terms accumulate whenever you use updateDocument(Term, Doc), or
> when you do deleteDocuments(Term).
>
> Deleted queries are when you delete by query, but I don't think DIH would
> be doing that unless you asked it to ... maybe a Solr user/dev knows better?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Fri, Jul 29, 2016 at 3:21 AM, Bernd Fehling <
> bernd.fehl...@uni-bielefeld.de> wrote:
>
>> Yes, with the default of 10 it performs very much better.
>> I didn't take into account that DIH uses updateDocument for adding new
>> documents, but after thinking about the "why" I assume this might be
>> because you don't know whether a document already exists in the index.
>> Conclusion: using DIH and setting segmentsPerTier to a high value is a
>> killer.
>>
>> One question still remains about messages in the INFOSTREAM. I have
>> lines saying:
>> BD 0 [...] push deletes 24345 deleted terms (unique count=24345) 24345
>> deleted queries bytesUsed=2313024 delGen=2265 packetCount=69
>> totBytesUsed=262526720
>> ...
>> BD 0 [...] seg=_xt(4.10.4):C50486 segGen=2370 segDeletes=[97145 deleted
>> terms (unique count=0) 97142 deleted queries bytesUsed=3108576];
>> coalesced deletes=
>> [CoalescedUpdates(termSets1,queries=75721,numericDVUpdates=0,binaryDVUpdates=0)]
>> newDelCount=0
>>
>> Do you know what these deleted terms and deleted queries are?
>>
>> Best regards,
>> Bernd
>>
>> On 28.07.2016 at 17:34, Michael McCandless wrote:
>>> Hmm, your merge policy changes are dangerous: they will cause too many
>>> segments in the index, which makes it take longer to apply deletes.
>>>
>>> Can you revert that and re-test?
>>>
>>> I'm not sure why DIH is using updateDocument instead of addDocument ...
>>> maybe ask on the solr-user list?
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>> On Thu, Jul 28, 2016 at 10:07 AM, Bernd Fehling <
>>> bernd.fehl...@uni-bielefeld.de> wrote:
>>>
>>>> Currently I use concurrent DIH but will write some SolrJ for testing,
>>>> or even as a replacement for DIH.
>>>> I don't know what's behind DIH if only documents are added.
>>>>
>>>> I haven't tried any newer release yet, but after reading LUCENE-6161
>>>> I really should -- at least a version > 5.1, maybe before writing
>>>> some SolrJ.
>>>>
>>>> Yes, IndexWriterConfig is changed from the defaults:
>>>> <indexConfig>
>>>>   <maxIndexingThreads>8</maxIndexingThreads>
>>>>   <ramBufferSizeMB>1024</ramBufferSizeMB>
>>>>   <maxBufferedDocs>-1</maxBufferedDocs>
>>>>   <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>>>>     <int name="maxMergeAtOnce">8</int>
>>>>     <int name="segmentsPerTier">100</int>
>>>>     <int name="maxMergedSegmentMB">512</int>
>>>>   </mergePolicy>
>>>>   <mergeFactor>8</mergeFactor>
>>>>   <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
>>>>   <lockType>${solr.lock.type:native}</lockType>
>>>>   ...
>>>> </indexConfig>
>>>>
>>>> A unique id as an example: "ftoxfordilej:ar.1770.x.x.13.x.x.u1"
>>>> Somewhere between 20 and 50 characters in length.
>>>>
>>>> Thanks for your help,
>>>> Bernd
>>>>
>>>> On 28.07.2016 at 15:35, Michael McCandless wrote:
>>>>> Hmm, not good.
>>>>>
>>>>> If you are really only adding documents, you should be using
>>>>> IndexWriter.addDocument, which won't buffer any deleted terms, and the
>>>>> applyDeletes call should then be a no-op. It also makes flushes more
>>>>> efficient, since all of your indexing buffer goes to the added
>>>>> documents, not buffered delete terms. Are you using updateDocument?
>>>>>
>>>>> Can you reproduce this slowness on a newer release?
>>>>> There have been performance issues in this method fixed in newer
>>>>> releases, e.g.
>>>>> https://issues.apache.org/jira/browse/LUCENE-6161
>>>>>
>>>>> Have you changed any IndexWriterConfig settings from the defaults?
>>>>>
>>>>> What are your unique id fields like? How many bytes in length?
>>>>>
>>>>> Mike McCandless
>>>>>
>>>>> http://blog.mikemccandless.com
>>>>>
>>>>> On Thu, Jul 28, 2016 at 5:01 AM, Bernd Fehling <
>>>>> bernd.fehl...@uni-bielefeld.de> wrote:
>>>>>
>>>>>> While trying to get higher indexing performance it turned out that
>>>>>> BufferedUpdateStreams is breaking indexing performance:
>>>>>> public synchronized ApplyDeletesResult applyDeletesAndUpdates(...)
>>>>>>
>>>>>> In IndexWriterConfig I have setRAMBufferSizeMB=1024, and the Lucene
>>>>>> 4.10.4 API states:
>>>>>> "Determines the amount of RAM that may be used for buffering added
>>>>>> documents and deletions before they are flushed to the Directory.
>>>>>> Generally for faster indexing performance it's best to flush by RAM
>>>>>> usage instead of document count and use as large a RAM buffer as
>>>>>> you can."
>>>>>>
>>>>>> Also setMaxBufferedDocs=-1 and setMaxBufferedDeleteTerms=-1.
>>>>>>
>>>>>> BD 0 [Wed Jul 27 13:42:03 GMT+01:00 2016; Thread-27890]: applyDeletes:
>>>>>> infos=...
>>>>>> BD 0 [Wed Jul 27 14:38:55 GMT+01:00 2016; Thread-27890]: applyDeletes
>>>>>> took 3411845 msec
>>>>>>
>>>>>> About 56 minutes of no indexing, only applying deletes.
>>>>>> What is it deleting?
>>>>>>
>>>>>> As the index gets bigger the time gets longer; currently 2.5 hours of
>>>>>> waiting.
>>>>>> I'm adding 96 million docs with unique ids, no duplicates, only adds,
>>>>>> no deletes.
>>>>>>
>>>>>> Any suggestions which config _really_ goes for high-performance
>>>>>> indexing?
>>>>>>
>>>>>> Best regards,
>>>>>> Bernd
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>> --
>> *************************************************************
>> Bernd Fehling              Bielefeld University Library
>> Dipl.-Inform. (FH)         LibTec - Library Technology
>> Universitätsstr. 25        and Knowledge Management
>> 33615 Bielefeld
>> Tel. +49 521 106-4060      bernd.fehling(at)uni-bielefeld.de
>>
>> BASE - Bielefeld Academic Search Engine - www.base-search.net
>> *************************************************************
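The thread's conclusion — flush by a large RAM buffer, keep segmentsPerTier at its default of 10, and use addDocument for append-only loads — would look roughly like this when configuring IndexWriter directly (e.g. from SolrJ/Lucene code instead of solrconfig.xml). An untested sketch against the Lucene 5.x API; the class name, the StandardAnalyzer choice, and the helper method are illustrative, not from the thread.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

public class AddOnlyIndexConfig {
    public static IndexWriterConfig build() {
        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());

        // Flush by RAM usage, not by document count (matches the 1024 MB
        // buffer used in the thread).
        iwc.setRAMBufferSizeMB(1024);
        iwc.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);

        TieredMergePolicy tmp = new TieredMergePolicy();
        // Keep the default of 10: segmentsPerTier=100 multiplied the number
        // of live segments, which is what stretched applyDeletes to ~56 min.
        tmp.setSegmentsPerTier(10.0);
        tmp.setMaxMergeAtOnce(10);
        iwc.setMergePolicy(tmp);
        return iwc;
    }
}
```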