The deleted terms accumulate whenever you use updateDocument(Term, Doc), or when you do deleteDocuments(Term).
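As a rough illustration of that bookkeeping, here is a toy model in plain Java. It is not Lucene code: ToyIndexWriter and all of its methods are hypothetical stand-ins, assuming only what is described above (updateDocument buffers a delete term for the id and then adds the document, addDocument buffers nothing).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Toy model only, NOT the Lucene IndexWriter. It mimics just the counting
 * of buffered delete terms so the difference between the two calls is visible.
 */
class ToyIndexWriter {
    private final List<String> docs = new ArrayList<>();
    private final Set<String> uniqueDeleteTerms = new HashSet<>();
    private int bufferedDeleteTerms = 0;

    /** Plain add: nothing is buffered on the delete side. */
    void addDocument(String id) {
        docs.add(id);
    }

    /** Delete-by-term: one more buffered delete term. */
    void deleteDocuments(String idTerm) {
        bufferedDeleteTerms++;
        uniqueDeleteTerms.add(idTerm);
    }

    /** Update = delete-by-term followed by add, even if the id is new. */
    void updateDocument(String idTerm, String id) {
        deleteDocuments(idTerm);
        docs.add(id);
    }

    int docCount() { return docs.size(); }
    int bufferedDeleteTermCount() { return bufferedDeleteTerms; }
    int uniqueDeleteTermCount() { return uniqueDeleteTerms.size(); }
}

class BufferedDeletesSketch {
    public static void main(String[] args) {
        ToyIndexWriter viaAdd = new ToyIndexWriter();
        ToyIndexWriter viaUpdate = new ToyIndexWriter();

        // Same add-only load (unique ids, no duplicates) through both paths.
        for (int i = 0; i < 1000; i++) {
            String id = "doc-" + i;
            viaAdd.addDocument(id);
            viaUpdate.updateDocument(id, id);
        }

        System.out.println("addDocument:    docs=" + viaAdd.docCount()
                + " deleted terms=" + viaAdd.bufferedDeleteTermCount());
        System.out.println("updateDocument: docs=" + viaUpdate.docCount()
                + " deleted terms=" + viaUpdate.bufferedDeleteTermCount()
                + " (unique count=" + viaUpdate.uniqueDeleteTermCount() + ")");
    }
}
```

In this sketch both writers end up with the same 1000 documents, but only the updateDocument path has accumulated delete terms (one per document) that would later have to be applied against the existing segments.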
Deleted queries are when you delete by query, but I don't think DIH would
be doing that unless you asked it to ... maybe a Solr user/dev knows better?

Mike McCandless

http://blog.mikemccandless.com

On Fri, Jul 29, 2016 at 3:21 AM, Bernd Fehling <
bernd.fehl...@uni-bielefeld.de> wrote:

> Yes, with default of 10 it performs very much better.
> I didn't take into account that DIH uses updateDocument for adding new
> documents but after thinking about the "why" I assume that this might
> be because you don't know if a document already exists in the index.
> Conclusion, using DIH and setting segmentsPerTier to a high value is a
> killer.
>
> One question still remains about messages in INFOSTREAM, I have lines
> saying
> BD 0 [...] push deletes 24345 deleted terms (unique count=24345) 24345 deleted queries
>   bytesUsed=2313024 delGen=2265 packetCount=69 totBytesUsed=262526720
> ...
> BD 0 [...] seg=_xt(4.10.4):C50486 segGen=2370 segDeletes=[ 97145 deleted terms (unique count=0)
>   97142 deleted queries bytesUsed=3108576]; coalesced deletes=
>   [CoalescedUpdates(termSets1,queries=75721,numericDVUpdates=0,binaryDVUpdates=0)]
>   newDelCount=0
>
> Do you know what these deleted terms and deleted queries are?
>
> Best regards,
> Bernd
>
>
> On 28.07.2016 at 17:34, Michael McCandless wrote:
> > Hmm, your merge policy changes are dangerous: that will cause too many
> > segments in the index, which makes it longer to apply deletes.
> >
> > Can you revert that and re-test?
> >
> > I'm not sure why DIH is using updateDocument instead of addDocument ...
> > maybe ask on the solr-user list?
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > On Thu, Jul 28, 2016 at 10:07 AM, Bernd Fehling <
> > bernd.fehl...@uni-bielefeld.de> wrote:
> >
> >> Currently I use concurrent DIH but will write some SolrJ for testing
> >> or even as replacement for DIH.
> >> Don't know what's behind DIH if only documents are added.
> >>
> >> Not tried any newer release yet, but after reading LUCENE-6161 I really
> >> should.
> >> At least a version > 5.1
> >> Maybe before writing some SolrJ.
> >>
> >>
> >> Yes, IndexWriterConfig is changed from default:
> >> <indexConfig>
> >>   <maxIndexingThreads>8</maxIndexingThreads>
> >>   <ramBufferSizeMB>1024</ramBufferSizeMB>
> >>   <maxBufferedDocs>-1</maxBufferedDocs>
> >>   <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
> >>     <int name="maxMergeAtOnce">8</int>
> >>     <int name="segmentsPerTier">100</int>
> >>     <int name="maxMergedSegmentMB">512</int>
> >>   </mergePolicy>
> >>   <mergeFactor>8</mergeFactor>
> >>   <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
> >>   <lockType>${solr.lock.type:native}</lockType>
> >>   ...
> >> </indexConfig>
> >>
> >> A unique id as example: "ftoxfordilej:ar.1770.x.x.13.x.x.u1"
> >> Somewhere between 20 and 50 characters in length.
> >>
> >> Thanks for your help,
> >> Bernd
> >>
> >>
> >> On 28.07.2016 at 15:35, Michael McCandless wrote:
> >>> Hmm not good.
> >>>
> >>> If you are really only adding documents, you should be using
> >>> IndexWriter.addDocument, which won't buffer any deleted terms and that
> >>> method call should be a no-op. It also makes flushes more efficient
> >>> since all of your indexing buffer goes to the added documents, not
> >>> buffered delete terms. Are you using updateDocument?
> >>>
> >>> Can you reproduce this slowness on a newer release? There have been
> >>> performance issues fixed in newer releases in this method, e.g.
> >>> https://issues.apache.org/jira/browse/LUCENE-6161
> >>>
> >>> Have you changed any IndexWriterConfig settings from defaults?
> >>>
> >>> What are your unique id fields like? How many bytes in length?
> >>>
> >>> Mike McCandless
> >>>
> >>> http://blog.mikemccandless.com
> >>>
> >>> On Thu, Jul 28, 2016 at 5:01 AM, Bernd Fehling <
> >>> bernd.fehl...@uni-bielefeld.de> wrote:
> >>>
> >>>> While trying to get higher performance for indexing it turned out that
> >>>> BufferedUpdateStreams is breaking indexing performance.
> >>>> public synchronized ApplyDeletesResult applyDeletesAndUpdates(...)
> >>>>
> >>>> At IndexWriterConfig I have setRAMBufferSizeMB=1024 and the Lucene
> >>>> 4.10.4 API states:
> >>>> "Determines the amount of RAM that may be used for buffering added
> >>>> documents and deletions before they are flushed to the Directory.
> >>>> Generally for faster indexing performance it's best to flush by RAM
> >>>> usage instead of document count and use as large a RAM buffer as you
> >>>> can."
> >>>>
> >>>> Also setMaxBufferedDocs=-1 and setMaxBufferedDeleteTerms=-1.
> >>>>
> >>>> BD 0 [Wed Jul 27 13:42:03 GMT+01:00 2016; Thread-27890]: applyDeletes: infos=...
> >>>> BD 0 [Wed Jul 27 14:38:55 GMT+01:00 2016; Thread-27890]: applyDeletes took 3411845 msec
> >>>>
> >>>> About 56 minutes no indexing and only applying deletes.
> >>>> What is it deleting?
> >>>>
> >>>> If the index gets bigger the time gets longer, currently 2.5 hours of
> >>>> waiting.
> >>>> I'm adding 96 million docs with unique id, no duplicates, only add, no
> >>>> deletes.
> >>>>
> >>>> Any suggestions which config is _really_ going for high performance
> >>>> indexing?
> >>>>
> >>>> Best regards,
> >>>> Bernd
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>>>
> >>>
> >>
> >
> --
> *************************************************************
> Bernd Fehling                    Bielefeld University Library
> Dipl.-Inform. (FH)                LibTec - Library Technology
> Universitätsstr. 25                  and Knowledge Management
> 33615 Bielefeld
> Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de
>
> BASE - Bielefeld Academic Search Engine - www.base-search.net
> *************************************************************