Wonderful, thanks for bringing closure!

Mike McCandless
http://blog.mikemccandless.com

On Thu, Aug 4, 2016 at 3:14 AM, Bernd Fehling <bernd.fehl...@uni-bielefeld.de> wrote:

> After updating to version 5.5.3 it looks good now.
> Thanks a lot for your help and advice.
>
> Best regards
> Bernd
>
> On 29.07.2016 at 15:04, Michael McCandless wrote:
> > The deleted terms accumulate whenever you use updateDocument(Term, Doc),
> > or when you do deleteDocuments(Term).
> >
> > Deleted queries are when you delete by query, but I don't think DIH would
> > be doing that unless you asked it to ... maybe a Solr user/dev knows
> > better?
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > On Fri, Jul 29, 2016 at 3:21 AM, Bernd Fehling <
> > bernd.fehl...@uni-bielefeld.de> wrote:
> >
> >> Yes, with the default of 10 it performs very much better.
> >> I didn't take into account that DIH uses updateDocument for adding new
> >> documents, but after thinking about the "why" I assume that
> >> this might be because you don't know if a document already exists in the
> >> index.
> >> Conclusion: using DIH and setting segmentsPerTier to a high value is a
> >> killer.
> >>
> >> One question still remains about messages in INFOSTREAM. I have lines
> >> saying
> >> BD 0 [...] push deletes 24345 deleted terms (unique count=24345) 24345 deleted queries bytesUsed=2313024 delGen=2265 packetCount=69 totBytesUsed=262526720
> >> ...
> >> BD 0 [...] seg=_xt(4.10.4):C50486 segGen=2370 segDeletes=[ 97145 deleted terms (unique count=0) 97142 deleted queries bytesUsed=3108576]; coalesced deletes=[CoalescedUpdates(termSets1,queries=75721,numericDVUpdates=0,binaryDVUpdates=0)] newDelCount=0
> >>
> >> Do you know what these deleted terms and deleted queries are?
> >>
> >> Best regards,
> >> Bernd
> >>
> >>
> >> On 28.07.2016 at 17:34, Michael McCandless wrote:
> >>> Hmm, your merge policy changes are dangerous: that will cause too many
> >>> segments in the index, which makes it take longer to apply deletes.
> >>>
> >>> Can you revert that and re-test?
> >>>
> >>> I'm not sure why DIH is using updateDocument instead of addDocument ...
> >>> maybe ask on the solr-user list?
> >>>
> >>> Mike McCandless
> >>>
> >>> http://blog.mikemccandless.com
> >>>
> >>> On Thu, Jul 28, 2016 at 10:07 AM, Bernd Fehling <
> >>> bernd.fehl...@uni-bielefeld.de> wrote:
> >>>
> >>>> Currently I use concurrent DIH but will write some SolrJ for testing
> >>>> or even as a replacement for DIH.
> >>>> I don't know what's behind DIH if only documents are added.
> >>>>
> >>>> I have not tried any newer release yet, but after reading LUCENE-6161
> >>>> I really should.
> >>>> At least a version > 5.1.
> >>>> Maybe before writing some SolrJ.
> >>>>
> >>>>
> >>>> Yes, IndexWriterConfig is changed from the default:
> >>>> <indexConfig>
> >>>>   <maxIndexingThreads>8</maxIndexingThreads>
> >>>>   <ramBufferSizeMB>1024</ramBufferSizeMB>
> >>>>   <maxBufferedDocs>-1</maxBufferedDocs>
> >>>>   <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
> >>>>     <int name="maxMergeAtOnce">8</int>
> >>>>     <int name="segmentsPerTier">100</int>
> >>>>     <int name="maxMergedSegmentMB">512</int>
> >>>>   </mergePolicy>
> >>>>   <mergeFactor>8</mergeFactor>
> >>>>   <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
> >>>>   <lockType>${solr.lock.type:native}</lockType>
> >>>>   ...
> >>>> </indexConfig>
> >>>>
> >>>> A unique id as example: "ftoxfordilej:ar.1770.x.x.13.x.x.u1"
> >>>> Somewhere between 20 and 50 characters in length.
> >>>>
> >>>> Thanks for your help,
> >>>> Bernd
> >>>>
> >>>>
> >>>> On 28.07.2016 at 15:35, Michael McCandless wrote:
> >>>>> Hmm, not good.
> >>>>>
> >>>>> If you are really only adding documents, you should be using
> >>>>> IndexWriter.addDocument, which won't buffer any deleted terms, and
> >>>>> that method call [applyDeletesAndUpdates] should be a no-op. It also
> >>>>> makes flushes more efficient since all of your indexing buffer goes
> >>>>> to the added documents, not buffered delete terms. Are you using
> >>>>> updateDocument?
> >>>>>
> >>>>> Can you reproduce this slowness on a newer release? There have been
> >>>>> performance issues fixed in newer releases in this method, e.g.
> >>>>> https://issues.apache.org/jira/browse/LUCENE-6161
> >>>>>
> >>>>> Have you changed any IndexWriterConfig settings from defaults?
> >>>>>
> >>>>> What are your unique id fields like? How many bytes in length?
> >>>>>
> >>>>> Mike McCandless
> >>>>>
> >>>>> http://blog.mikemccandless.com
> >>>>>
> >>>>> On Thu, Jul 28, 2016 at 5:01 AM, Bernd Fehling <
> >>>>> bernd.fehl...@uni-bielefeld.de> wrote:
> >>>>>
> >>>>>> While trying to get higher performance for indexing it turned out
> >>>>>> that BufferedUpdateStreams is breaking indexing performance.
> >>>>>> public synchronized ApplyDeletesResult applyDeletesAndUpdates(...)
> >>>>>>
> >>>>>> At IndexWriterConfig I have setRAMBufferSizeMB=1024 and the Lucene
> >>>>>> 4.10.4 API states:
> >>>>>> "Determines the amount of RAM that may be used for buffering added
> >>>>>> documents and deletions before they are flushed to the Directory.
> >>>>>> Generally for faster indexing performance it's best to flush by RAM
> >>>>>> usage instead of document count and use as large a RAM buffer as you
> >>>>>> can."
> >>>>>>
> >>>>>> Also setMaxBufferedDocs=-1 and setMaxBufferedDeleteTerms=-1.
> >>>>>>
> >>>>>> BD 0 [Wed Jul 27 13:42:03 GMT+01:00 2016; Thread-27890]: applyDeletes: infos=...
> >>>>>> BD 0 [Wed Jul 27 14:38:55 GMT+01:00 2016; Thread-27890]: applyDeletes took 3411845 msec
> >>>>>>
> >>>>>> About 56 minutes of no indexing, only applying deletes.
> >>>>>> What is it deleting?
> >>>>>>
> >>>>>> As the index gets bigger the time gets longer, currently 2.5 hours
> >>>>>> of waiting.
> >>>>>> I'm adding 96 million docs with unique ids, no duplicates, only adds,
> >>>>>> no deletes.
> >>>>>>
> >>>>>> Any suggestions which config is _really_ going for high performance
> >>>>>> indexing?
> >>>>>>
> >>>>>> Best regards,
> >>>>>> Bernd
> >>>>>>
> >>>>>> ---------------------------------------------------------------------
> >>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>>>>>
> >>
> >> --
> >> *************************************************************
> >> Bernd Fehling                    Bielefeld University Library
> >> Dipl.-Inform. (FH)                LibTec - Library Technology
> >> Universitätsstr. 25                  and Knowledge Management
> >> 33615 Bielefeld
> >> Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de
> >>
> >> BASE - Bielefeld Academic Search Engine - www.base-search.net
> >> *************************************************************
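
For readers landing on this thread: the merge policy revert that was suggested (and that Bernd reported as performing "very much better") could look roughly like the fragment below. This is only a sketch and not Bernd's actual final configuration, which the thread never shows. It reuses the <mergePolicy> syntax from the 4.10.x config quoted above and sets maxMergeAtOnce and segmentsPerTier to the TieredMergePolicy defaults of 10. Leaving all other indexConfig settings as they were is an assumption here. Newer Solr releases may expect a different element for configuring the merge policy, so check the reference guide for your version.

```xml
<!-- Sketch only: TieredMergePolicy back at its defaults.
     The config quoted in the thread used maxMergeAtOnce=8 and
     segmentsPerTier=100; the high segmentsPerTier value is what made
     applying deletes so slow. -->
<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">10</int>
  <int name="segmentsPerTier">10</int>
</mergePolicy>
```

Removing the <mergePolicy> element altogether should have the same effect, since these are the default values. Note that the resolution reported in the thread also involved upgrading to 5.5.3, so the revert alone may not account for the whole improvement.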