Re: BufferedUpdateStreams breaks high performance indexing

Michael McCandless Thu, 04 Aug 2016 06:03:14 -0700

Wonderful, thanks for bringing closure!

Mike McCandless


http://blog.mikemccandless.com

On Thu, Aug 4, 2016 at 3:14 AM, Bernd Fehling <
bernd.fehl...@uni-bielefeld.de> wrote:

> After updating to version 5.5.3 it looks good now.
> Thanks a lot for your help and advice.
>
> Best regards
> Bernd
>
> Am 29.07.2016 um 15:04 schrieb Michael McCandless:
> > The deleted terms accumulate whenever you use updateDocument(Term, Doc), or
> > when you do deleteDocuments(Term).
> >
> > Deleted queries are when you delete by query, but I don't think DIH would
> > be doing that unless you asked it to ... maybe a Solr user/dev knows better?
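
To make the distinction above concrete, here is a toy model in plain Java. It is not Lucene code: the class ToyWriter and all of its methods are hypothetical names chosen for illustration, and it only tracks what each kind of call would leave behind in the writer's buffers. The point it sketches is that an update buffers one deleted term per call even when no older document exists, while a plain add buffers none.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model, NOT Lucene code: every name here is hypothetical.
// It only counts what each kind of call leaves in the writer's buffers.
class ToyWriter {
    private int bufferedDocs = 0;
    private final List<String> deletedTerms = new ArrayList<>();
    private final List<String> deletedQueries = new ArrayList<>();

    // Plain add: nothing has to be removed first, so no delete is buffered.
    void addDocument(String doc) {
        bufferedDocs++;
    }

    // Update means "delete whatever matches idTerm, then add doc", so each
    // call buffers one deleted term even if no older document exists.
    void updateDocument(String idTerm, String doc) {
        deletedTerms.add(idTerm);
        bufferedDocs++;
    }

    // Explicit delete by term: also buffers one deleted term.
    void deleteByTerm(String term) {
        deletedTerms.add(term);
    }

    // Delete by query: buffered separately, as a deleted query.
    void deleteByQuery(String query) {
        deletedQueries.add(query);
    }

    int bufferedDocCount() { return bufferedDocs; }
    int deletedTermCount() { return deletedTerms.size(); }
    int deletedQueryCount() { return deletedQueries.size(); }

    public static void main(String[] args) {
        ToyWriter addOnly = new ToyWriter();
        ToyWriter viaUpdate = new ToyWriter();
        for (int i = 0; i < 1000; i++) {
            addOnly.addDocument("doc" + i);
            viaUpdate.updateDocument("id:" + i, "doc" + i);
        }
        System.out.println("addDocument:    deleted terms = " + addOnly.deletedTermCount());
        System.out.println("updateDocument: deleted terms = " + viaUpdate.deletedTermCount());
    }
}
```

Both writers end up holding the same number of documents; only the update path carries a pile of buffered deleted terms that later has to be applied.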
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > On Fri, Jul 29, 2016 at 3:21 AM, Bernd Fehling <
> > bernd.fehl...@uni-bielefeld.de> wrote:
> >
> >> Yes, with the default of 10 it performs much better.
> >> I didn't take into account that DIH uses updateDocument for adding new
> >> documents, but after thinking about the "why" I assume this is because
> >> DIH can't know whether a document already exists in the index.
> >> Conclusion: using DIH and setting segmentsPerTier to a high value is a
> >> killer.
> >>
> >> One question still remains about messages in INFOSTREAM, I have lines
> >> saying
> >>
> >> BD 0 [...] push deletes 24345 deleted terms (unique count=24345) 24345 deleted queries bytesUsed=2313024 delGen=2265 packetCount=69 totBytesUsed=262526720
> >> ...
> >> BD 0 [...] seg=_xt(4.10.4):C50486 segGen=2370 segDeletes=[ 97145 deleted terms (unique count=0) 97142 deleted queries bytesUsed=3108576]; coalesced deletes=[CoalescedUpdates(termSets1,queries=75721,numericDVUpdates=0,binaryDVUpdates=0)] newDelCount=0
> >>
> >> Do you know what these deleted terms and deleted queries are?
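
For anyone who wants to track these counters over a long indexing run, a small parser along these lines may help. It is a sketch only: the regular expression is an assumption derived from the two INFOSTREAM lines quoted in this message, the class name DeleteCounts is made up, and other Lucene versions may format these lines differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: pulls the delete counters out of an infoStream line.
// The pattern is an assumption based on the lines quoted in this thread.
class DeleteCounts {
    private static final Pattern COUNTS = Pattern.compile(
        "(\\d+)\\s+deleted\\s+terms\\s+\\(unique count=(\\d+)\\)\\s+(\\d+)\\s+deleted\\s+queries");

    final long terms;
    final long uniqueTerms;
    final long queries;

    private DeleteCounts(long terms, long uniqueTerms, long queries) {
        this.terms = terms;
        this.uniqueTerms = uniqueTerms;
        this.queries = queries;
    }

    // Returns null when the line does not carry delete counters.
    static DeleteCounts parse(String line) {
        Matcher m = COUNTS.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new DeleteCounts(
            Long.parseLong(m.group(1)),
            Long.parseLong(m.group(2)),
            Long.parseLong(m.group(3)));
    }

    public static void main(String[] args) {
        String line = "BD 0 [...] push deletes 24345 deleted terms (unique count=24345) "
            + "24345 deleted queries bytesUsed=2313024 delGen=2265 packetCount=69 "
            + "totBytesUsed=262526720";
        DeleteCounts c = DeleteCounts.parse(line);
        System.out.println("terms=" + c.terms + " unique=" + c.uniqueTerms
            + " queries=" + c.queries);
    }
}
```

Feeding every "push deletes" line through parse and summing the fields gives a quick picture of how many deleted terms and deleted queries pile up between flushes.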
> >>
> >> Best regards,
> >> Bernd
> >>
> >>
> >> Am 28.07.2016 um 17:34 schrieb Michael McCandless:
> >>> Hmm, your merge policy changes are dangerous: that will cause too many
> >>> segments in the index, which makes it longer to apply deletes.
> >>>
> >>> Can you revert that and re-test?
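
A rough way to see why the segment count matters is sketched below. It assumes, as a simplification, that every buffered delete term has to be looked up once in every segment, so the work grows with terms times segments. The term count 97145 comes from the segDeletes line reported in this thread; the segment counts 10 and 100 are hypothetical round numbers, not measurements from this index, and the class name is made up.

```java
// Back-of-the-envelope sketch, not a measurement.
// Assumption: each buffered delete term is looked up once per segment.
class DeleteLookupEstimate {
    static long lookups(long bufferedTerms, long segments) {
        // multiplyExact throws on overflow instead of silently wrapping.
        return Math.multiplyExact(bufferedTerms, segments);
    }

    public static void main(String[] args) {
        long terms = 97_145L; // from the segDeletes line reported in this thread
        for (long segments : new long[] {10L, 100L}) { // hypothetical segment counts
            System.out.println(segments + " segments -> "
                + lookups(terms, segments) + " term lookups");
        }
    }
}
```

Under that assumption, ten times as many segments means ten times as many lookups for the same set of buffered delete terms.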
> >>>
> >>> I'm not sure why DIH is using updateDocument instead of addDocument ...
> >>> maybe ask on the solr-user list?
> >>>
> >>> Mike McCandless
> >>>
> >>> http://blog.mikemccandless.com
> >>>
> >>> On Thu, Jul 28, 2016 at 10:07 AM, Bernd Fehling <
> >>> bernd.fehl...@uni-bielefeld.de> wrote:
> >>>
> >>>> Currently I use concurrent DIH, but I will write some SolrJ code for
> >>>> testing, or even as a replacement for DIH.
> >>>> I don't know what DIH does internally when only documents are added.
> >>>>
> >>>> I haven't tried any newer release yet, but after reading LUCENE-6161 I
> >>>> really should, at least a version > 5.1,
> >>>> maybe before writing any SolrJ code.
> >>>>
> >>>>
> >>>> Yes IndexWriterConfig is changed from default:
> >>>> <indexConfig>
> >>>>     <maxIndexingThreads>8</maxIndexingThreads>
> >>>>     <ramBufferSizeMB>1024</ramBufferSizeMB>
> >>>>     <maxBufferedDocs>-1</maxBufferedDocs>
> >>>>     <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
> >>>>       <int name="maxMergeAtOnce">8</int>
> >>>>       <int name="segmentsPerTier">100</int>
> >>>>       <int name="maxMergedSegmentMB">512</int>
> >>>>     </mergePolicy>
> >>>>     <mergeFactor>8</mergeFactor>
> >>>>     <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
> >>>>     <lockType>${solr.lock.type:native}</lockType>
> >>>>     ...
> >>>> </indexConfig>
> >>>>
> >>>> A unique id as example: "ftoxfordilej:ar.1770.x.x.13.x.x.u1"
> >>>> Somewhere between 20 and 50 characters in length.
> >>>>
> >>>> Thanks for your help,
> >>>> Bernd
> >>>>
> >>>>
> >>>> Am 28.07.2016 um 15:35 schrieb Michael McCandless:
> >>>>> Hmm, not good.
> >>>>>
> >>>>> If you are really only adding documents, you should be using
> >>>>> IndexWriter.addDocument, which won't buffer any deleted terms, and that
> >>>>> method call (applyDeletesAndUpdates) should be a no-op.  It also makes
> >>>>> flushes more efficient, since all of your indexing buffer goes to the
> >>>>> added documents, not buffered delete terms.  Are you using updateDocument?
> >>>>>
> >>>>> Can you reproduce this slowness on a newer release?  There have been
> >>>>> performance issues fixed in newer releases in this method, e.g.
> >>>>> https://issues.apache.org/jira/browse/LUCENE-6161
> >>>>>
> >>>>> Have you changed any IndexWriterConfig settings from defaults?
> >>>>>
> >>>>> What are your unique id fields like?  How many bytes in length?
> >>>>>
> >>>>> Mike McCandless
> >>>>>
> >>>>> http://blog.mikemccandless.com
> >>>>>
> >>>>> On Thu, Jul 28, 2016 at 5:01 AM, Bernd Fehling <
> >>>>> bernd.fehl...@uni-bielefeld.de> wrote:
> >>>>>
> >>>>>> While trying to get higher indexing performance, it turned out that
> >>>>>> BufferedUpdatesStream is breaking indexing performance:
> >>>>>> public synchronized ApplyDeletesResult applyDeletesAndUpdates(...)
> >>>>>>
> >>>>>> In IndexWriterConfig I have setRAMBufferSizeMB=1024, and the Lucene
> >>>>>> 4.10.4 API states:
> >>>>>> "Determines the amount of RAM that may be used for buffering added
> >>>>>> documents and deletions before they are flushed to the Directory.
> >>>>>> Generally for faster indexing performance it's best to flush by RAM
> >>>>>> usage instead of document count and use as large a RAM buffer as you can."
> >>>>>>
> >>>>>> Also setMaxBufferedDocs=-1 and setMaxBufferedDeleteTerms=-1.
> >>>>>>
> >>>>>> BD 0 [Wed Jul 27 13:42:03 GMT+01:00 2016; Thread-27890]: applyDeletes: infos=...
> >>>>>> BD 0 [Wed Jul 27 14:38:55 GMT+01:00 2016; Thread-27890]: applyDeletes took 3411845 msec
> >>>>>>
> >>>>>> That is about 56 minutes with no indexing, only applying deletes.
> >>>>>> What is it deleting?
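
As a quick sanity check on the figures above, the reported 3411845 msec can be compared with the gap between the two log timestamps (13:42:03 and 14:38:55). The sketch below uses only java.time; the class and method names are made up for illustration, and the timestamps are treated as same-day local times.

```java
import java.time.Duration;
import java.time.LocalTime;

// Sketch: cross-checks the "applyDeletes took ... msec" value against the
// two timestamps in the log lines above (assumed to be on the same day).
class ApplyDeletesDuration {
    // Duration as reported by the infoStream line.
    static Duration reported() {
        return Duration.ofMillis(3_411_845L);
    }

    // Wall-clock gap between the start and end log timestamps.
    static Duration wallClock() {
        return Duration.between(LocalTime.of(13, 42, 3), LocalTime.of(14, 38, 55));
    }

    static String format(Duration d) {
        return d.toMinutes() + " min " + (d.getSeconds() % 60) + " s";
    }

    public static void main(String[] args) {
        System.out.println("reported:   " + format(reported()));
        System.out.println("wall clock: " + format(wallClock()));
    }
}
```

The two values agree to within a second, which is the resolution of the log timestamps, so the whole gap between the two lines was spent inside applyDeletes.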
> >>>>>>
> >>>>>> As the index gets bigger the time gets longer; currently it is 2.5
> >>>>>> hours of waiting.
> >>>>>> I'm adding 96 million docs with a unique id: no duplicates, only adds,
> >>>>>> no deletes.
> >>>>>>
> >>>>>> Any suggestions on which config is _really_ suited to high performance
> >>>>>> indexing?
> >>>>>>
> >>>>>> Best regards,
> >>>>>> Bernd
> >>>>>>
> >>>>>> ---------------------------------------------------------------------
> >>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>
> >> --
> >> *************************************************************
> >> Bernd Fehling                    Bielefeld University Library
> >> Dipl.-Inform. (FH)                LibTec - Library Technology
> >> Universitätsstr. 25                  and Knowledge Management
> >> 33615 Bielefeld
> >> Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de
> >>
> >> BASE - Bielefeld Academic Search Engine - www.base-search.net
> >> *************************************************************
> >>
> >>
> >>
> >
>
>
>
>
