Thanks Endika! https://issues.apache.org/jira/browse/SOLR-14923

@DavidSmiley do you think this could be related to the issue I have
described? I will certainly update our Solr image, but it would be good to
know the root cause of the issue. Your comment on this would be very
helpful. Thanks

On Wed, Jul 28, 2021 at 7:16 AM Endika Posadas <endikaposa...@gmail.com> wrote:

> There were some big changes related to child indexing in Solr 8.8, under
> this ticket: https://issues.apache.org/jira/browse/SOLR-14923
> It's worth updating Solr to the latest 8.8 and trying again; perhaps your
> indexing issue has already been fixed.
>
> On 2021/07/27 19:44:13, Pratik Patel <pra...@semandex.net> wrote:
> > So it looks like I have narrowed down where the problem is and have
> > also found a workaround, but I would like to understand more.
> >
> > As I had mentioned, we have two stages in our bulk indexing operation:
> >
> > stage 1: index Article documents [A1, A2, ..., An]
> > stage 2: index Article documents with children [A1 with children,
> > A2 with children, ..., An with children]
> >
> > We were always running into issues in stage 2. After some time in
> > stage 2, *solrClient.add(<setOfBlockJoinDocs>, commitWithin)* starts
> > to time out, and then these timeouts happen consistently. Even a
> > socketTimeout of 30 minutes was exceeded by the add call, and we got a
> > SocketTimeoutException.
> >
> > We have set commitWithin to 6 hours to avoid unnecessary soft commits.
> > The autoCommit interval is 1 minute with openSearcher=false, and the
> > autoSoftCommit interval is 5 minutes.
> >
> > As mentioned above, we first index just the Articles in stage 1, and
> > then in stage 2 the same set of Articles is indexed with children
> > (block join). I suspected that the huge amount of time taken by the
> > *solrClient.add* call could have something to do with the *block join
> > updates* that take place in stage 2. Adding fresh blocks of Articles
> > with children to an empty collection was much faster and ran without a
> > SocketTimeoutException. So I modified our indexing pipeline as follows
> > (see the sketch just below the list):
> >
> > 1. stage 1: index Article documents [A1, A2, ..., An]
> > 2. delete all the Article documents
> > 3. stage 2: index Article documents with children [A1 with children,
> > A2 with children, ..., An with children]
> >
> > With this change, stage 2 is a simple *add operation and not an update
> > operation*. I tested the bulk indexing with this change and it
> > finished successfully, without any issues, in a shorter time!
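> >
> > To make this concrete, here is a minimal SolrJ sketch of the revised
> > three-step pipeline. The collection URL, field names (type_s,
> > articleKeywordJoin), and document counts are illustrative placeholders,
> > not our real schema:
> >
> >     import java.io.IOException;
> >     import java.util.ArrayList;
> >     import java.util.List;
> >
> >     import org.apache.solr.client.solrj.SolrServerException;
> >     import org.apache.solr.client.solrj.impl.HttpSolrClient;
> >     import org.apache.solr.common.SolrInputDocument;
> >
> >     public class TwoStageIndexer {
> >         // commitWithin of 6 hours, matching the setting described above
> >         static final int COMMIT_WITHIN_MS = 6 * 60 * 60 * 1000;
> >
> >         public static void main(String[] args)
> >                 throws SolrServerException, IOException {
> >             try (HttpSolrClient client = new HttpSolrClient.Builder(
> >                     "http://localhost:8983/solr/articles").build()) {
> >
> >                 // stage 1: plain Article documents [A1..An]
> >                 List<SolrInputDocument> articles = new ArrayList<>();
> >                 for (int i = 1; i <= 3; i++) {
> >                     SolrInputDocument article = new SolrInputDocument();
> >                     article.addField("id", "A" + i);
> >                     article.addField("type_s", "article");
> >                     articles.add(article);
> >                 }
> >                 client.add(articles, COMMIT_WITHIN_MS);
> >
> >                 // step 2: delete the plain Articles and hard-commit the
> >                 // deletes, so that stage 2 is a fresh add rather than an
> >                 // update of an existing document into a block
> >                 client.deleteByQuery("type_s:article");
> >                 client.commit();
> >
> >                 // stage 2: the same Articles as block-join parents, with
> >                 // their Article-Keyword Join documents as children
> >                 List<SolrInputDocument> blocks = new ArrayList<>();
> >                 for (int i = 1; i <= 3; i++) {
> >                     SolrInputDocument parent = new SolrInputDocument();
> >                     parent.addField("id", "A" + i);
> >                     parent.addField("type_s", "article");
> >
> >                     SolrInputDocument child = new SolrInputDocument();
> >                     child.addField("id", "A" + i + "-K1");
> >                     child.addField("type_s", "articleKeywordJoin");
> >                     parent.addChildDocument(child);
> >
> >                     blocks.add(parent);
> >                 }
> >                 client.add(blocks, COMMIT_WITHIN_MS);
> >             }
> >         }
> >     }
> >
> > The important part is the deleteByQuery plus hard commit between the two
> > stages; with that in place, the stage 2 adds always land in a collection
> > that does not yet contain the parent documents.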
> >
> > It would be very helpful to know the difference between:
> > A: adding a document with children when the collection does not
> > already have the same document
> > B: adding a document with children when the collection already has the
> > same document, without children
> >
> > I understand that an *update* takes place in B, but how can we explain
> > such a difference in performance between A and B?
> >
> > Please note that we use RxJava and call solrClient.add() in parallel
> > threads with a set of Article documents, and the socketTimeout issue
> > seems to pop up after we have already indexed about 90% of the
> > documents.
> >
> > Some more clarity on what could be happening would be very useful.
> >
> > Thanks
> >
> > On Fri, Jul 23, 2021 at 2:31 PM Pratik Patel <pra...@semandex.net> wrote:
> >
> > > Hi All,
> > >
> > > *tl;dr*: running into long GC pauses and Solr client socket timeouts
> > > when indexing a bulk of documents into Solr. The commit strategy, in
> > > essence, is to do hard commits at an interval of 50k documents
> > > (maxDocs=50000) and to disable soft commits altogether during bulk
> > > indexing. Simple SolrCloud setup with one node and one shard.
> > >
> > > *Details*:
> > > We have about 6 million documents which we are trying to index into
> > > Solr. Of these, about 500k documents have a text field which holds
> > > abstracts of scientific papers/Articles. We extract keywords from
> > > these abstracts and index the keywords into Solr as well.
> > >
> > > We have a many-to-many kind of relationship between Articles and
> > > Keywords. To store this, we have the following structure:
> > >
> > > Article documents
> > > Keyword documents
> > > Article-Keyword Join documents
> > >
> > > We use block join to index Articles together with their
> > > "Article-Keyword" Join documents, while Keyword documents are indexed
> > > independently. In other words, we have blocks of "Article +
> > > Article-Keyword Joins", and we have Keyword documents (which hold
> > > some additional metadata about the keyword).
> > >
> > > We have a bulk processing operation which creates these documents and
> > > indexes them into Solr. During this bulk indexing, we don't need the
> > > documents to be searchable. We need to search against them only after
> > > ALL the documents are indexed.
> > >
> > > *Based on this, our current strategy is:*
> > > Soft commits are disabled, and hard commits are done at an interval
> > > of 50k documents with openSearcher=false. Our code triggers explicit
> > > commits 4 times, after various stages of bulk indexing. Transaction
> > > logs are enabled and have default settings.
> > >
> > > <autoCommit>
> > >   <maxTime>${solr.autoCommit.maxTime:-1}</maxTime>
> > >   <maxDocs>${solr.autoCommit.maxDocs:50000}</maxDocs>
> > >   <openSearcher>false</openSearcher>
> > > </autoCommit>
> > >
> > > <autoSoftCommit>
> > >   <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
> > > </autoSoftCommit>
> > >
> > > Other environmental details:
> > > Xms=8g and Xmx=14g; Solr client socketTimeout=7 minutes;
> > > zkClientTimeout=2 minutes.
> > > Our indexing operation triggers many "add" operations in parallel
> > > using RxJava (15 to 30 threads); each "add" operation is passed about
> > > 1000 documents. (A sketch of this setup follows at the end of this
> > > message.)
> > >
> > > Currently, when we run this indexing operation, we notice that after
> > > a while Solr goes into long GC pauses (longer than our socketTimeout
> > > of 7 minutes) and we get SocketTimeoutExceptions.
> > >
> > > *What could be causing such long GC pauses?*
> > >
> > > *Does this commit strategy make sense? If not, what is the
> > > recommended strategy that we can look into?*
> > >
> > > *Any help on this is much appreciated. Thanks.*
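> > >
> > > For reference, here is a minimal sketch of the client setup and the
> > > parallel batched adds described above, assuming SolrJ and RxJava 2.
> > > The URL, field names, and document counts are placeholders, and error
> > > handling is omitted:
> > >
> > >     import java.io.IOException;
> > >     import java.util.ArrayList;
> > >     import java.util.List;
> > >
> > >     import org.apache.solr.client.solrj.impl.HttpSolrClient;
> > >     import org.apache.solr.common.SolrInputDocument;
> > >
> > >     import io.reactivex.Flowable;
> > >     import io.reactivex.schedulers.Schedulers;
> > >
> > >     public class ParallelBulkIndexer {
> > >         public static void main(String[] args) throws IOException {
> > >             // client-side timeouts as described above (milliseconds)
> > >             try (HttpSolrClient client = new HttpSolrClient.Builder(
> > >                     "http://localhost:8983/solr/articles")
> > >                     .withSocketTimeout(7 * 60 * 1000)  // 7 minutes
> > >                     .withConnectionTimeout(60 * 1000)
> > >                     .build()) {
> > >
> > >                 // placeholder documents standing in for the real ~6M
> > >                 List<SolrInputDocument> allDocs = new ArrayList<>();
> > >                 for (int i = 0; i < 10_000; i++) {
> > >                     SolrInputDocument doc = new SolrInputDocument();
> > >                     doc.addField("id", "A" + i);
> > >                     allDocs.add(doc);
> > >                 }
> > >
> > >                 // batches of ~1000 docs added from 15 parallel
> > >                 // threads; no explicit commits here, since hard
> > >                 // commits are driven by the server-side autoCommit
> > >                 // maxDocs=50000 setting
> > >                 Flowable.fromIterable(allDocs)
> > >                         .buffer(1000)
> > >                         .parallel(15)
> > >                         .runOn(Schedulers.io())
> > >                         .map(batch -> client.add(batch))
> > >                         .sequential()
> > >                         .blockingSubscribe();
> > >             }
> > >         }
> > >     }
> > >
> > > (Whether the batching looks exactly like this is an implementation
> > > detail; the point is 15 to 30 threads, each sending about 1000
> > > documents per add call.)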