There were some big changes related to child document indexing in Solr 8.8, under this ticket: https://issues.apache.org/jira/browse/SOLR-14923 It's worth upgrading Solr to the latest 8.8 release and trying again; perhaps your indexing issue has already been fixed.
On 2021/07/27 19:44:13, Pratik Patel <pra...@semandex.net> wrote:
> So it looks like I have narrowed down where the problem is and have also
> found a workaround, but I would like to understand more.
>
> As I had mentioned, we have two stages in our bulk indexing operation.
>
> stage 1: index Article documents [A1, A2, ..., An]
> stage 2: index Article documents with children [A1 with children, A2 with
> children, ..., An with children]
>
> We were always running into issues in stage 2.
> After some time in stage 2, *solrClient.add(<setOfBlockJoinDocs>,
> commitWithin)* starts to time out, and then these timeouts happen
> consistently. Even a socketTimeout of 30 minutes was exceeded by the add
> call, and we got a SocketTimeoutException.
>
> We have set commitWithin to 6 hours to avoid unnecessary soft commits.
> The auto commit interval is 1 minute with openSearcher=false, and the
> autoSoftCommit interval is 5 minutes.
>
> As mentioned above, we first index just the Articles in stage 1, and then
> in stage 2 the same set of Articles is indexed with children (block join).
> I had a suspicion that the huge amount of time taken by the
> *solrClient.add* call could have something to do with the *block join
> updates* that take place in stage 2. Adding fresh joins of Articles with
> children to an empty collection was much faster and ran without a
> SocketTimeoutException. So I modified our indexing pipeline to be as
> follows.
>
> 1. stage 1: index Article documents [A1, A2, ..., An]
> 2. delete all the Article documents
> 3. stage 2: index Article documents with children [A1 with children, A2
> with children, ..., An with children]
>
> With this change, stage 2 becomes a simple *add operation and not an
> update operation*. I tested the bulk indexing with this change and it
> finished successfully, without any issues, in a shorter time period!
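[Editor's note: for reference, an "Article with children" block of the kind described above might look like the following in Solr's JSON update format. The field names are illustrative guesses, not taken from the thread; `_childDocuments_` is Solr's classic anonymous-children syntax for block-join indexing.]

```json
[
  {
    "id": "A1",
    "type_s": "Article",
    "title_t": "Some article title",
    "_childDocuments_": [
      { "id": "A1-K1", "type_s": "ArticleKeywordJoin", "keywordId_s": "K1" },
      { "id": "A1-K2", "type_s": "ArticleKeywordJoin", "keywordId_s": "K2" }
    ]
  }
]
```

The key constraint in block-join indexing is that the parent and all of its children must be sent together as one block, which is why stage 2 re-sends each Article alongside its join documents.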
> It will be very helpful to know the difference between:
> A: adding a document with children when the collection does not already
> have the same document
> B: adding a document with children when the collection already has the
> same document without children
>
> I understand that an *update* takes place in B, but how can we explain
> such a difference in performance between A and B?
>
> Please note that we use RxJava and call solrClient.add() in parallel
> threads with a set of Article documents, and the socketTimeout issue seems
> to pop up after we have already indexed about 90% of the documents.
>
> Some more clarity on what could be happening will be very useful.
>
> Thanks
>
> On Fri, Jul 23, 2021 at 2:31 PM Pratik Patel <pra...@semandex.net> wrote:
> >
> > Hi All,
> >
> > *tl;dr*: we are running into long GC pauses and Solr client socket
> > timeouts when indexing a bulk of documents into Solr. The commit
> > strategy, in essence, is to do hard commits at an interval of 50k
> > documents (maxDocs=50k) and disable soft commits altogether during bulk
> > indexing. Simple SolrCloud setup with one node and one shard.
> >
> > *Details*:
> > We have about 6 million documents which we are trying to index into
> > Solr. Of these, about 500k documents have a text field which holds the
> > Abstracts of scientific papers/Articles. We extract keywords from these
> > Abstracts and index these keywords into Solr as well.
> >
> > We have a many-to-many kind of relationship between Articles and
> > keywords. To store this, we have the following structure:
> >
> > Article documents
> > Keyword documents
> > Article-Keyword Join documents
> >
> > We use block join to index Articles with "Article-Keyword" join
> > documents, and Keyword documents are indexed independently.
> >
> > In other words, we have blocks of "Article + Article-Keyword Joins" and
> > we have Keyword documents (they hold some additional metadata about the
> > keyword).
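[Editor's note: the delete-then-re-add workaround described earlier in the thread could be sketched in SolrJ roughly as below. The collection name, ZooKeeper address, query, and field names are illustrative assumptions, and this needs a running Solr instance; it is a sketch of the approach, not the poster's actual code.]

```java
import java.util.List;
import java.util.Optional;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BulkReindexSketch {
    // commitWithin of 6 hours, as described in the thread
    private static final int COMMIT_WITHIN_MS = 6 * 60 * 60 * 1000;

    public static void main(String[] args) throws Exception {
        try (SolrClient client = new CloudSolrClient.Builder(
                List.of("localhost:2181"), Optional.empty()).build()) {

            // Stage 1 indexed plain Article docs earlier. Before stage 2,
            // delete them so the block-join adds are fresh inserts, not
            // updates of existing documents.
            client.deleteByQuery("articles", "type_s:Article");
            client.commit("articles");

            // Stage 2: re-add each Article together with its children so
            // the whole block is indexed as one unit.
            SolrInputDocument article = new SolrInputDocument();
            article.addField("id", "A1");
            article.addField("type_s", "Article");

            SolrInputDocument join = new SolrInputDocument();
            join.addField("id", "A1-K1");
            join.addField("type_s", "ArticleKeywordJoin");
            article.addChildDocument(join);

            client.add("articles", List.of(article), COMMIT_WITHIN_MS);
        }
    }
}
```

As for why B (re-adding a parent that already exists without children) is so much slower than A: in Solr, an "update" of an existing document is a delete-plus-reinsert, so stage 2 in the original pipeline was carrying deletes for every Article in addition to writing the new blocks, which the workaround avoids.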
> >
> > We have a bulk processing operation which creates these documents and
> > indexes them into Solr. During this bulk indexing, we don't need the
> > documents to be searchable. We need to search against them only after
> > ALL the documents are indexed.
> >
> > *Based on this, this is our current strategy:*
> > Soft commits are disabled, and hard commits are done at an interval of
> > 50k documents with openSearcher=false. Our code triggers explicit
> > commits 4 times after various stages of bulk indexing. Transaction logs
> > are enabled and have default settings.
> >
> > <autoCommit>
> >   <maxTime>${solr.autoCommit.maxTime:-1}</maxTime>
> >   <maxDocs>${solr.autoCommit.maxDocs:50000}</maxDocs>
> >   <openSearcher>false</openSearcher>
> > </autoCommit>
> >
> > <autoSoftCommit>
> >   <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
> > </autoSoftCommit>
> >
> > Other environmental details:
> > Xms=8g and Xmx=14g, Solr client socketTimeout=7 minutes, and
> > zkClientTimeout=2 minutes.
> > Our indexing operation triggers many "add" operations in parallel using
> > RxJava (15 to 30 threads); each "add" operation is passed about 1000
> > documents.
> >
> > Currently, when we run this indexing operation, we notice that after a
> > while Solr goes into long GC pauses (longer than our socketTimeout of 7
> > minutes) and we get SocketTimeoutExceptions.
> >
> > *What could be causing such long GC pauses?*
> >
> > *Does this commit strategy make sense? If not, what is the recommended
> > strategy that we can look into?*
> >
> > *Any help on this is much appreciated. Thanks.*
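[Editor's note: one commonly recommended adjustment for long GC pauses in Solr, offered here as a general operational practice rather than something verified against this specific setup, is to set Xms equal to Xmx so the heap never resizes under indexing load, and to enable GC logging so the pauses can actually be diagnosed. In solr.in.sh that could look like the following; the log path is an illustrative assumption.]

```shell
# Fix the heap size up front; growing the heap from 8g toward 14g under
# heavy indexing load can itself contribute to long GC cycles.
SOLR_JAVA_MEM="-Xms14g -Xmx14g"

# Unified GC logging (Java 9+); adjust the file path for your install.
GC_LOG_OPTS="-Xlog:gc*:file=/var/solr/logs/solr_gc.log:time,uptime:filecount=9,filesize=20M"
```

With 15 to 30 parallel add calls of ~1000 documents each, a large volume of in-flight documents sits on the heap between the infrequent commits, so GC logs are the first place to look before tuning further.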