So it looks like I have narrowed down where the problem is and have also found a workaround, but I would like to understand it better.
As I had mentioned, we have two stages in our bulk indexing operation:

stage 1: index Article documents [A1, A2.....An]
stage 2: index Article documents with children [A1 with children, A2 with children......An with children]

We were always running into issues in stage 2. After some time in stage 2, *solrClient.add( <setOfBlockJoinDocs>, commitWithin )* starts to time out, and then these timeouts happen consistently. Even a socketTimeout of 30 minutes was exceeded by the add call and we got a SocketTimeoutException. We have set commitWithin to 6 hours to avoid unnecessary soft commits. The autoCommit interval is 1 minute with openSearcher=false, and the autoSoftCommit interval is 5 minutes.

As mentioned above, we first index just the Articles in stage 1, and then in stage 2 the same set of Articles is indexed with children (block join). I had a suspicion that the huge amount of time taken by the *solrClient.add* call could have something to do with the *block join updates* that take place in stage 2: adding fresh blocks of Articles with children to an empty collection was much faster and ran without a SocketTimeout. So I modified our indexing pipeline as follows (a rough sketch of the resulting stage 2 add is included below):

1. stage 1: index Article documents [A1, A2.....An]
2. delete all the Article documents
3. stage 2: index Article documents with children [A1 with children, A2 with children......An with children]

With this change, stage 2 is a simple *add operation and not an update operation*. I tested the bulk indexing with this change and it finished successfully, without any issues, in a shorter period of time!

It will be very helpful to know what the difference is between:

A: adding a document with children when the collection does not already have the same document
B: adding a document with children when the collection already has the same document without children

I understand that an *update* takes place in B, but how can we explain such a difference in performance between A and B?

Please note that we use RxJava and call solrClient.add() in parallel threads, each with a set of Article documents, and the socketTimeout issue seems to pop up after we have already indexed about 90% of the documents.
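To make this concrete, here is a minimal sketch of what one stage 2 batch looks like after the change. The client construction, collection URL, field names, IDs and the delete query are all illustrative placeholders rather than our actual schema; the commitWithin value is the 6 hours mentioned above.

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class Stage2BlockJoinSketch {

    // commitWithin of 6 hours, as described above
    private static final int COMMIT_WITHIN_MS = 6 * 60 * 60 * 1000;

    public static void main(String[] args) throws Exception {
        SolrClient solrClient =
                new HttpSolrClient.Builder("http://localhost:8983/solr/articles").build();

        // Step 2 of the modified pipeline: remove the plain Article documents
        // indexed in stage 1, so that stage 2 becomes a pure add instead of an
        // update of existing documents. The query is a placeholder; anything
        // that matches only the stage 1 parents would do.
        solrClient.deleteByQuery("content_type:article", COMMIT_WITHIN_MS);

        // Step 3: index each Article as a parent block with its
        // Article-Keyword join documents nested as children.
        List<SolrInputDocument> blockJoinDocs = new ArrayList<>();

        SolrInputDocument article = new SolrInputDocument();
        article.addField("id", "A1");
        article.addField("content_type", "article");

        SolrInputDocument keywordJoin = new SolrInputDocument();
        keywordJoin.addField("id", "A1-K1");
        keywordJoin.addField("content_type", "articleKeywordJoin");
        keywordJoin.addField("keywordId", "K1");
        article.addChildDocument(keywordJoin);   // nests the join doc under the Article

        blockJoinDocs.add(article);

        // One add call per batch of block-join documents, with a long commitWithin
        // so that no soft commit is forced during bulk indexing.
        solrClient.add(blockJoinDocs, COMMIT_WITHIN_MS);

        solrClient.close();
    }
}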
Some more clarity on what could be happening will be very useful.

Thanks

On Fri, Jul 23, 2021 at 2:31 PM Pratik Patel <pra...@semandex.net> wrote:

> Hi All,
>
> *tl;dr* : running into long GC pauses and solr client socket timeouts
> when indexing a bulk of documents into solr. The commit strategy, in
> essence, is to do hard commits at an interval of 50k documents
> (maxDocs=50k) and to disable soft commits altogether during bulk
> indexing. Simple solr cloud setup with one node and one shard.
>
> *Details*:
> We have about 6 million documents which we are trying to index into solr.
> Of these, about 500k documents have a text field which holds Abstracts of
> scientific papers/Articles. We extract keywords from these Abstracts and
> we index these keywords as well into solr.
>
> We have a many-to-many kind of relationship between Articles and
> keywords. To store this, we have the following structure:
>
> Article documents
> Keyword documents
> Article-Keyword Join documents
>
> We use block join to index Articles with "Article-Keyword" join
> documents, and Keyword documents are indexed independently.
>
> In other words, we have blocks of "Article + Article-Keyword Joins" and
> we have Keyword documents (they hold some additional metadata about the
> keyword).
>
> We have a bulk processing operation which creates these documents and
> indexes them into solr. During this bulk indexing, we don't need
> documents to be searchable. We need to search against them only after
> ALL the documents are indexed.
>
> *Based on this, this is our current strategy.*
> Soft commits are disabled and hard commits are done at an interval of
> 50k documents with openSearcher=false. Our code triggers explicit
> commits 4 times after various stages of bulk indexing. Transaction logs
> are enabled and have default settings.
>
> <autoCommit>
>   <maxTime>${solr.autoCommit.maxTime:-1}</maxTime>
>   <maxDocs>${solr.autoCommit.maxDocs:50000}</maxDocs>
>   <openSearcher>false</openSearcher>
> </autoCommit>
>
> <autoSoftCommit>
>   <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
> </autoSoftCommit>
>
> Other Environmental Details:
> Xms=8g and Xmx=14g, solr client socketTimeout=7 minutes and
> zkClientTimeout=2 minutes.
> Our indexing operation triggers many "add" operations in parallel using
> RxJava (15 to 30 threads); each "add" operation is passed about 1000
> documents.
>
> Currently, when we run this indexing operation, we notice that after a
> while solr goes into long GC pauses (longer than our socketTimeout of 7
> minutes) and we get SocketTimeoutExceptions.
>
> *What could be causing such long GC pauses?*
>
> *Does this commit strategy make sense? If not, what is the recommended
> strategy that we can look into?*
>
> *Any help on this is much appreciated. Thanks.*
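For completeness, this is roughly how the parallel adds mentioned above are wired up. It is a simplified sketch with hypothetical names (RxJava 3, around 20 parallel rails for the 15 to 30 threads mentioned, batches of roughly 1000 documents each), not our exact production code.

import java.util.List;

import io.reactivex.rxjava3.core.Flowable;
import io.reactivex.rxjava3.schedulers.Schedulers;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ParallelAddSketch {

    // same long commitWithin as in the main message (6 hours)
    private static final int COMMIT_WITHIN_MS = 6 * 60 * 60 * 1000;

    // batches: each inner list holds roughly 1000 block-join documents
    static void indexInParallel(SolrClient solrClient,
                                List<List<SolrInputDocument>> batches) {
        Flowable.fromIterable(batches)
                .parallel(20)                          // 15 to 30 concurrent adds
                .runOn(Schedulers.io())                // each add runs on an IO thread
                .map(batch -> solrClient.add(batch, COMMIT_WITHIN_MS))
                .sequential()
                .blockingSubscribe(
                        response -> { },               // one UpdateResponse per batch
                        error -> System.err.println("add failed: " + error));
    }
}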