There were some big changes related to child document indexing in Solr 8.8, under this ticket: https://issues.apache.org/jira/browse/SOLR-14923 It's worth upgrading Solr to the latest 8.8 release and trying again; perhaps your indexing issue has already been fixed.
On 2021/07/27 19:44:13, Pratik Patel <pra...@semandex.net> wrote:
> So it looks like I have narrowed down where the problem is and have also
> found a workaround, but I would like to understand more.
>
> As I had mentioned, we have two stages in our bulk indexing operation.
>
> stage 1: index Article documents [A1, A2, ..., An]
> stage 2: index Article documents with children [A1 with children, A2 with
> children, ..., An with children]
>
> We were always running into issues in stage 2.
> After some time in stage 2, *solrClient.add(<setOfBlockJoinDocs>,
> commitWithin)* starts to time out, and then these timeouts happen
> consistently. Even a socketTimeout of 30 minutes was exceeded by the add
> call, and we got a SocketTimeoutException.
>
> We have set commitWithin to 6 hours to avoid unnecessary soft commits.
> The auto commit interval is 1 minute with openSearcher=false, and the
> autoSoftCommit interval is 5 minutes.
>
> As mentioned above, we first index just the Articles in stage 1, and then
> in stage 2 the same set of Articles is indexed with children (block join).
> I had a suspicion that the huge amount of time taken by the
> *solrClient.add* call could have something to do with the *block join
> updates* that take place in stage 2. Adding fresh joins of Articles with
> children to an empty collection was much faster and ran without a
> SocketTimeoutException. So I modified our indexing pipeline to be as
> follows.
>
> 1. stage 1: index Article documents [A1, A2, ..., An]
> 2. delete all the Article documents
> 3. stage 2: index Article documents with children [A1 with children, A2
> with children, ..., An with children]
>
> With this change, stage 2 becomes a simple *add operation and not an
> update operation*. I tested the bulk indexing with this change and it
> finished successfully, without any issues, in a shorter time period!
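[Editor's note: for reference, an "Article with children" block of the kind described above might look like the following in Solr's JSON update format. The field names are illustrative guesses, not taken from the thread; `_childDocuments_` is Solr's classic anonymous-children syntax for block-join indexing.]

```json
[
  {
    "id": "A1",
    "type_s": "Article",
    "title_t": "Some article title",
    "_childDocuments_": [
      { "id": "A1-K1", "type_s": "ArticleKeywordJoin", "keywordId_s": "K1" },
      { "id": "A1-K2", "type_s": "ArticleKeywordJoin", "keywordId_s": "K2" }
    ]
  }
]
```

The key constraint in block-join indexing is that the parent and all of its children must be sent together as one block, which is why stage 2 re-sends each Article alongside its join documents.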
> It will be very helpful to know the difference between:
> A: adding a document with children when the collection does not already
> have the same document
> B: adding a document with children when the collection already has the
> same document without children
>
> I understand that an *update* takes place in B, but how can we explain
> such a difference in performance between A and B?
>
> Please note that we use RxJava and call solrClient.add() in parallel
> threads with a set of Article documents, and the socketTimeout issue seems
> to pop up after we have already indexed about 90% of the documents.
>
> Some more clarity on what could be happening will be very useful.
>
> Thanks
>
> On Fri, Jul 23, 2021 at 2:31 PM Pratik Patel <pra...@semandex.net> wrote:
> >
> > Hi All,
> >
> > *tl;dr*: we are running into long GC pauses and Solr client socket
> > timeouts when indexing a bulk of documents into Solr. The commit
> > strategy, in essence, is to do hard commits at an interval of 50k
> > documents (maxDocs=50k) and disable soft commits altogether during bulk
> > indexing. Simple SolrCloud setup with one node and one shard.
> >
> > *Details*:
> > We have about 6 million documents which we are trying to index into
> > Solr. Of these, about 500k documents have a text field which holds the
> > Abstracts of scientific papers/Articles. We extract keywords from these
> > Abstracts and index these keywords into Solr as well.
> >
> > We have a many-to-many kind of relationship between Articles and
> > keywords. To store this, we have the following structure:
> >
> > Article documents
> > Keyword documents
> > Article-Keyword Join documents
> >
> > We use block join to index Articles with "Article-Keyword" join
> > documents, and Keyword documents are indexed independently.
> >
> > In other words, we have blocks of "Article + Article-Keyword Joins" and
> > we have Keyword documents (they hold some additional metadata about the
> > keyword).
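[Editor's note: the delete-then-re-add workaround described earlier in the thread could be sketched in SolrJ roughly as below. The collection name, ZooKeeper address, query, and field names are illustrative assumptions, and this needs a running Solr instance; it is a sketch of the approach, not the poster's actual code.]

```java
import java.util.List;
import java.util.Optional;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BulkReindexSketch {
    // commitWithin of 6 hours, as described in the thread
    private static final int COMMIT_WITHIN_MS = 6 * 60 * 60 * 1000;

    public static void main(String[] args) throws Exception {
        try (SolrClient client = new CloudSolrClient.Builder(
                List.of("localhost:2181"), Optional.empty()).build()) {

            // Stage 1 indexed plain Article docs earlier. Before stage 2,
            // delete them so the block-join adds are fresh inserts, not
            // updates of existing documents.
            client.deleteByQuery("articles", "type_s:Article");
            client.commit("articles");

            // Stage 2: re-add each Article together with its children so
            // the whole block is indexed as one unit.
            SolrInputDocument article = new SolrInputDocument();
            article.addField("id", "A1");
            article.addField("type_s", "Article");

            SolrInputDocument join = new SolrInputDocument();
            join.addField("id", "A1-K1");
            join.addField("type_s", "ArticleKeywordJoin");
            article.addChildDocument(join);

            client.add("articles", List.of(article), COMMIT_WITHIN_MS);
        }
    }
}
```

As for why B (re-adding a parent that already exists without children) is so much slower than A: in Solr, an "update" of an existing document is a delete-plus-reinsert, so stage 2 in the original pipeline was carrying deletes for every Article in addition to writing the new blocks, which the workaround avoids.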
> >
> > We have a bulk processing operation which creates these documents and
> > indexes them into Solr. During this bulk indexing, we don't need the
> > documents to be searchable. We need to search against them only after
> > ALL the documents are indexed.
> >
> > *Based on this, this is our current strategy:*
> > Soft commits are disabled, and hard commits are done at an interval of
> > 50k documents with openSearcher=false. Our code triggers explicit
> > commits 4 times after various stages of bulk indexing. Transaction logs
> > are enabled and have default settings.
> >
> > <autoCommit>
> >   <maxTime>${solr.autoCommit.maxTime:-1}</maxTime>
> >   <maxDocs>${solr.autoCommit.maxDocs:50000}</maxDocs>
> >   <openSearcher>false</openSearcher>
> > </autoCommit>
> >
> > <autoSoftCommit>
> >   <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
> > </autoSoftCommit>
> >
> > Other environmental details:
> > Xms=8g and Xmx=14g, Solr client socketTimeout=7 minutes, and
> > zkClientTimeout=2 minutes.
> > Our indexing operation triggers many "add" operations in parallel using
> > RxJava (15 to 30 threads); each "add" operation is passed about 1000
> > documents.
> >
> > Currently, when we run this indexing operation, we notice that after a
> > while Solr goes into long GC pauses (longer than our socketTimeout of 7
> > minutes) and we get SocketTimeoutExceptions.
> >
> > *What could be causing such long GC pauses?*
> >
> > *Does this commit strategy make sense? If not, what is the recommended
> > strategy that we can look into?*
> >
> > *Any help on this is much appreciated. Thanks.*
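[Editor's note: one commonly recommended adjustment for long GC pauses in Solr, offered here as a general operational practice rather than something verified against this specific setup, is to set Xms equal to Xmx so the heap never resizes under indexing load, and to enable GC logging so the pauses can actually be diagnosed. In solr.in.sh that could look like the following; the log path is an illustrative assumption.]

```shell
# Fix the heap size up front; growing the heap from 8g toward 14g under
# heavy indexing load can itself contribute to long GC cycles.
SOLR_JAVA_MEM="-Xms14g -Xmx14g"

# Unified GC logging (Java 9+); adjust the file path for your install.
GC_LOG_OPTS="-Xlog:gc*:file=/var/solr/logs/solr_gc.log:time,uptime:filecount=9,filesize=20M"
```

With 15 to 30 parallel add calls of ~1000 documents each, a large volume of in-flight documents sits on the heap between the infrequent commits, so GC logs are the first place to look before tuning further.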