So it looks like I have narrowed down where the problem is and have also
found a workaround, but I would like to understand it better.

As I had mentioned, we have two stages in our bulk indexing operation.

stage 1 : index Article documents [A1, A2.....An]
stage 2 : index Article documents with children [A1 with children, A2 with
children......An with children]

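(For reference, below is a minimal sketch of how one stage 2 block is
built with SolrJ; the field names "type_s" and "keywordId_s" and the
helper name are placeholders, not our real schema.)

    import java.util.Arrays;
    import org.apache.solr.common.SolrInputDocument;

    // Sketch of one stage 2 block: an Article parent with one
    // Article-Keyword join child. Field names are placeholders.
    static SolrInputDocument buildArticleBlock(String articleId, String keywordId) {
        SolrInputDocument article = new SolrInputDocument();
        article.addField("id", articleId);
        article.addField("type_s", "Article");

        SolrInputDocument join = new SolrInputDocument();
        join.addField("id", articleId + "-" + keywordId);
        join.addField("type_s", "ArticleKeywordJoin");
        join.addField("keywordId_s", keywordId);

        // Attaching the child makes Solr index the whole block contiguously.
        article.addChildDocuments(Arrays.asList(join));
        return article;
    }
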
We were always running into issues in stage 2.
After some time in stage 2, *solrClient.add( <setOfBlockJoinDocs>,
commitWithin )* starts to time out, and from that point on the timeouts
happen consistently. Even a socketTimeout of 30 minutes was exceeded by
the add call, and we got a SocketTimeoutException.

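(For concreteness, the failing call is essentially the sketch below. The
base URL and method names are illustrative and the builder details may
vary with the SolrJ version; the 30 minute socketTimeout is set on the
client builder and commitWithin is passed in milliseconds.)

    import java.io.IOException;
    import java.util.Collection;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    // Simplified sketch of the stage 2 add call that starts to time out.
    static SolrClient buildClient() {
        return new HttpSolrClient.Builder("http://localhost:8983/solr/articles")
                .withSocketTimeout(30 * 60 * 1000)   // 30 minutes
                .build();
    }

    static void addBlocks(SolrClient solrClient,
                          Collection<SolrInputDocument> setOfBlockJoinDocs,
                          int commitWithinMs)
            throws SolrServerException, IOException {
        solrClient.add(setOfBlockJoinDocs, commitWithinMs);  // this call times out
    }
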
We have set commitWithin to 6 hours to avoid unnecessary soft commits.
The autoCommit interval is 1 minute with openSearcher=false, and the
autoSoftCommit interval is 5 minutes.

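(In solrconfig.xml terms, that corresponds roughly to the following;
1 minute = 60000 ms and 5 minutes = 300000 ms.)

    <autoCommit>
      <maxTime>60000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>

    <autoSoftCommit>
      <maxTime>300000</maxTime>
    </autoSoftCommit>
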
As mentioned above, we first index just the Articles in stage 1, and then
in stage 2 the same set of Articles is indexed with children (block join).
I suspected that the huge amount of time taken by the *solrClient.add*
call could have something to do with the *block join updates* that take
place in stage 2. Adding fresh blocks of Articles with children to an
empty collection was much faster and ran without SocketTimeoutExceptions.
So I modified our indexing pipeline as follows.

1. stage 1 : index Article documents [A1, A2.....An]
2. delete all the Article documents
3. stage 2 : index Article documents with children [A1 with children, A2
with children......An with children]

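(The delete in step 2 is just a delete-by-query on the Article documents;
a minimal sketch, assuming a type field and an explicit hard commit after
the delete.)

    import java.io.IOException;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrServerException;

    // Step 2 of the modified pipeline: remove the stage 1 Articles so that
    // stage 2 adds fresh blocks instead of updating existing documents.
    // The query is illustrative; we match on a type field.
    static void deleteStageOneArticles(SolrClient solrClient)
            throws SolrServerException, IOException {
        solrClient.deleteByQuery("type_s:Article");
        solrClient.commit();   // hard commit so stage 2 starts from a clean index
    }
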
With this change, stage 2 is a simple *add* operation rather than an
*update* operation. I tested the bulk indexing with this change and it
finished successfully, without any issues, in a shorter time period!

It would be very helpful to know what the difference is between
A: adding a document with children when the collection does not already
have the same document, and
B: adding a document with children when the collection already has the
same document without children.

I understand that an *update* takes place in B, but how can we explain
such a difference in performance between A and B?

Please note that we use RxJava and call solrClient.add() from parallel
threads, each with a set of Article documents, and the socketTimeout
issue seems to pop up only after we have already indexed about 90% of the
documents.

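(A rough sketch of how the parallel adds are issued; RxJava 3 imports,
the parallelism of 20 and the batch size of 1000 are illustrative, taken
from the numbers in my earlier mail.)

    import io.reactivex.rxjava3.core.Flowable;
    import io.reactivex.rxjava3.schedulers.Schedulers;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.common.SolrInputDocument;

    // Rough sketch of the parallel indexing loop: batches of Article block
    // documents are sent to Solr from multiple threads.
    static void indexInParallel(SolrClient solrClient,
                                List<SolrInputDocument> allBlocks,
                                int commitWithinMs) {
        Flowable.fromIterable(allBlocks)
                .buffer(1000)                  // ~1000 block documents per add() call
                .parallel(20)                  // number of concurrent add() calls
                .runOn(Schedulers.io())
                .map(batch -> solrClient.add(batch, commitWithinMs))
                .sequential()
                .blockingSubscribe();
    }
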
Some more clarity on what could be happening would be very useful.

Thanks

On Fri, Jul 23, 2021 at 2:31 PM Pratik Patel <pra...@semandex.net> wrote:

> Hi All,
>
> *tl;dr* : running into long GC pauses and Solr client socket timeouts
> when bulk indexing documents into Solr. The commit strategy, in essence, is
> to do hard commits at an interval of 50k documents (maxDocs=50k) and to
> disable soft commits altogether during bulk indexing. Simple SolrCloud
> setup with one node and one shard.
>
> *Details*:
> We have about 6 million documents which we are trying to index into Solr.
> Of these, about 500k documents have a text field which holds Abstracts of
> scientific papers/Articles. We extract keywords from these Abstracts and we
> index these keywords into Solr as well.
>
> We have a many-to-many kind of relationship between Articles and Keywords.
> To store this, we have the following structure.
>
> Article documents
> Keyword documents
> Article-Keyword Join documents
>
> We use block join to index Articles with "Article-Keyword" join documents
> and Keyword documents are indexed independently.
>
> In other words, we have blocks of "Article + Article-Keyword Joins" and we
> have Keyword documents (they hold some additional metadata about the keyword).
>
> We have a bulk processing operation which creates these documents and
> indexes them into solr. During this bulk indexing, we don't need documents
> to be searchable. We need to search against them only after ALL the
> documents are indexed.
>
> *Based on this, this is our current strategy. *
> Soft commits are disabled and Hard commits are done at an interval of 50k
> documents with openSearcher=false. Our code triggers explicit commits 4
> times after various stages of bulk indexing. Transaction logs are enabled
> and have default settings.
>
>     <autoCommit>
>       <maxTime>${solr.autoCommit.maxTime:-1}</maxTime>
>       <maxDocs>${solr.autoCommit.maxDocs:50000}</maxDocs>
>       <openSearcher>false</openSearcher>
>     </autoCommit>
>
>     <autoSoftCommit>
>       <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
>     </autoSoftCommit>
>
> Other Environmental Details:
> Xms=8g and Xmx=14g, Solr client socketTimeout=7 minutes and
> zkClientTimeout=2 minutes.
> Our indexing operation triggers many "add" operations in parallel using
> RxJava (15 to 30 threads); each "add" operation is passed about 1000
> documents.
>
> Currently, when we run this indexing operation, we notice that after a
> while Solr goes into long GC pauses (longer than our socketTimeout of 7
> minutes) and we get SocketTimeoutExceptions.
>
> *What could be causing such long GC pauses?*
>
> *Does this commit strategy make sense? If not, what is the recommended
> strategy that we can look into?*
>
> *Any help on this is much appreciated. Thanks.*
>
>
