Hi Rekha,

Do you also have query load while indexing? Have you tried the TLOG + PULL replica types?
https://solr.apache.org/guide/8_4/shards-and-indexing-data-in-solrcloud.html#types-of-replicas
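As a sketch only (host and port below are placeholders), you could add a PULL replica to the existing collection via the Collections API, then prefer it for searches so heavy indexing on the leader competes less with queries:

    # Add a PULL replica (copies the index from the leader, serves queries only)
    curl "http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=datacore&shard=shard1&type=PULL"

    # On the query side, prefer PULL replicas when routing requests
    # e.g. append: &shards.preference=replica.type:PULL

Note that for a full TLOG + PULL setup the existing NRT replicas would need to be replaced with TLOG ones (ADDREPLICA with type=TLOG, then DELETEREPLICA the NRT ones), since a replica's type cannot be changed in place.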
Thanks,
Wei

On Thu, Apr 22, 2021 at 11:27 PM Rekha Sekhar <rekhaa.sek...@gmail.com> wrote:

> Hi,
>
> Gentle reminder... your advice on this would be highly appreciated.
>
> Thanks
> Rekha
>
> On Thu, 22 Apr, 2021, 1:13 PM Rekha Sekhar, <rekhaa.sek...@gmail.com> wrote:
>
> > Hi,
> >
> > We are experiencing heavy slowness on updates in our SolrCloud
> > implementation. We are using one shard with two replicas, plus 3
> > ZooKeeper nodes. The Solr version is 8.7.0 and the ZK version is 3.6.2.
> >
> > Every day we have some heavy update batches (100,000 to 500,000 updates
> > processed in parallel), which include deletes, adds, and updates.
> > We have a total of 2.2M records indexed.
> >
> > Every time these updates/deletes run, we see a lot of 'Reordered DBQs
> > detected' messages; processing eventually becomes very slow, and a
> > single update/delete request goes from ~100ms to as long as 30 minutes
> > to complete.
> > At the same time we start to see error messages such as "Task queue
> > processing has stalled for 115231 ms with 100 remaining elements to
> > process", "Idle timeout expired: 120000/120000 ms",
> > "cancel_stream_error", etc.
> > Sometimes one node goes into recovery and comes back after some time.
> >
> > 2021-04-21 18:21:08.219 INFO (qtp938463537-5550) [c:datacore s:shard1
> > r:core_node4 x:datacore_shard1_replica_n2] o.a.s.u.DirectUpdateHandler2
> > Reordered DBQs detected.
> > Update=add{_version_=1697674936592629760,id=S-5942167-P-108089342-F-800102562-E-180866483}
> > DBQs=[DBQ{version=1697674943496454144,q=store_element_id:395699},
> > DBQ{version=1697674943408373760,q=store_element_id:395698},
> > DBQ{version=1697674943311904768,q=store_element_id:395678},
> > DBQ{version=1697674943221727232,q=store_element_id:395649},
> > DBQ{version=1697674943143084032,q=store_element_id:395642},
> > DBQ{version=1697674943049760768,q=store_element_id:395612},
> > DBQ{version=1697674942964826112,q=store_element_id:395602},
> > DBQ{version=1697674942871502848,q=store_element_id:395587},
> > DBQ{version=1697674942790762496,q=store_element_id:395582},
> > DBQ{version=1697674942711070720,q=store_element_id:395578},
> > DBQ{version=1697674942622990336,q=store_element_id:199511},
> > DBQ{version=1697674942541201408,q=store_element_id:199508},
> > DBQ{version=1697674942452072448,q=store_element_id:397242},
> > DBQ{version=1697674942356652032,q=store_element_id:397194},
> > DBQ{version=1697674942268571648,q=store_element_id:397166},
> > DBQ{version=1697674942178394112,q=store_element_id:397164},
> > DBQ{version=1697674942014816256,q=store_element_id:397149},
> > DBQ{version=1697674941901570048,q=store_element_id:395758},
> > DBQ{version=1697674941790420992,q=store_element_id:395725},
> > DBQ{version=1697674941723312128,q=store_element_id:395630},
> >
> > 2021-04-21 18:30:23.636 INFO
> > (recoveryExecutor-11-thread-5-processing-n:solr-1.solrcluster:8983_solr
> > x:datacore_shard1_replica_n2 c:datacore s:shard1 r:core_node4) [c:datacore
> > s:shard1 r:core_node4 x:datacore_shard1_replica_n2] o.a.s.c.RecoveryStrategy
> > PeerSync Recovery was not successful - trying replication.
> > 2021-04-21 18:30:23.636 INFO
> > (recoveryExecutor-11-thread-5-processing-n:solr-1.solrcluster:8983_solr
> > x:datacore_shard1_replica_n2 c:datacore s:shard1 r:core_node4) [c:datacore
> > s:shard1 r:core_node4 x:datacore_shard1_replica_n2] o.a.s.c.RecoveryStrategy
> > Starting Replication Recovery.
> >
> > Below are the autoCommit and autoSoftCommit values used:
> >
> > <autoCommit>
> >   <maxTime>${solr.autoCommit.maxTime:90000}</maxTime>
> >   <openSearcher>false</openSearcher>
> > </autoCommit>
> > <autoSoftCommit>
> >   <maxTime>${solr.autoSoftCommit.maxTime:15000}</maxTime>
> > </autoSoftCommit>
> >
> > Here are the GC logs for reference:
> >
> > [2021-04-21T18:28:32.296+0000][408073.843s] GC(541) User=0.07s Sys=0.01s Real=0.04s
> > [2021-04-21T18:28:32.296+0000][408073.843s] GC(541) Pause Young (Normal) (G1 Evacuation Pause) 7849M->1725M(10240M) 39.247ms
> > [2021-04-21T18:28:32.296+0000][408073.843s] GC(541) Metaspace: 85486K->85486K(1126400K)
> > [2021-04-21T18:28:32.296+0000][408073.843s] GC(541) Humongous regions: 15->15
> > [2021-04-21T18:28:32.296+0000][408073.843s] GC(541) Old regions: 412->412
> > [2021-04-21T18:28:32.296+0000][408073.843s] GC(541) Survivor regions: 5->5(192)
> > [2021-04-21T18:28:32.296+0000][408073.843s] GC(541) Eden regions: 1531->0(1531)
> > [2021-04-21T18:28:32.296+0000][408073.843s] GC(541) Other: 0.5ms
> > [2021-04-21T18:28:32.296+0000][408073.843s] GC(541) Post Evacuate Collection Set: 6.1ms
> > [2021-04-21T18:28:32.296+0000][408073.843s] GC(541) Evacuate Collection Set: 32.5ms
> > [2021-04-21T18:28:32.296+0000][408073.843s] GC(541) Pre Evacuate Collection Set: 0.1ms
> > [2021-04-21T18:28:32.257+0000][408073.804s] GC(541) Using 2 workers of 2 for evacuation
> > [2021-04-21T18:28:32.257+0000][408073.804s] GC(541) Pause Young (Normal) (G1 Evacuation Pause)
> >
> > Would really appreciate any help on this.
> >
> > Thanks,
> > Rekha
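P.S. Since the stalls coincide with the 'Reordered DBQs detected' messages, it may also be worth avoiding delete-by-query during the heavy update windows: DBQs that arrive out of order force the update log to re-apply them against concurrent adds, which is expensive, whereas deletes by ID are cheap per-document version checks. A rough sketch (collection and field names taken from your logs; the IDs in the delete are placeholders):

    # 1) Resolve the IDs the query would have matched
    curl "http://localhost:8983/solr/datacore/select?q=store_element_id:395699&fl=id&rows=1000"

    # 2) Delete those IDs explicitly via the JSON update API,
    #    instead of issuing a delete-by-query
    curl -X POST "http://localhost:8983/solr/datacore/update" \
         -H "Content-Type: application/json" \
         -d '{"delete": ["<id-1>", "<id-2>"]}'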