Not sure about the reordered DBQs specifically, but DBQs do not play well with concurrent updates: a delete-by-query has to block all updates in order to make sure all replicas delete the same set of documents. That can result in queued updates, recovery, and even OOM. Here is an older blog post explaining that: https://www.od-bits.com/2018/03/dbq-or-delete-by-query.html. It would be best to avoid DBQ and convert it to query + DBID: run the query to collect the matching document ids, then delete those documents by id.
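A minimal SolrJ sketch of that conversion is below (assumptions from your logs: collection name datacore, field store_element_id, Solr URL solr-1.solrcluster:8983, and that "id" is the string unique key; batch size, commits and error handling would need tuning for your setup):

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class DeleteByIdInsteadOfDbq {

    // Instead of client.deleteByQuery("store_element_id:395699"),
    // collect the matching ids page by page and delete them by id.
    static void deleteMatching(CloudSolrClient client, String queryStr) throws Exception {
        SolrQuery query = new SolrQuery(queryStr);
        query.setFields("id");
        query.setRows(1000);
        query.setSort("id", SolrQuery.ORDER.asc);   // cursorMark requires a sort on the unique key

        String cursorMark = CursorMarkParams.CURSOR_MARK_START;
        boolean done = false;
        while (!done) {
            query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
            QueryResponse resp = client.query(query);

            List<String> ids = new ArrayList<>();
            for (SolrDocument doc : resp.getResults()) {
                ids.add((String) doc.getFieldValue("id"));   // assumes the unique key is a string field named "id"
            }
            if (!ids.isEmpty()) {
                client.deleteById(ids);   // delete-by-id does not block other updates
            }

            String nextCursorMark = resp.getNextCursorMark();
            done = cursorMark.equals(nextCursorMark);
            cursorMark = nextCursorMark;
        }
    }

    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                List.of("http://solr-1.solrcluster:8983/solr")).build()) {
            client.setDefaultCollection("datacore");
            deleteMatching(client, "store_element_id:395699");
            // rely on autoCommit/autoSoftCommit, or commit explicitly if you need the deletes visible sooner
        }
    }
}

The tradeoff is that documents matching the query but added after it runs will not be deleted, but delete-by-id does not have to block or reorder against concurrent updates.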
HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/

> On 23 Apr 2021, at 08:26, Rekha Sekhar <rekhaa.sek...@gmail.com> wrote:
>
> Hi,
>
> Gentle reminder... your advice on this would be highly appreciated.
>
> Thanks
> Rekha
>
> On Thu, 22 Apr, 2021, 1:13 PM Rekha Sekhar, <rekhaa.sek...@gmail.com> wrote:
>
>> Hi,
>>
>> We are experiencing heavy slowness on updates in our SolrCloud implementation.
>> We are using it as 1 shard with 2 replicas, and we have 3 ZooKeeper nodes.
>> The Solr version is 8.7.0 and the ZK version is 3.6.2.
>>
>> Every day we have some heavy updates (100,000 to 500,000 updates processed
>> in parallel), which include deletes, adds and updates.
>> We have a total of 2.2M records indexed.
>>
>> Every time these updates/deletes happen we see a lot of 'Reordered DBQs
>> detected' messages, and eventually processing becomes very, very slow: an
>> update/delete request goes from 100ms to 30 minutes to complete.
>> At the same time we start to see error messages such as "Task queue
>> processing has stalled for 115231 ms with 100 remaining elements to process",
>> "Idle timeout expired: 120000/120000 ms", "cancel_stream_error", etc.
>> Sometimes one node goes into a recovery state and recovers after some time.
>>
>> 2021-04-21 18:21:08.219 INFO (qtp938463537-5550) [c:datacore s:shard1
>> r:core_node4 x:datacore_shard1_replica_n2] o.a.s.u.DirectUpdateHandler2
>> Reordered DBQs detected.
>> Update=add{_version_=1697674936592629760,id=S-5942167-P-108089342-F-800102562-E-180866483}
>> DBQs=[DBQ{version=1697674943496454144,q=store_element_id:395699},
>> DBQ{version=1697674943408373760,q=store_element_id:395698},
>> DBQ{version=1697674943311904768,q=store_element_id:395678},
>> DBQ{version=1697674943221727232,q=store_element_id:395649},
>> DBQ{version=1697674943143084032,q=store_element_id:395642},
>> DBQ{version=1697674943049760768,q=store_element_id:395612},
>> DBQ{version=1697674942964826112,q=store_element_id:395602},
>> DBQ{version=1697674942871502848,q=store_element_id:395587},
>> DBQ{version=1697674942790762496,q=store_element_id:395582},
>> DBQ{version=1697674942711070720,q=store_element_id:395578},
>> DBQ{version=1697674942622990336,q=store_element_id:199511},
>> DBQ{version=1697674942541201408,q=store_element_id:199508},
>> DBQ{version=1697674942452072448,q=store_element_id:397242},
>> DBQ{version=1697674942356652032,q=store_element_id:397194},
>> DBQ{version=1697674942268571648,q=store_element_id:397166},
>> DBQ{version=1697674942178394112,q=store_element_id:397164},
>> DBQ{version=1697674942014816256,q=store_element_id:397149},
>> DBQ{version=1697674941901570048,q=store_element_id:395758},
>> DBQ{version=1697674941790420992,q=store_element_id:395725},
>> DBQ{version=1697674941723312128,q=store_element_id:395630},
>>
>> 2021-04-21 18:30:23.636 INFO
>> (recoveryExecutor-11-thread-5-processing-n:solr-1.solrcluster:8983_solr
>> x:datacore_shard1_replica_n2 c:datacore s:shard1 r:core_node4) [c:datacore
>> s:shard1 r:core_node4 x:datacore_shard1_replica_n2] o.a.s.c.RecoveryStrategy
>> PeerSync Recovery was not successful - trying replication.
>> 2021-04-21 18:30:23.636 INFO
>> (recoveryExecutor-11-thread-5-processing-n:solr-1.solrcluster:8983_solr
>> x:datacore_shard1_replica_n2 c:datacore s:shard1 r:core_node4) [c:datacore
>> s:shard1 r:core_node4 x:datacore_shard1_replica_n2] o.a.s.c.RecoveryStrategy
>> Starting Replication Recovery.
>>
>> Below are the autoCommit and autoSoftCommit values used:
>>
>> <autoCommit>
>>   <maxTime>${solr.autoCommit.maxTime:90000}</maxTime>
>>   <openSearcher>false</openSearcher>
>> </autoCommit>
>> <autoSoftCommit>
>>   <maxTime>${solr.autoSoftCommit.maxTime:15000}</maxTime>
>> </autoSoftCommit>
>>
>> Here are the GC logs for reference:
>>
>> [2021-04-21T18:28:32.296+0000][408073.843s] GC(541) User=0.07s Sys=0.01s Real=0.04s
>> [2021-04-21T18:28:32.296+0000][408073.843s] GC(541) Pause Young (Normal) (G1 Evacuation Pause) 7849M->1725M(10240M) 39.247ms
>> [2021-04-21T18:28:32.296+0000][408073.843s] GC(541) Metaspace: 85486K->85486K(1126400K)
>> [2021-04-21T18:28:32.296+0000][408073.843s] GC(541) Humongous regions: 15->15
>> [2021-04-21T18:28:32.296+0000][408073.843s] GC(541) Old regions: 412->412
>> [2021-04-21T18:28:32.296+0000][408073.843s] GC(541) Survivor regions: 5->5(192)
>> [2021-04-21T18:28:32.296+0000][408073.843s] GC(541) Eden regions: 1531->0(1531)
>> [2021-04-21T18:28:32.296+0000][408073.843s] GC(541) Other: 0.5ms
>> [2021-04-21T18:28:32.296+0000][408073.843s] GC(541) Post Evacuate Collection Set: 6.1ms
>> [2021-04-21T18:28:32.296+0000][408073.843s] GC(541) Evacuate Collection Set: 32.5ms
>> [2021-04-21T18:28:32.296+0000][408073.843s] GC(541) Pre Evacuate Collection Set: 0.1ms
>> [2021-04-21T18:28:32.257+0000][408073.804s] GC(541) Using 2 workers of 2 for evacuation
>> [2021-04-21T18:28:32.257+0000][408073.804s] GC(541) Pause Young (Normal) (G1 Evacuation Pause)
>>
>> Would really appreciate any help on this.
>>
>> Thanks,
>> Rekha
>>