Hi Shawn,

I will try applying the changes you have suggested and get back to you on this.

On Thu, Jan 19, 2023 at 4:08 PM Rohit Walecha <rohi...@fnp.com> wrote:

> We have multiple collections inside our 3-node cluster. Some collections
> have replication factor 1 and some have replication factor 2. Could this
> be affecting our nodes, sending them into the recovery state and causing
> the restarts?
>
> On Wed, Jan 18, 2023 at 7:07 PM Rohit Walecha <rohi...@fnp.com> wrote:
>
>> [image: Screenshot from 2023-01-18 19-06-33.png]
>>
>> Restart pattern is above.
>>
>> On Wed, Jan 18, 2023 at 2:43 PM Rohit Walecha <rohi...@fnp.com> wrote:
>>
>>> Hi,
>>>
>>> We have a 3-node *Solr (8.8.0)* cluster deployed on multiple
>>> environments, connected to a 3-node *ZooKeeper (3.6.2)* cluster.
>>> We have been facing frequent restarts of SolrCloud nodes for the
>>> last few months. While debugging this and looking into the logs and
>>> other stats, we see that the node which has restarted says:
>>>
>>> *1.*
>>> 2023-01-04 21:50:09.186 WARN (zkConnectionManagerCallback-15-thread-1) [
>>> ] o.a.s.c.c.ConnectionManager Watcher
>>> org.apache.solr.common.cloud.ConnectionManager@731cf36d name:
>>> ZooKeeperConnection
>>> Watcher:apache-solrcloud-zookeeper-0.apache-solrcloud-zookeeper-headless.production.svc.cluster.local:2181,apache-solrcloud-zookeeper-1.apache-solrcloud-zookeeper-headless.production.svc.cluster.local:2181,apache-solrcloud-zookeeper-2.apache-solrcloud-zookeeper-headless.production.svc.cluster.local:2181/
>>> got event WatchedEvent state:Disconnected type:None path:null path: null
>>> type: None
>>> which probably means *the event state is either disconnected or expired*,
>>> and it is followed by this warning:
>>> WARN (zkConnectionManagerCallback-13-thread-1) [ ]
>>> o.a.s.c.c.ConnectionManager zkClient has disconnected
>>>
>>>
>>>
>>> *2.*
>>> Client session timed out, have not heard from server in 30018ms for
>>> sessionid 0x1000091fcbe0001
>>> This is a session timeout from the ZkClient inside Solr.
>>> *And 3.*
>>> 2023-01-04 21:50:10.685 INFO (ShutdownMonitor) [ ]
>>> o.a.s.c.ZkController Publish this node as DOWN...
>>> 2023-01-04 21:50:10.685 INFO (ShutdownMonitor) [ ] o.a.s.c.ZkController Publish
>>> node=apache-solrcloud-0.apache-solrcloud-headless.production:8983_solr as
>>> DOWN
>>> Attached *050120223-solr-cloud-0.log*
>>>
>>>
>>>
>>> *Meanwhile, the ZooKeeper node logs the following at the time the Solr
>>> node gets restarted:*
>>>
>>> 2023-01-15 07:11:44,349 [myid:2] - WARN
>>> [NIOWorkerThread-2:ZooKeeperServer@1384] - Connection request from old
>>> client /10.70.26.0:54584; will be dropped if server is in r-o mode
>>> 2023-01-15 07:11:44,350 [myid:2] - INFO
>>> [CommitProcessor:2:LearnerSessionTracker@116] - Committing global session
>>> 0x200042f19cf130f
>>> 2023-01-15 07:11:44,352 [myid:2] - INFO
>>> [RequestThrottler:QuorumZooKeeperServer@159] - Submitting global
>>> closeSession request for session 0x200042f19cf130f
>>>
>>>
>>> We are now at a point where *we know what pushes the node down when a
>>> Solr node restarts, as we can see in the logs at [#2]*: the client
>>> session timed out, and that session is the one established between the
>>> Solr node and ZooKeeper. While debugging, we also went through a series
>>> of issues reported against the *ZooKeeper* version we are using. In
>>> gist, they describe slow leader election, ZooKeeper nodes restarting,
>>> and the whole ZooKeeper cluster going down when a leader becomes
>>> unhealthy, is stopped, or is restarted. The new leader election takes a
>>> long time, and client sessions time out during that period.
>>>
>>> We tried to replicate this in a local environment by setting up a Solr
>>> and ZooKeeper cluster and forcefully restarting/stopping the leader
>>> ZooKeeper nodes. We got something like
>>> *have-not-heard-back-local-cluster.log* and could replicate [#2].
>>>
>>> We are seeking help to find out the possible reason for these
>>> frequent restarts of SolrCloud nodes.
>>>
>>> *Regards.*
>>>
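
To line up the timeouts from [#2] against the ZooKeeper server logs, it may help to pull every such message out of the Solr log first. Below is a minimal sketch, not something taken from Solr or ZooKeeper tooling. The function name `find_session_timeouts` and the regular expression are my own assumptions, based only on the message text quoted in [#2]. The pattern also accepts "session id" with a space, on the assumption that other client versions may spell it that way.

```python
import re

# Matches the ZooKeeper client timeout message quoted in [#2].
# "session ?id" tolerates both "sessionid" and "session id" (assumed variants).
TIMEOUT_RE = re.compile(
    r"Client session timed out, have not heard from server in "
    r"(?P<ms>\d+)ms for session ?id (?P<sid>0x[0-9a-fA-F]+)"
)


def find_session_timeouts(lines):
    """Return a (silence_ms, session_id) tuple for each timeout message found."""
    hits = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            hits.append((int(match.group("ms")), match.group("sid")))
    return hits


# Sample input: the message from [#2] plus an unrelated log line.
sample = [
    "Client session timed out, have not heard from server in 30018ms "
    "for sessionid 0x1000091fcbe0001",
    "o.a.s.c.c.ConnectionManager zkClient has disconnected",
]
print(find_session_timeouts(sample))
```

The function accepts any iterable of lines, so an open log file can be passed to it directly. The session ids it returns can then be searched for in the ZooKeeper logs around the same timestamps.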