It seems that ZooKeeper has stopped for some reason. It's worth clarifying where exactly the ZK instances are running, and checking that they run as standalone processes rather than embedded in other processes such as Solr. Then focus on stabilizing the ZK ensemble first; after that, let the Solr nodes connect to it one by one while watching CPU and RAM. Once Solr is running stably, it makes sense to try turning Kafka on.
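To check where each ZK instance is running and what role it holds, one option is ZooKeeper's `srvr` four-letter-word command. Its reply includes a `Mode:` line (`leader`, `follower`, or `standalone`). The Python sketch below assumes that `srvr` is reachable on the client port. In ZooKeeper 3.5+ four-letter words are gated by `4lw.commands.whitelist`, whose default contains only `srvr`. The sketch also assumes the ensemble has no observers. The script name and endpoints are hypothetical. A healthy 3-node ensemble should report exactly one leader and two followers. A node that reports `Mode: standalone` is running as a single server outside any ensemble.

```python
import socket
import sys
from collections import Counter
from typing import List, Tuple

DEFAULT_ZK_PORT = 2181


def parse_endpoint(arg: str) -> Tuple[str, int]:
    """Split 'host:port' into (host, port); the port defaults to 2181."""
    host, sep, port = arg.rpartition(":")
    if not sep:
        return arg, DEFAULT_ZK_PORT
    return host, int(port)


def send_four_letter_word(host: str, port: int, cmd: str = "srvr", timeout: float = 5.0) -> str:
    """Send a four-letter-word command and return the server's raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # ZooKeeper closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(srvr_output: str) -> str:
    """Return the value of the 'Mode:' line in `srvr` output, or 'unknown'."""
    for line in srvr_output.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return "unknown"  # e.g. the command is not in 4lw.commands.whitelist


def ensemble_looks_healthy(modes: List[str]) -> bool:
    """True if exactly one node is leader and every other node is a follower.

    Assumes an ensemble without observers."""
    counts = Counter(modes)
    return counts["leader"] == 1 and counts["follower"] == len(modes) - 1


def main(endpoints: List[str]) -> None:
    if not endpoints:
        print("usage: zk_modes.py host:port [host:port ...]")
        return
    modes = []
    for arg in endpoints:
        host, port = parse_endpoint(arg)
        try:
            mode = parse_mode(send_four_letter_word(host, port))
        except OSError as exc:  # connection refused, timeout, DNS failure
            mode = f"unreachable ({exc})"
        print(f"{host}:{port} -> {mode}")
        modes.append(mode)
    print("ensemble looks healthy:", ensemble_looks_healthy(modes))


if __name__ == "__main__":
    main(sys.argv[1:])
```

Run it from somewhere that can reach all three ZK pods, for example `python zk_modes.py zk-0:2181 zk-1:2181 zk-2:2181`. Repeat it over time. A node that flips between `unreachable` and `follower`, or an ensemble with no leader, points at ZK instability rather than at Solr.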
On Thu, Jan 19, 2023 at 12:15 AM Rohit Walecha <rohi...@fnp.com> wrote:
> [image: Screenshot from 2023-01-18 19-06-33.png]
>
> The restart pattern is shown above.
>
> On Wed, Jan 18, 2023 at 2:43 PM Rohit Walecha <rohi...@fnp.com> wrote:
>
>> Hi,
>>
>> We have a 3-node *Solr (8.8.0)* cluster, deployed in multiple environments, connected to a 3-node *ZooKeeper (3.6.2)* ensemble. For the last few months the SolrCloud nodes have been restarting frequently. While debugging this through the logs and other stats, we found that the node that restarted logs the following:
>>
>> 1.
>> 2023-01-04 21:50:09.186 WARN (zkConnectionManagerCallback-15-thread-1) [ ] o.a.s.c.c.ConnectionManager Watcher org.apache.solr.common.cloud.ConnectionManager@731cf36d name: ZooKeeperConnection Watcher:apache-solrcloud-zookeeper-0.apache-solrcloud-zookeeper-headless.production.svc.cluster.local:2181,apache-solrcloud-zookeeper-1.apache-solrcloud-zookeeper-headless.production.svc.cluster.local:2181,apache-solrcloud-zookeeper-2.apache-solrcloud-zookeeper-headless.production.svc.cluster.local:2181/ got event WatchedEvent state:Disconnected type:None path:null path: null type: None
>>
>> As we understand it, this means the connection state is either Disconnected or Expired. It is followed by this warning:
>>
>> WARN (zkConnectionManagerCallback-13-thread-1) [ ] o.a.s.c.c.ConnectionManager zkClient has disconnected
>>
>> 2.
>> Client session timed out, have not heard from server in 30018ms for sessionid 0x1000091fcbe0001
>>
>> This is a session timeout from the ZkClient inside Solr.
>>
>> 3.
>> 2023-01-04 21:50:10.685 INFO (ShutdownMonitor) [ ] o.a.s.c.ZkController Publish this node as DOWN...
>> 2023-01-04 21:50:10.685 INFO (ShutdownMonitor) [ ] o.a.s.c.ZkController Publish node=apache-solrcloud-0.apache-solrcloud-headless.production:8983_solr as DOWN
>>
>> Attached: *050120223-solr-cloud-0.log*
>>
>> Meanwhile, the ZooKeeper node logs the following at the time the Solr node gets restarted:
>>
>> 2023-01-15 07:11:44,349 [myid:2] - WARN [NIOWorkerThread-2:ZooKeeperServer@1384] - Connection request from old client /10.70.26.0:54584; will be dropped if server is in r-o mode
>> 2023-01-15 07:11:44,350 [myid:2] - INFO [CommitProcessor:2:LearnerSessionTracker@116] - Committing global session 0x200042f19cf130f
>> 2023-01-15 07:11:44,352 [myid:2] - INFO [RequestThrottler:QuorumZooKeeperServer@159] - Submitting global closeSession request for session 0x200042f19cf130f
>>
>> We now know what brings the node down when a Solr node restarts: as the log at [#2] shows, the client session between the Solr node and ZooKeeper times out.
>>
>> While debugging, we also went through a series of issues reported against the ZooKeeper version we are using. In short, they describe slow leader election. When the leader becomes unhealthy, is stopped, or restarts, ZooKeeper nodes restart and the whole ensemble can go down while a new leader is elected. Because that election takes a long time, client sessions time out in the meantime.
>>
>> We tried to reproduce this locally by setting up a Solr and ZooKeeper cluster and forcefully restarting/stopping the leader ZooKeeper node. The result is in *have-not-heard-back-local-cluster.log*, and we were able to reproduce [#2].
>>
>> We would appreciate help finding the possible cause of these frequent SolrCloud node restarts.
>>
>> Regards.

-- 
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!
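The report above reasons that a slow leader election outlasts the ZooKeeper session timeout. Two facts make this easier to check against a concrete configuration. First, the ZooKeeper server clamps the timeout a client requests into `[minSessionTimeout, maxSessionTimeout]`, which default to 2 × `tickTime` and 20 × `tickTime`. Second, the ZooKeeper Java client treats roughly two-thirds of the negotiated timeout without hearing from the server as a lost connection. That is the condition behind the "Client session timed out, have not heard from server in ...ms" message. The Python sketch below applies that arithmetic. The tickTime, requested timeout, and outage lengths are illustrative assumptions, not values from the reported cluster.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ZkServerTimeouts:
    tick_time_ms: int = 2000                       # ZooKeeper's default tickTime
    min_session_timeout_ms: Optional[int] = None   # unset -> 2 * tickTime
    max_session_timeout_ms: Optional[int] = None   # unset -> 20 * tickTime

    def bounds(self) -> Tuple[int, int]:
        lo = self.min_session_timeout_ms
        hi = self.max_session_timeout_ms
        if lo is None:
            lo = 2 * self.tick_time_ms
        if hi is None:
            hi = 20 * self.tick_time_ms
        return lo, hi

    def negotiate(self, requested_ms: int) -> int:
        """Clamp the client's requested session timeout into the server's bounds."""
        lo, hi = self.bounds()
        return max(lo, min(requested_ms, hi))


def client_read_timeout_ms(negotiated_ms: int) -> int:
    """Silence the Java client tolerates before dropping the connection (~2/3)."""
    return negotiated_ms * 2 // 3


def outage_drops_connection(outage_ms: int, negotiated_ms: int) -> bool:
    """True if an ensemble outage (e.g. a leader election) exceeds that silence budget."""
    return outage_ms > client_read_timeout_ms(negotiated_ms)


if __name__ == "__main__":
    server = ZkServerTimeouts()             # illustrative: all defaults
    negotiated = server.negotiate(30000)    # illustrative request of 30000 ms
    print(negotiated, client_read_timeout_ms(negotiated))  # 30000 20000
    print(outage_drops_connection(25000, negotiated))      # True
```

With these assumptions, any election that leaves the client without a live server for more than about 20 s triggers the disconnect. So it is worth comparing the actual leader-election duration in the ZK logs with the negotiated timeout, rather than with the value Solr requests.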