On 5/13/2019 8:26 AM, Maulin Rathod wrote:
Recently we have been observing an issue where a Solr node (any random node)
automatically goes into recovery mode and stops responding.

Do you KNOW that these Solr instances actually need a 60GB heap? That's a HUGE heap. When a full GC happens on a heap that large, it's going to be a long pause, and there's nothing that can be done about it.

We have enough memory allocated to Solr (60 GB) and the system also has enough
memory (300 GB)...

As just mentioned, unless you are CERTAIN that you need a 60GB heap, which most users do not, don't set it that high. Any advice you read that says "set the heap to XX percent of the installed system memory" will frequently result in a setting that's incorrect for your specific setup.

And if you really DO need a 60GB heap, I would recommend either adding more servers and putting less of your index on each one, or splitting your replicas between two Solr instances, each running a heap of 31GB or less -- as Erick mentioned in his reply.

We have analyzed the GC logs and found a GC pause of 29.6583943 seconds when
the problem happened. Can this GC pause cause the node to become unavailable
or go into recovery mode? Or could there be some other reason?

Please note we have set zkClientTimeout to 10 minutes (zkClientTimeout=600000) 
so that zookeeper will not consider this node unavailable during high GC pause 
time.

You can't actually set that timeout that high. Solr's zkClientTimeout value is used to request ZooKeeper's session timeout, and I believe that by default ZooKeeper caps the session timeout at 20 times the tickTime, which is typically set to 2 seconds. So unless maxSessionTimeout has been raised on the ZooKeeper servers, 40 seconds is typically the maximum you can have for that timeout.
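
To make that arithmetic concrete, here is a minimal Python sketch of the clamping as I understand it. The function name is made up for illustration and is not a ZooKeeper or Solr API. The 2x and 20x multipliers are the ZooKeeper defaults for minSessionTimeout and maxSessionTimeout, so the result will differ if those have been changed in zoo.cfg.

```python
def negotiated_session_timeout_ms(requested_ms, tick_time_ms=2000,
                                  min_ms=None, max_ms=None):
    """Clamp a requested session timeout to the range the server allows.

    Hypothetical helper for illustration only. Assumes the ZooKeeper
    defaults: minimum = 2 * tickTime, maximum = 20 * tickTime.
    """
    if min_ms is None:
        min_ms = 2 * tick_time_ms
    if max_ms is None:
        max_ms = 20 * tick_time_ms
    return max(min_ms, min(requested_ms, max_ms))


# zkClientTimeout=600000 (10 minutes) with the typical tickTime of 2000 ms
print(negotiated_session_timeout_ms(600000))  # 40000
```

Under those assumptions, asking for 10 minutes still yields a 40 second session timeout.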

And, as Erick also mentioned, that specific timeout is not the only way a long GC pause can cause problems. SolrCloud is not going to work well with a huge heap ... eventually a full GC is going to happen, and if it takes more than a few seconds, it's going to cause issues.

Thanks,
Shawn
