joshsouza opened a new issue, #471: URL: https://github.com/apache/solr-operator/issues/471
We are just starting out with the Solr Operator and intend to move several large Solr clusters over to it for their management. In our initial tests we've encountered a situation that seems incredibly risky, and we would like to understand whether there is a reasonable solution for it in place, or good suggestions for improving reliability around it.

When `SolrCloud.Spec.updateStrategy` is `Managed` (https://apache.github.io/solr-operator/docs/solr-cloud/solr-cloud-crd.html#update-strategy), the operator will never take an action that risks cluster stability (e.g. shutting down a pod that would leave no live replicas). This is fantastic, but it only applies to actions the operator itself initiates (StatefulSet updates etc...) and does not appear to come into play during normal _Kubernetes_ operations, such as node rotations.

On an EKS cluster, when a node group is refreshed, the nodes are marked for termination within their autoscaling groups; the pods on those nodes are then drained and re-scheduled onto valid nodes. The normal way to prevent service disruption during this type of event is a Pod Disruption Budget, which blocks a drain from evicting pods if doing so would cause a disruption. PDBs rely on readiness/liveness status to determine when a disruption would occur, and are generally a reliable way of preventing applications from becoming unavailable.

With Solr, however, there is another level of abstraction: a Solr pod being "ready" does not mean that all of the cores on that node are available/replicated. A PDB, which only monitors readiness, may therefore conclude it is safe to evict an arbitrary pod in the cluster without the logic (which the operator has) of checking whether shutting down that pod would cause a disruption.
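For concreteness, a minimal sketch of the PDB approach described above. The names and pod labels here are assumptions for illustration (check the labels the operator actually applies to your pods):

```yaml
# Hypothetical PDB for a SolrCloud named "example".
# Allows at most one Solr pod to be voluntarily evicted at a time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-solrcloud-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      solr-cloud: example   # assumed label; verify against your running pods
```

Even with `maxUnavailable: 1`, this only gates evictions on pod readiness, not on whether the evicted pod holds the last live replica of a shard, which is exactly the gap being described.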
Since nodes/pods in a large cluster may take time to recover after going down, and since without a PDB multiple pods may go down simultaneously, we perceive a risk that Solr's availability could suffer should a node rotation or other form of pod deletion occur outside the operator's purview.

So, my question is: what methodology is recommended for eliminating this risk? Are there configurations we've overlooked that would reduce it? Has the community simply accepted this limitation and found ways to reduce the odds of being impacted? (Are we maybe overreacting, and this isn't actually a risk?)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org