Some things to consider ... If one out of many Solr cores on a pod is down, I would not want the kubelet to restart my Solr pod (if the liveness probe fails) or remove it from the load-balancer service (if the readiness probe fails), because the pod can still serve traffic for the healthy cores. Requiring all cores on a pod to be healthy seems like too high a bar for K8s probes.
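To make that concrete, here is a minimal sketch of probes pointed at the node-level system-info endpoint rather than per-core health. The port and timing values are illustrative assumptions, not recommendations:

    # Probe only the node-level system-info endpoint, so one sick core
    # does not get the whole pod restarted or pulled from the service.
    livenessProbe:
      httpGet:
        path: /solr/admin/info/system
        port: 8983            # default Solr port; adjust for your setup
      initialDelaySeconds: 30 # illustrative values, tune for your cluster
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /solr/admin/info/system
        port: 8983
      periodSeconds: 5
      failureThreshold: 3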
A busy Solr pod hosting large collections with active update and query traffic can take a "long" time to come back online after being killed ("long" being relative to your typical Go-based microservice that can restart in milliseconds, which is what these probes were designed for). SolrCloud has its own request-routing logic based on a very up-to-date cluster state wired into ZK watches, so Solr can be resilient to downed replicas provided at least one healthy replica remains per shard. Moreover, replicas may take time to recover, and the last thing you'd want is for K8s to restart a pod while a replica is close to recovering and re-entering the mix as a healthy replica.

You could maybe use requireHealthyCores=true for a startup probe (a sketch of what that might look like follows the quoted message below).

For me, liveness / readiness probes are more applicable to microservices that are fast to fail and restart; you can have many of them, so pulling one out of the load-balancer on a readiness probe failure is usually the right answer. Moreover, with microservices you typically have a service that does one thing, whereas Solr pods typically host multiple cores.

Lastly, the Solr operator allows you to customize the probe endpoints; see spec.customSolrKubeOptions.podOptions.(liveness|readiness|startup)Probe. We default it to /admin/info/system for the reasons I raised above.

Tim

On Fri, Nov 12, 2021 at 9:15 AM Mathieu Marie <mma...@salesforce.com.invalid> wrote:
>
> Beware that using
> http://node:8983/api/node/health?requireHealthyCores=true for
> your liveness probe assumes that ZK is up and running.
> We are all hoping that ZK is never down, but if it happens, your Solr
> liveness probes will start to fail too, and K8s will restart all your
> Solr pods, adding instability to a cluster that is already in bad shape.
>
> We've configured our liveness probe to /solr/admin/info/system too, and
> we rely on the ZK liveness probe to restart ZK quickly if there is an
> issue. Liveness probes should never rely on a subsystem being up, else
> all your services will go down one after the other.
>
> Regards,
> Mathieu
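To make the startup-probe idea above concrete, and heeding Mathieu's caveat that the health endpoint depends on ZK: a SolrCloud spec fragment along these lines could gate startup on all cores being healthy while keeping liveness on the ZK-independent system-info endpoint. This is only a sketch; the API version, port, and timings are assumptions to verify against your operator release:

    apiVersion: solr.apache.org/v1beta1  # verify against your operator version
    kind: SolrCloud
    metadata:
      name: example
    spec:
      customSolrKubeOptions:
        podOptions:
          # Gate only *startup* on all cores being healthy; once started,
          # fall back to the ZK-independent system-info check.
          startupProbe:
            httpGet:
              path: /api/node/health?requireHealthyCores=true
              port: 8983
            periodSeconds: 10
            failureThreshold: 60  # headroom for large replicas to recover
          livenessProbe:
            httpGet:
              path: /solr/admin/info/system
              port: 8983

Note that a persistently failing startup probe still restarts the container once failureThreshold is exhausted, so it needs enough headroom for slow-recovering replicas; the point is only that routine liveness checks never depend on ZK being up.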