Oops, sent that a little too soon ... I also wanted to mention that if you're running Solr 8+, you can use /admin/info/health instead of /admin/info/system for the probe path (see https://issues.apache.org/jira/browse/SOLR-11126), like this:
livenessProbe:
  httpGet:
    path: /admin/info/health
    port: 8983
readinessProbe:
  httpGet:
    path: /admin/info/health
    port: 8983

On Fri, Nov 12, 2021 at 11:11 AM Timothy Potter <thelabd...@apache.org> wrote:
>
> Some things to consider ...
>
> If one out of many Solr cores is down on a pod, I would not want
> Kubelet to restart my Solr pod (if the liveness probe fails) or even
> remove it from the load-balancer service (if the readiness probe
> fails), because the pod can still serve traffic for the healthy cores.
> Requiring all cores on a pod to be healthy seems like too high a bar
> for K8s probes.
>
> A busy Solr pod with large collections and active update and query
> traffic can take a "long" time to come back online after being killed
> (long being relative to your typical Go-based microservice that can
> restart in milliseconds, which is what these probes were designed for).
>
> SolrCloud has its own request-routing logic based on a very up-to-date
> cluster state that's wired into ZK watches, so Solr can be resilient
> to downed replicas provided there is at least one healthy replica per
> shard.
>
> Moreover, replicas may take time to recover, and the last thing you'd
> want is for K8s to restart a pod while a replica is close to
> recovering and re-entering the mix as a healthy replica.
>
> You could maybe use the requireHealthyCores=true request for a startup
> probe.
>
> For me, liveness / readiness probes are more applicable to
> microservices that are fast to fail and restart; you can have many of
> them, so pulling one out of the load balancer due to a readiness probe
> failure is usually the right answer. Moreover, with microservices, you
> typically have a service that does one thing, but Solr pods typically
> host multiple cores.
>
> Lastly, the Solr operator allows you to customize the probe endpoints;
> see spec.customSolrKubeOptions.podOptions.(liveness|readiness|startup)Probe.
> We default it to /admin/info/system for the reasons I raised above.
>
> Tim
>
> On Fri, Nov 12, 2021 at 9:15 AM Mathieu Marie
> <mma...@salesforce.com.invalid> wrote:
> >
> > Beware that using
> > http://node:8983/api/node/health?requireHealthyCores=true for your
> > liveness probe assumes that ZK is up and running.
> > We are all hoping that ZK is never down, but if it happens, your Solr
> > liveness probes will start to fail too, and K8s will restart all your
> > Solr pods, adding instability to a cluster that is already in bad
> > shape.
> >
> > We've configured our liveness probe to /solr/admin/info/system too,
> > and we rely on the ZK liveness probe to restart ZK quickly if there
> > is an issue.
> > Liveness probes should never rely on a subsystem being up, or else
> > all your services will go down one after the other.
> >
> > Regards,
> > Mathieu
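
P.S. For anyone running the Solr operator, here's a rough sketch of what
overriding the probes on a SolrCloud resource can look like, using the
spec.customSolrKubeOptions.podOptions fields Tim mentioned. The resource
name, thresholds, and exact paths below are illustrative only (depending
on your setup the probe paths may need a /solr prefix), so treat it as a
starting point rather than a recommendation:

apiVersion: solr.apache.org/v1beta1
kind: SolrCloud
metadata:
  name: example                      # hypothetical name
spec:
  customSolrKubeOptions:
    podOptions:
      livenessProbe:
        httpGet:
          path: /admin/info/health   # node-level health check (Solr 8+)
          port: 8983
      readinessProbe:
        httpGet:
          path: /admin/info/health
          port: 8983
      startupProbe:
        # The stricter requireHealthyCores=true check is reserved for
        # startup only, so a single unhealthy core doesn't get a running
        # pod restarted or pulled out of the service.
        httpGet:
          path: /api/node/health?requireHealthyCores=true
          port: 8983
        periodSeconds: 10            # illustrative values; give large
        failureThreshold: 60         # cores enough time to load/recover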