Hi Mathieu, Timothy,

Thanks, your contributions just made me realize that in my previous email the meaning I gave to the word "liveness" (container probes) was too generic.

For the sake of clarity, there are three types of container probes: livenessProbe, readinessProbe and startupProbe.

livenessProbe: indicates whether the container is running
readinessProbe: indicates whether the container is ready to respond to requests
startupProbe: indicates whether the application within the container is started
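To make the distinction concrete, here is a minimal sketch of how the three probes sit side by side in a container spec (the endpoints and timings below are only illustrative, not a recommendation):

# illustrative values only; adjust paths and timings to your deployment
livenessProbe:
  httpGet:
    path: /solr/admin/info/system
    port: 8983
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /solr/admin/info/system
    port: 8983
  periodSeconds: 10
startupProbe:
  httpGet:
    path: /api/node/health?requireHealthyCores=true
    port: 8983
  failureThreshold: 30
  periodSeconds: 10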
Given that, in my previous email I was referring to a scenario where you are applying a configuration change to your SolrCloud cluster (i.e. all the Solr pods have to be restarted).

There are many situations where you may need to apply a change to your cluster that leads to a full restart: a JVM config change (memory, garbage collector, system properties), a Kubernetes config change (vertical scaling, env variables, logging, etc.) or a Solr config change.

In all these cases you cannot use /solr/admin/info/system as the startupProbe while your instances are being queried in production. This has catastrophic effects, because Kubernetes restarts all the Solr instances within a short time. Restarting the instances in a short time means that one, several or all the cores in a Solr node don't have the time to become "active". This happens mostly because, as Timothy said, a busy Solr pod with large collections and active update and query traffic can take a "long" time to come back online after a restart.

On the other hand, /api/node/health?requireHealthyCores=true fits very well as a startupProbe in this scenario:

startupProbe:
  failureThreshold: 30
  httpGet:
    path: /api/node/health?requireHealthyCores=true
    port: 8983
    scheme: HTTP
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1

On Fri, Nov 12, 2021 at 8:26 PM Timothy Potter <thelabd...@gmail.com> wrote:

> oops! sent a little too soon ... also wanted to mention that if you're
> running Solr 8+, you can use /admin/info/health instead of
> /admin/info/system for the probe path (see:
> https://issues.apache.org/jira/browse/SOLR-11126), like this:
>
> livenessProbe:
>   httpGet:
>     path: /admin/info/health
>     port: 8983
> readinessProbe:
>   httpGet:
>     path: /admin/info/health
>     port: 8983
>
>
> On Fri, Nov 12, 2021 at 11:11 AM Timothy Potter <thelabd...@apache.org> wrote:
> >
> > Some things to consider ...
> >
> > If one out of many Solr cores is down on a pod, I would not want
> > Kubelet to restart my Solr pod (if the liveness probe fails) or even
> > remove it from the load-balancer service (if the readiness probe fails),
> > because the pod can still serve traffic for the healthy cores.
> > Requiring all cores on a pod to be healthy seems like too high a bar
> > for K8s probes.
> >
> > Killing a busy Solr pod with large collections and active update and
> > query traffic can take a "long" time to come back online (long being
> > relative to your typical Go-based microservice that can restart in
> > milliseconds, which is what these probes were designed for).
> >
> > SolrCloud has its own request routing logic based on a very up-to-date
> > cluster state that's wired into ZK watches, so Solr can be resilient
> > to downed replicas provided there is at least one per shard that is
> > healthy.
> >
> > Moreover, replicas may take time to recover and the last thing you'd
> > want is for K8s to restart a pod while a replica is close to
> > recovering and re-entering the mix as a healthy replica.
> >
> > You could maybe use the request with requireHealthyCores=true for a
> > startup probe.
> >
> > For me, the liveness / readiness probes are more applicable to
> > microservices that are fast to fail and restart, and you can have many
> > of them, so pulling one out of the load-balancer due to a readiness
> > probe failure is usually the right answer. Moreover, with
> > microservices you typically have a service that does one thing, but
> > Solr pods typically host multiple cores.
> >
> > Lastly, the Solr operator allows you to customize the probe endpoints, see:
> > spec.customSolrKubeOptions.podOptions.(liveness|readiness|startup)Probe.
> > We default it to /admin/info/system for the reasons I raised above.
> >
> > Tim
> >
> > On Fri, Nov 12, 2021 at 9:15 AM Mathieu Marie
> > <mma...@salesforce.com.invalid> wrote:
> > >
> > > Beware that using
> > > http://node:8983/api/node/health?requireHealthyCores=true for
> > > your liveness probe assumes that ZK is up and running.
> > > We are all hoping that ZK is never down, but if it happens, your Solr
> > > liveness probes will start to fail too, and K8s will restart all your
> > > Solr pods, adding instability to a cluster that is already in bad shape.
> > >
> > > We've configured our liveness probe to /solr/admin/info/system too, and
> > > we rely on the ZK liveness probe to restart ZK quickly if there is an issue.
> > > Liveness probes should never rely on a subsystem being up, else all your
> > > services will go down one after the other.
> > >
> > > Regards,
> > > Mathieu

--
Vincenzo D'Amore
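For completeness, a sketch of what overriding the startup probe could look like in a SolrCloud custom resource, using the spec.customSolrKubeOptions.podOptions.startupProbe path Tim mentions above (the apiVersion and the surrounding fields are assumptions to verify against your operator version):

apiVersion: solr.apache.org/v1beta1   # assumed CRD version, check your operator release
kind: SolrCloud
metadata:
  name: example
spec:
  replicas: 3
  customSolrKubeOptions:
    podOptions:
      # override only the startup probe; liveness/readiness keep the operator defaults
      startupProbe:
        httpGet:
          path: /api/node/health?requireHealthyCores=true
          port: 8983
          scheme: HTTP
        failureThreshold: 30
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 1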