Some things to consider ...

If one out of many Solr cores on a pod is down, I would not want the
kubelet to restart my Solr pod (if the liveness probe fails), or even
have it removed from the load-balancer Service (if the readiness probe
fails), because the pod can still serve traffic for the healthy cores.
Requiring all cores on a pod to be healthy seems like too high a bar
for K8s probes.

A busy Solr pod hosting large collections with active update and query
traffic can take a "long" time to come back online after being killed
(long being relative to a typical Go-based microservice that can
restart in milliseconds, which is what these probes were designed for).
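
If you do keep a liveness probe on pods like that, it's worth loosening
the timings so a slow restart doesn't get killed mid-recovery. A minimal
pod-spec sketch (values are illustrative, not recommendations):

  livenessProbe:
    httpGet:
      path: /solr/admin/info/system   # lightweight, doesn't require healthy cores
      port: 8983
    initialDelaySeconds: 60   # give Solr time to start before the first probe
    periodSeconds: 15
    failureThreshold: 10      # tolerate ~2.5 minutes of failures before a restart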

SolrCloud has its own request routing logic based on a very up-to-date
cluster state that's wired into ZK watches, so Solr can be resilient
to downed replicas provided each shard has at least one healthy
replica.

Moreover, replicas may take time to recover, and the last thing you'd
want is for K8s to restart a pod while a replica is close to
recovering and re-entering the mix as a healthy replica.

That said, you could use a request with requireHealthyCores=true for a
startup probe.
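
For example, something along these lines via the operator's podOptions
(a sketch only; check the field names against your operator version,
and treat the timing values as illustrative):

  spec:
    customSolrKubeOptions:
      podOptions:
        startupProbe:
          httpGet:
            path: /api/node/health?requireHealthyCores=true
            port: 8983
          periodSeconds: 10
          failureThreshold: 90   # allow up to ~15 minutes for all cores to come up

Once the startup probe succeeds it never runs again, so a slow core
recovery at boot won't trigger restarts later on.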

For me, the liveness / readiness probes are more applicable to
microservices that are fast to fail and restart, and you can have many
of them, so pulling one out of the load-balancer due to a readiness
probe failure is usually the right answer. Moreover, with
microservices you typically have a service that does one thing, but
Solr pods typically host multiple cores.

Lastly, the Solr operator allows you to customize the probe endpoints;
see: spec.customSolrKubeOptions.podOptions.(liveness|readiness|startup)Probe.
We default them to /solr/admin/info/system for the reasons I raised above.
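
For instance, if you did want readiness tied to the node-level health
API instead (same sketch caveats as above; note this still assumes ZK
is reachable, per Mathieu's point below):

  spec:
    customSolrKubeOptions:
      podOptions:
        readinessProbe:
          httpGet:
            path: /api/node/health   # node-level health, no requireHealthyCores
            port: 8983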

Tim

On Fri, Nov 12, 2021 at 9:15 AM Mathieu Marie
<mma...@salesforce.com.invalid> wrote:
>
> Beware that using
> http://node:8983/api/node/health?requireHealthyCores=true for your
> liveness probe assumes that ZK is up and running.
> We all hope that ZK is never down, but if it happens, your Solr
> liveness probes will start to fail too, and K8s will restart all your
> Solr pods, adding instability to a cluster that is already in bad shape.
>
> We've configured our liveness probe to hit /solr/admin/info/system too,
> and we rely on ZK's own liveness probe to restart ZK quickly if there
> is an issue.
> A liveness probe should never depend on a subsystem being up, or else
> all your services will go down one after the other.
>
> Regards,
> Mathieu
