Hi Mathieu, Timothy,

Thanks, your contributions just made me realize that in my previous email,
the meaning I gave to the word liveness (container probes) was too generic.
For the sake of clarity, there are 3 types of container probes:
livenessProbe, readinessProbe, startupProbe.
livenessProbe: Indicates whether the container is running
readinessProbe: Indicates whether the container is ready to respond to
requests
startupProbe: Indicates whether the application within the container is
started
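
To show where each one sits in a pod spec, here is a minimal sketch (the /solr/admin/info/system path and port 8983 come from the examples below in this thread; the image and the absence of thresholds are just illustrative):

    containers:
      - name: solr
        image: solr:8.11            # illustrative image
        ports:
          - containerPort: 8983
        startupProbe:               # gates the other two probes until it succeeds
          httpGet:
            path: /solr/admin/info/system
            port: 8983
        livenessProbe:              # a failure makes the kubelet restart the container
          httpGet:
            path: /solr/admin/info/system
            port: 8983
        readinessProbe:             # a failure removes the pod from the Service endpoints
          httpGet:
            path: /solr/admin/info/system
            port: 8983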

Given that, in my previous email I was referring to a scenario where you
are applying a configuration change to your SolrCloud cluster (i.e. all the
Solr pods have to be restarted).
There are many situations where you may need to apply a change to your
cluster that leads to a full restart: a JVM config change (memory, garbage
collector, system properties), a Kubernetes config change (vertical
scaling, env variables, logging, etc.), or a Solr config change.
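For example, with the Solr operator even a small heap change rolls every
Solr pod (a sketch, assuming the usual SolrCloud CRD layout; field names may
differ in your operator version):

    apiVersion: solr.apache.org/v1beta1
    kind: SolrCloud
    metadata:
      name: example
    spec:
      replicas: 3
      solrJavaMem: "-Xms2g -Xmx2g"   # changing this value triggers a rolling restart of all pods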
And in all these cases you cannot use /solr/admin/info/system as the
startupProbe while your instances are serving production traffic.
This leads to catastrophic effects, because Kubernetes will restart all the
Solr instances within a short time.
Restarting the instances in such a short window means that one, several, or
even all of the cores in a Solr node don't have the time to become "active".
This happens mostly because, as Timothy said, a busy Solr pod with large
collections and active update and query traffic can take a "long" time to
come back online after a restart.
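
To make it concrete, this is the kind of startupProbe I consider problematic
in that scenario (a sketch; the thresholds are illustrative):

        startupProbe:
          httpGet:
            path: /solr/admin/info/system   # only tells you the node answers, not that its cores are active
            port: 8983
          periodSeconds: 10
          failureThreshold: 3

With this probe a freshly restarted pod reports "started" as soon as the
Solr process responds, so the rolling restart moves on to the next pod even
though cores may still be recovering.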

On the other hand, /api/node/health?requireHealthyCores=true fits very well
as a startupProbe in this scenario:

        startupProbe:
          failureThreshold: 30
          httpGet:
            path: /api/node/health?requireHealthyCores=true
            port: 8983
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
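
With failureThreshold: 30 and periodSeconds: 10 the pod gets up to roughly
five minutes to report healthy cores, and the liveness/readiness probes only
start once the startup probe has succeeded.
If you deploy through the Solr operator, the same probe can be plugged in
via the option Tim mentions below,
spec.customSolrKubeOptions.podOptions.startupProbe; a sketch (CRD name and
apiVersion assumed):

    apiVersion: solr.apache.org/v1beta1
    kind: SolrCloud
    metadata:
      name: example
    spec:
      customSolrKubeOptions:
        podOptions:
          startupProbe:
            httpGet:
              path: /api/node/health?requireHealthyCores=true
              port: 8983
              scheme: HTTP
            periodSeconds: 10
            failureThreshold: 30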



On Fri, Nov 12, 2021 at 8:26 PM Timothy Potter <thelabd...@gmail.com> wrote:

> oops! sent a little too soon ... also wanted to mention that if you're
> running Solr 8+, you can use /admin/info/health instead of
> /admin/info/system for the probe path (see:
> https://issues.apache.org/jira/browse/SOLR-11126), like this:
>
> livenessProbe:
>   httpGet:
>     path: /admin/info/health
>     port: 8983
> readinessProbe:
>   httpGet:
>     path: /admin/info/health
>     port: 8983
>
>
> On Fri, Nov 12, 2021 at 11:11 AM Timothy Potter <thelabd...@apache.org>
> wrote:
> >
> > Some things to consider ...
> >
> > If one out of many Solr cores is down on a pod, I would not want
> > Kubelet to restart my Solr pod (if liveness probe fails) or even
> > remove it from the load-balancer service (if readiness probe fails)
> > because the pod can still serve traffic for the healthy cores.
> > Requiring all cores on a pod to be healthy seems like too high of a
> > bar for K8s probes.
> >
> > If you kill a busy Solr pod with large collections and active update
> > and query traffic, it can take a "long" time to come back online (long
> > being relative to your typical Go-based microservice that can restart
> > in milliseconds, which is what these probes were designed for).
> >
> > SolrCloud has its own request routing logic based on a very up-to-date
> > cluster state that's wired into ZK watches, so Solr can be resilient
> > to downed replicas provided there is at least one per shard that is
> > healthy.
> >
> > Moreover, replicas may take time to recover and the last thing you'd
> > want is for K8s to restart a pod while a replica is close to
> > recovering and re-entering the mix as a healthy replica.
> >
> > You could maybe use the requireHealthyCores=true request for a
> > startup probe.
> >
> > For me, the liveness / readiness probes are more applicable for
> > microservices that are fast to fail and restart and you can have many
> > of them so pulling one out of the load-balancer due to a readiness
> > probe failure is usually the right answer. Moreover, with
> > microservices, you typically have a service that does one thing, but
> > Solr pods typically host multiple cores.
> >
> > Lastly, the Solr operator allows you to customize the probe endpoints,
> > see:
> > spec.customSolrKubeOptions.podOptions.(liveness|readiness|startup)Probe.
> > We default it to /admin/info/system for the reasons I raised above.
> >
> > Tim
> >
> > On Fri, Nov 12, 2021 at 9:15 AM Mathieu Marie
> > <mma...@salesforce.com.invalid> wrote:
> > >
> > > Beware that using
> > > http://node:8983/api/node/health?requireHealthyCores=true for
> > > your liveness assumes that ZK is up and running.
> > > We are all hoping that ZK is never down, but if it happens, your Solr
> > > liveness probe will start to fail too, and K8S will restart all your
> > > Solr pods, adding instability to a cluster that is already in bad shape.
> > >
> > > We've configured our liveness to /solr/admin/info/system too, and we
> > > rely on the ZK liveness probe to restart ZK quickly if there is an issue.
> > > Liveness probes should never rely on a subsystem being up, else all
> > > your services will go down one after the other.
> > >
> > > Regards,
> > > Mathieu
>


-- 
Vincenzo D'Amore
