On Fri, Nov 12, 2021 at 10:54 AM Jan Høydahl <jan....@cominvent.com> wrote:

> I agree that this is a risk. It all comes back to your initial sizing of
> the cluster.
> If you have decided on three nodes, and have an HA policy of tolerating
> the loss of any one server at a time, then you have to fully stress test
> your system with only two of those three nodes. If the two nodes cannot
> handle peak traffic, then you are fooling yourself to believe that you
> have fulfilled your HA policy.
> Some more crucial systems even have an N+2 HA policy, i.e. you should
> tolerate the loss/crash of two random servers at the same time. Even more
> important to test the system in the failing condition! Time is also a
> factor here. The longer it takes for a single node to reboot, the more
> likely that another node will crash during that window. So keeping the
> restart time low is always a bonus.
>
> It could be that if your nodes are few and large, with lots of replicas
> and lots of data, it would be better to switch to a strategy with more,
> smaller/cheaper nodes with fewer replicas each. Then the consequence of a
> node loss is smaller, and recovery is quicker.
>

Your reasoning is correct but, IMHO, a little bit theoretical.
If we are talking about Kubernetes and how a SolrCloud cluster is deployed,
the problem still comes down to the liveness probe.
May I add a few things:
- assuming we have N Solr instances with replication factor N for each core,
if the liveness probe is not configured strictly enough, Kubernetes can
restart all N instances in short order. In other words, with a lax liveness
configuration, if N is not big enough, you can end up with every instance
having one or more cores that are not ready.
- I would also add that many customers don't have the money, the time or
the resources to run a bigger cluster or to implement such a thorough HA
policy, so we need some give-and-take arrangements.
On the other hand, being strict on the health check and having good
monitoring can do the trick.
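
Just to make "being strict on the health check" concrete, this is roughly
what I have in mind at the pod level (a sketch only: the numbers are made-up
placeholders to tune per cluster, port 8983 is the default Solr port, and
the health endpoint is the one Jan pointed to earlier in the thread):

  readinessProbe:
    httpGet:
      # only report Ready when every active core on the node is healthy
      path: /api/node/health?requireHealthyCores=true
      port: 8983
    periodSeconds: 10
    timeoutSeconds: 5
    failureThreshold: 3     # a few failed probes -> pod marked NotReady
  livenessProbe:
    httpGet:
      # process-level check only, so recovering cores don't get the pod killed
      path: /solr/admin/info/system
      port: 8983
    initialDelaySeconds: 60 # give the JVM and the cores time to start
    periodSeconds: 20
    failureThreshold: 5

The idea is to keep liveness lenient (restart only when the process is really
stuck) and readiness strict (don't send traffic, and don't let a rolling
restart proceed, until all cores are up).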


>
> I think and hope that the current liveness logic in solr-operator is
> robust.
>
>
These days I have been digging into how the Solr instance liveness is
configured by the solr-operator.
After installing the example with 3 nodes, I see that the liveness probe is
based on /solr/admin/info/system, which is rather unhelpful, if what I have
said above is right, because that endpoint says nothing about the state of
the cores.

This is the example I have used:
https://apache.github.io/solr-operator/docs/running-the-operator
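
If I read the solr-operator docs right, the generated probes can be
overridden from the SolrCloud resource itself, so the readiness check could
point at the health API instead of /solr/admin/info/system. Something along
these lines (an untested sketch; the customSolrKubeOptions.podOptions field
names and the apiVersion should be double-checked against the CRD reference
for the operator version in use):

  apiVersion: solr.apache.org/v1beta1
  kind: SolrCloud
  metadata:
    name: example
  spec:
    replicas: 3
    customSolrKubeOptions:
      podOptions:
        readinessProbe:
          httpGet:
            # report Ready only when all active cores on the node are healthy
            path: /api/node/health?requireHealthyCores=true
            port: 8983
          periodSeconds: 10
          failureThreshold: 3

I would expect that, with such a probe, a rolling restart would not move to
the next pod until every core on the restarted node has left the
DOWN/RECOVERING state, but I still need to verify that.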



> Jan
>
> > On 12 Nov 2021, at 10:43, Vincenzo D'Amore <v.dam...@gmail.com> wrote:
> >
> > Hi Jan,
> >
> > I agree, if liveness is not configured correctly we could end up in an
> > endless loop and the node never becomes healthy.
> > Please consider another scenario, a common case where there are at least 3
> > Solr instances in production with 24/7 high availability, a light-to-heavy
> > indexing load and heavy query traffic.
> > When we have to restart a Solr instance, for whatever reason, the number of
> > seconds or minutes that we have to wait until all the cores come up can be
> > pretty high.
> > If we don't configure the liveness right, Kubernetes can restart the next
> > instance while the former is still recovering or coming up, i.e. not yet
> > ready.
> > So, for example, when we have to apply a change to the Solr config on all
> > the Solr instances, we really can't shut down more than one of them.
> > After restarting one, we must wait for the instance to become fully
> > available, and in the meantime the two remaining instances must have all
> > their cores up and running.
> > In other words, when you restart a Solr instance, the increased load on
> > the remaining instances usually slows the overall performance but, if
> > done badly, it can bring the cluster down.
> >
> >
> > On Mon, Nov 1, 2021 at 5:10 PM Jan Høydahl <jan....@cominvent.com> wrote:
> >
> >> If recovery failed, then that core is dead, it has given up.
> >> So if an agent has just restarted or started a node, then it will wait
> >> until all cores have a "stable" or "final" state, before it declares the
> >> NODE as healthy, and consider restarting other nodes.
> >> If a core (replica of a shard in a collection) is in DOWN state, it has
> >> just booted and will soon go into RECOVERING. It will stay in RECOVERING
> >> until it either is OK or RECOVERY_FAILED.
> >> There is no point in waiting in an endless loop for every single core on
> >> a node to come up, we just want them to finish initializing and enter a
> >> stable state.
> >> I guess other logic in solr-operator will take care of deciding how many
> >> replicas for a shard are live, as to whether it is safe to take down the
> >> next pod/node.
> >>
> >> Jan
> >>
> >>> On 31 Oct 2021, at 16:14, 戴晓彬 <xiaobin_...@foxmail.com> wrote:
> >>>
> >>> I'm a little puzzled: why does UNHEALTHY_STATES not contain
> >>> State.RECOVERY_FAILED?
> >>>
> >>>> On 31 Oct 2021, at 22:45, Jan Høydahl <jan....@cominvent.com> wrote:
> >>>>
> >>>> See
> >>>> https://solr.apache.org/guide/8_10/implicit-requesthandlers.html#admin-handlers,
> >>>> you can query each node with
> >>>>
> >>>> http://node:8983/api/node/health?requireHealthyCores=true
> >>>>
> >>>> It will only return HTTP 200 if all active cores on the node are
> >>>> healthy (none starting or recovering).
> >>>>
> >>>> Jan
> >>>>
> >>>>> On 27 Oct 2021, at 17:27, Vincenzo D'Amore <v.dam...@gmail.com> wrote:
> >>>>>
> >>>>> Hi all,
> >>>>>
> >>>>> when a Solr instance is started I would like to be sure that all the
> >>>>> indexes present are up and running, in other words that the instance
> >>>>> is healthy.
> >>>>> The healthy status (aka liveness/readiness) is especially useful when
> >>>>> a Kubernetes SolrCloud cluster has to be restarted for any
> >>>>> configuration management needs and you want to apply your change one
> >>>>> node at a time.
> >>>>> AFAIK I can ping only one index at a time, but there is no way out of
> >>>>> the box to test that a bunch of indexes are active (green status).
> >>>>> Have you ever faced the same problem? What do you think?
> >>>>>
> >>>>> Best regards,
> >>>>> Vincenzo
> >>>>>
> >>>>> --
> >>>>> Vincenzo D'Amore
> >>>>
> >>>
> >>
> >>
> >
> > --
> > Vincenzo D'Amore
>
>

-- 
Vincenzo D'Amore
