Hi Jan, I agree: if liveness is not configured correctly we could end up in an endless loop and the node never becomes healthy.

Please consider another scenario, a common one: at least 3 Solr instances in production with 24/7 high availability, light-to-heavy indexing and a heavy query load. When we have to restart a Solr instance, for whatever reason, the number of seconds or minutes we have to wait until all the cores come back up can be pretty high. If we don't configure the liveness/readiness probes correctly, Kubernetes can restart the next instance while the former is still recovering or coming up, i.e. it is not yet ready.

So, for example, when we have to apply a change to the Solr config on all the instances, we really can't shut down more than one of them at a time. After each restart we must wait for the full availability of that instance, and in the meanwhile the two remaining instances must have all their cores up and running. In other words, restarting a Solr instance increases the load on the remaining instances and usually slows the overall performance, but if done badly it can bring the whole cluster down.
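Just to make the idea concrete, here is a rough sketch of the kind of check I'd run between restarts, polling the node health endpoint you mention below. It's only an illustration, not something we run in production, and the host name, port, poll interval and timeout are placeholders:

    import sys
    import time
    import urllib.error
    import urllib.request

    # Placeholder values -- tune them for your cluster
    HEALTH_URL = "http://{host}:8983/api/node/health?requireHealthyCores=true"
    POLL_INTERVAL_SECONDS = 10
    TIMEOUT_SECONDS = 1800  # leave the node time to bring every core out of RECOVERING

    def wait_until_healthy(host):
        """Poll the node health endpoint until it returns HTTP 200 or the deadline passes."""
        url = HEALTH_URL.format(host=host)
        deadline = time.time() + TIMEOUT_SECONDS
        while time.time() < deadline:
            try:
                with urllib.request.urlopen(url, timeout=5) as resp:
                    if resp.status == 200:  # all active cores on the node are healthy
                        return True
            except (urllib.error.HTTPError, urllib.error.URLError):
                pass  # node still starting, or some cores still recovering
            time.sleep(POLL_INTERVAL_SECONDS)
        return False

    if __name__ == "__main__":
        node = sys.argv[1]  # e.g. a hypothetical solr-0.solr-headless
        if not wait_until_healthy(node):
            sys.exit("Node %s did not become healthy in time, aborting the rolling restart" % node)

Only when this exits successfully would I let the next instance be restarted; something equivalent could of course live in a readiness probe or in the solr-operator itself.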
On Mon, Nov 1, 2021 at 5:10 PM Jan Høydahl <jan....@cominvent.com> wrote:

> If recovery failed, then that core is dead, it has given up.
> So if an agent has just restarted or started a node, then it will wait
> until all cores have a "stable" or "final" state, before it declares the
> NODE as healthy, and consider restarting other nodes.
> If a core (replica of a shard in a collection) is in DOWN state, it has
> just booted and will soon go into RECOVERING. It will stay in RECOVERING
> until it either is OK or RECOVERY_FAILED.
> There is no point in waiting in an endless loop for every single core on a
> node to come up, we just want them to finish initializing and enter a
> stable state.
> I guess other logic in solr-operator will take care of deciding how many
> replicas for a shard are live, as to whether it is safe to take down the
> next pod/node.
>
> Jan
>
> > On 31 Oct 2021, at 16:14, 戴晓彬 <xiaobin_...@foxmail.com> wrote:
> >
> > I'm a little puzzled, why UNHEALTHY_STATES does not contain
> > State.RECOVERY_FAILED
> >
> >> On 31 Oct 2021, at 22:45, Jan Høydahl <jan....@cominvent.com> wrote:
> >>
> >> See
> >> https://solr.apache.org/guide/8_10/implicit-requesthandlers.html#admin-handlers,
> >> you can query each node with
> >>
> >> http://node:8983/api/node/health?requireHealthyCores=true
> >>
> >> It will only return HTTP 200 if all active cores on the node are
> >> healthy (none starting or recovering).
> >>
> >> Jan
> >>
> >>> On 27 Oct 2021, at 17:27, Vincenzo D'Amore <v.dam...@gmail.com> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> when a Solr instance is started I would be sure all the indexes present are
> >>> up and running, in other words that the instance is healthy.
> >>> The healthy status (aka liveness/readiness) is especially useful when a
> >>> Kubernetes SolrCloud cluster has to be restarted for any configuration
> >>> management needs and you want to apply your change one node at a time.
> >>> AFAIK I can ping only one index at a time, but there is no way out of the
> >>> box to test that a bunch of indexes are active (green status).
> >>> Have you ever faced the same problem? What do you think?
> >>>
> >>> Best regards,
> >>> Vincenzo
> >>>
> >>> --
> >>> Vincenzo D'Amore
> >>
> >
>

--
Vincenzo D'Amore