I agree that this is a risk. It all comes back to your initial sizing of the cluster. If you have decided on three nodes, and have an HA policy of tolerating the loss of any one server at a time, then you have to fully stress test your system with only two of those three nodes. If those two nodes cannot handle peak traffic, then you are fooling yourself if you believe you have fulfilled your HA policy. Some more crucial systems even have an N+2 HA policy, i.e. they should tolerate the loss/crash of two random servers at the same time. That makes it even more important to test the system in the failing condition! Time is also a factor here. The longer it takes for a single node to reboot, the more likely it is that another node will crash during that window. So keeping the restart time low is always a bonus.
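To keep that window short in practice, whatever agent does the rolling restart should gate each step on the node health endpoint mentioned further down in this thread, and not touch the next node before the previous one reports healthy. A rough sketch of such a wait loop (hostnames, port and timeout are just placeholders, adjust for your cluster):

import time
import urllib.error
import urllib.request

NODES = ["solr-0.example.com", "solr-1.example.com", "solr-2.example.com"]  # placeholder hostnames
TIMEOUT_SECONDS = 600  # give up if a node has not become healthy within 10 minutes

def node_is_healthy(node: str) -> bool:
    # The handler answers HTTP 200 only when no active core on the node is
    # still starting or recovering (see the endpoint quoted later in this thread).
    url = f"http://{node}:8983/api/node/health?requireHealthyCores=true"
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def wait_until_healthy(node: str) -> None:
    # Block until the node reports healthy, so the next restart never
    # overlaps with a node that is still recovering.
    deadline = time.time() + TIMEOUT_SECONDS
    while time.time() < deadline:
        if node_is_healthy(node):
            return
        time.sleep(10)
    raise RuntimeError(f"{node} did not become healthy within {TIMEOUT_SECONDS}s")

Something along those lines is roughly what we rely on the operator to do for us; the point is simply never to move on while a node is still recovering.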
It could be that if your nodes are few and large, with lots of replicas and lots of data, it would be better to switch to a strategy with more smaller/cheaper nodes and fewer replicas each. Then the consequence of a node loss is smaller, and it is quicker to recover. I think and hope that the current liveness logic in solr-operator is robust.

Jan

> On 12 Nov 2021, at 10:43, Vincenzo D'Amore <v.dam...@gmail.com> wrote:
>
> Hi Jan,
>
> I agree, if liveness is not configured correctly we could end up in an
> endless loop and the node never becomes healthy.
> Please consider another scenario, a common case where there are at least 3
> Solr instances in production with 24/7 high availability, light-to-heavy
> indexing and heavy querying.
> When we have to restart a Solr instance, for whatever reason, the number of
> seconds or minutes that we have to wait until all the cores come up could
> be pretty high.
> If we don't configure the liveness probe right, Kubernetes can restart the
> next instance while the former is still recovering or coming up, i.e. not
> yet ready.
> So, for example, when we have to apply a change to the Solr config on all
> the Solr instances, we really can't shut down more than one of them at a
> time.
> When one is restarted we must wait for the full availability of that
> instance, and in the meantime the two remaining instances must have all
> their cores up and running.
> In other words, when you restart a Solr instance, the increased load on
> the remaining instances usually slows the overall performance, but if done
> badly it can bring the cluster down.
>
>
> On Mon, Nov 1, 2021 at 5:10 PM Jan Høydahl <jan....@cominvent.com> wrote:
>
>> If recovery failed, then that core is dead, it has given up.
>> So if an agent has just restarted or started a node, then it will wait
>> until all cores have a "stable" or "final" state before it declares the
>> NODE as healthy and considers restarting other nodes.
>> If a core (replica of a shard in a collection) is in the DOWN state, it
>> has just booted and will soon go into RECOVERING. It will stay in
>> RECOVERING until it is either OK or RECOVERY_FAILED.
>> There is no point in waiting in an endless loop for every single core on a
>> node to come up; we just want them to finish initializing and enter a
>> stable state.
>> I guess other logic in solr-operator will take care of deciding how many
>> replicas of a shard are live, as to whether it is safe to take down the
>> next pod/node.
>>
>> Jan
>>
>>> On 31 Oct 2021, at 16:14, 戴晓彬 <xiaobin_...@foxmail.com> wrote:
>>>
>>> I'm a little puzzled, why does UNHEALTHY_STATES not contain
>>> State.RECOVERY_FAILED?
>>>
>>>> On 31 Oct 2021, at 22:45, Jan Høydahl <jan....@cominvent.com> wrote:
>>>>
>>>> See
>>>> https://solr.apache.org/guide/8_10/implicit-requesthandlers.html#admin-handlers,
>>>> you can query each node with
>>>>
>>>> http://node:8983/api/node/health?requireHealthyCores=true
>>>>
>>>> It will only return HTTP 200 if all active cores on the node are
>>>> healthy (none starting or recovering).
>>>>
>>>> Jan
>>>>
>>>>> On 27 Oct 2021, at 17:27, Vincenzo D'Amore <v.dam...@gmail.com> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> when a Solr instance is started I would like to be sure that all the
>>>>> indexes present are up and running, in other words that the instance
>>>>> is healthy.
>>>>> The healthy status (aka liveness/readiness) is especially useful when a
>>>>> Kubernetes SolrCloud cluster has to be restarted for any configuration
>>>>> management needs and you want to apply your change one node at a time.
>>>>> AFAIK I can ping only one index at a time, but there is no way out of
>>>>> the box to test that a bunch of indexes are active (green status).
>>>>> Have you ever faced the same problem? What do you think?
>>>>>
>>>>> Best regards,
>>>>> Vincenzo
>>>>>
>>>>> --
>>>>> Vincenzo D'Amore
>>>>
>>>
>>
>>
>
> --
> Vincenzo D'Amore
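PS: for the "bunch of indexes" question at the bottom of the thread, there is nothing out of the box beyond the health handler above, but you could script a cluster-wide check yourself by walking the Collections API CLUSTERSTATUS response and verifying that every replica reports the "active" state. A rough sketch; the JSON layout assumed below is from memory, so verify it against your Solr version:

import json
import urllib.request

def all_replicas_active(solr_base_url: str) -> bool:
    # Assumed CLUSTERSTATUS layout:
    # cluster -> collections -> <coll> -> shards -> <shard> -> replicas -> <replica> -> state
    url = f"{solr_base_url}/solr/admin/collections?action=CLUSTERSTATUS&wt=json"
    with urllib.request.urlopen(url, timeout=10) as resp:
        status = json.load(resp)
    collections = status.get("cluster", {}).get("collections", {})
    for coll_name, coll in collections.items():
        for shard_name, shard in coll.get("shards", {}).items():
            for replica_name, replica in shard.get("replicas", {}).items():
                if replica.get("state") != "active":
                    print(f"{coll_name}/{shard_name}/{replica_name} is {replica.get('state')}")
                    return False
    return True

# Hypothetical usage (placeholder host):
# all_replicas_active("http://solr-0.example.com:8983")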