On Fri, Nov 12, 2021 at 10:54 AM Jan Høydahl <jan....@cominvent.com> wrote:
> I agree that this is a risk. It all comes back to your initial sizing of
> the cluster.
> If you have decided for three nodes, and have HA policy of tolerating loss
> of any one server at a time, then you have to fully stress test your
> system with only two of those three nodes. If the two nodes cannot handle
> peak traffic, then you are fooling yourself to believe that you have
> fulfilled your HA policy.
> Some more crucial systems even have an N+2 HA policy, i.e. you should
> tolerate loss/crash of two random servers at the same time. Even more
> important to test the system in the failing condition! Time is also a
> factor here. The longer time it takes for a single node to reboot, the
> more likely that another node will crash during that window. So keeping
> the restart time low is always a bonus.
>
> It could be that if your nodes are few and large, with lots of replicas
> and lots of data, that it would be better to switch to a strategy with
> more smaller/cheaper nodes with fewer replicas each. Then the consequence
> of a node loss is smaller, and it is quicker to recover.

Your reasoning is correct but, IMHO, it is a little bit theoretical. If we
are talking about Kubernetes and how a SolrCloud cluster is deployed, the
problem is still the liveness probe. May I add a few things:

- assuming we have N Solr instances with replication factor N for each
  core, if the liveness probe is not configured strictly enough, Kubernetes
  can restart all N instances in short order. In other words, with a lax
  liveness configuration, if N is not big enough, you can end up with every
  instance having one or more cores that are not ready.
- I would also add that many customers don't have the money, the time or
  the resources to run a bigger cluster or to implement such a thorough HA
  policy, so we need some give-and-take arrangements. On the other hand,
  being strict on the health check and having good monitoring can do the
  trick.

> I think and hope that the current liveliness logic in solr-operator is
> robust.

These days I have been digging into how the Solr instance liveness is
configured by the solr-operator. After installing the example with 3 nodes,
I see the liveness probe is based on /solr/admin/info/system, which is not
very helpful if what I said above is right.
This is the example I have used:
https://apache.github.io/solr-operator/docs/running-the-operator
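
For what it's worth, here is a rough sketch of what I would try instead:
point the readiness probe at the node health handler that Jan mentions
further down in this thread (/api/node/health?requireHealthyCores=true)
and leave the liveness probe on a cheap endpoint, so a node that is still
recovering is reported as not-ready instead of being killed in a loop.
This assumes the operator lets you override the probes under
customSolrKubeOptions.podOptions (please double-check the exact field
names against the CRD of the operator version you run); the names and
timings below are just placeholders:

apiVersion: solr.apache.org/v1beta1
kind: SolrCloud
metadata:
  name: example
spec:
  replicas: 3
  customSolrKubeOptions:
    podOptions:
      # Readiness gates traffic and the rolling restart on core health:
      # the pod only becomes Ready once every active core on the node is
      # healthy (none down or still recovering).
      readinessProbe:
        httpGet:
          path: /api/node/health?requireHealthyCores=true
          port: 8983
        periodSeconds: 10
        failureThreshold: 60   # tolerate long recoveries without flapping
      # Liveness stays on the cheap endpoint, otherwise a node stuck in a
      # long recovery would be restarted over and over (the endless loop
      # mentioned below).
      livenessProbe:
        httpGet:
          path: /solr/admin/info/system
          port: 8983
        initialDelaySeconds: 60
        periodSeconds: 10
        failureThreshold: 3

Since the pods are managed by a StatefulSet, a rolling restart should then
wait for each pod to become Ready, i.e. for all its cores to be healthy,
before the next pod is taken down. Again, this is just a sketch of the
idea, not something I have battle-tested.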
> Jan
>
> On 12 Nov 2021, at 10:43, Vincenzo D'Amore <v.dam...@gmail.com> wrote:
> >
> > Hi Jan,
> >
> > I agree, if liveness is not configured correctly we could end up in an
> > endless loop and the node never be healthy.
> > Please consider another scenario, a common case where there are at least
> > 3 solr instances in production 24/7 high availability with a situation of
> > index light/heavy and query-heavy.
> > When we have to restart a solr instance, for whatever reason, the number
> > of seconds or minutes that we have to wait until all the cores come up
> > could be pretty high.
> > If we don't configure the liveness right kubernetes can restart the next
> > instance but the former is still recovering, coming up or whatever but it
> > is not ready.
> > So, for example, when we have to apply a change to solr config on all the
> > solr instances we really can't shutdown more than one of them.
> > When restarted we must wait for the full availability of the instance and
> > in the meanwhile the two remaining instances must have all the cores up
> > and running.
> > In other words, when you restart a solr instance, an increase of load on
> > the remaining instances usually slows the overall performance but, if
> > done badly it can bring the cluster down.
> >
> >
> > On Mon, Nov 1, 2021 at 5:10 PM Jan Høydahl <jan....@cominvent.com> wrote:
> >
> >> If recovery failed, then that core is dead, it has given up.
> >> So if an agent has just restarted or started a node, then it will wait
> >> until all cores have a "stable" or "final" state, before it declares the
> >> NODE as healthy, and consider restarting other nodes.
> >> If a core (replica of a shard in a collection) is in DOWN state, it has
> >> just booted and will soon go into RECOVERING. It will stay in RECOVERING
> >> until it either is OK or RECOVERY_FAILED.
> >> There is no point in waiting in an endless loop for every single core on
> >> a node to come up, we just want them to finish initializing and enter a
> >> stable state.
> >> I guess other logic in solr-operator will take care of deciding how many
> >> replicas for a shard are live, as to whether it is safe to take down the
> >> next pod/node.
> >>
> >> Jan
> >>
> >>> On 31 Oct 2021, at 16:14, 戴晓彬 <xiaobin_...@foxmail.com> wrote:
> >>>
> >>> I'm a little puzzled, why UNHEALTHY_STATES does not contain
> >>> State.RECOVERY_FAILED
> >>>
> >>>> On 31 Oct 2021, at 22:45, Jan Høydahl <jan....@cominvent.com> wrote:
> >>>>
> >>>> See
> >>>> https://solr.apache.org/guide/8_10/implicit-requesthandlers.html#admin-handlers ,
> >>>> you can query each node with
> >>>>
> >>>> http://node:8983/api/node/health?requireHealthyCores=true
> >>>>
> >>>> It will only return HTTP 200 if all active cores on the node are
> >>>> healthy (none starting or recovering).
> >>>>
> >>>> Jan
> >>>>
> >>>>> On 27 Oct 2021, at 17:27, Vincenzo D'Amore <v.dam...@gmail.com> wrote:
> >>>>>
> >>>>> Hi all,
> >>>>>
> >>>>> when a Solr instance is started I would be sure all the indexes
> >>>>> present are up and running, in other words that the instance is
> >>>>> healthy.
> >>>>> The healthy status (aka liveness/readiness) is especially useful when
> >>>>> a Kubernetes SolrCloud cluster has to be restarted for any
> >>>>> configuration management needs and you want to apply your change one
> >>>>> node at a time.
> >>>>> AFAIK I can ping only one index at a time, but there is no way out of
> >>>>> the box to test that a bunch of indexes are active (green status).
> >>>>> Have you ever faced the same problem? What do you think?
> >>>>>
> >>>>> Best regards,
> >>>>> Vincenzo
> >>>>>
> >>>>> --
> >>>>> Vincenzo D'Amore
> >>>>
> >>>
> >>
> >
> > --
> > Vincenzo D'Amore

--
Vincenzo D'Amore