Hi Patryk, thanks. Unfortunately snapshotting was not an option here, as we run Solr in a StatefulSet on Kubernetes. When I changed the disk type from pd-ssd to Hyperdisk (due to new node types), the node came up and immediately created and mounted a new, empty disk, as per the volumeClaimTemplates in the StatefulSet. There's no "Kubernetes native" way that I can see to get snapshotting into that process.
I personally think it would be kind of nice to have Solr recognise that the replica referenced in ZK is gone from disk and just resync/recover it from the other nodes. Having it just sit in a "down" state when there are N other green nodes it could replicate back from feels odd. I have a bit of a utopian view that clusters should self-fix as much as possible without human intervention!

In case anyone was curious, what I've done to automate the REMOVE+ADD is write a Kubernetes operator which watches the Pods in the StatefulSet. Those Pods have a readinessGate (https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-readiness-gate) on them. The operator watches for the main Pod Ready event (i.e. its readiness probe passing, which means Solr is up) and then queries its replicas; if any of them are "down" it makes the API calls to delete and then add the replicas. Only once all the replicas are back does it add the status condition for the readinessGate, which subsequently puts the Pod back in the load balancer.

From: Patryk Mazurkiewicz <pmaz...@gmail.com>
Date: Friday, 9 May 2025 at 20:08
To: users@solr.apache.org <users@solr.apache.org>
Subject: Re: Recovery from disk failure

Hi Karl,

Attaching the new disk removed the old replica and its core.properties file references (which are part of core discovery), which meant the core could not load. The down replica is a ZK reference to that replica. DELETEREPLICA and then ADDREPLICA, as mentioned above, should fix it. You may consider taking a snapshot and then restoring from it when you create a new disk.

Thanks,
Patryk

On Wed, May 7, 2025 at 10:58 PM anon anon <anonimoussech...@gmail.com> wrote:
>
> did you try
> solr delete -c YOUR_DUPLICATA
> ? I am a noob and I am just asking.
>
> On Wed, 7 May 2025 at 09:07, Karl Stoney
> <karl.sto...@autotrader.co.uk.invalid> wrote:
>
> > Ah no worries, thanks for the reply.
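[Editor's note] The REMOVE+ADD step the operator performs maps onto two Solr Collections API calls, DELETEREPLICA and ADDREPLICA. A minimal sketch of building those calls follows; the collection, shard, and replica names are taken from the cluster state quoted later in this thread and are illustrative — a real operator would discover them from CLUSTERSTATUS rather than hard-code them.

```python
# Sketch of the DELETEREPLICA + ADDREPLICA remediation described above.
from urllib.parse import urlencode


def delete_replica_url(base: str, collection: str, shard: str, replica: str) -> str:
    """Build the Collections API call that drops the dead replica entry from ZK."""
    params = {"action": "DELETEREPLICA", "collection": collection,
              "shard": shard, "replica": replica}
    return f"{base}/admin/collections?{urlencode(params)}"


def add_replica_url(base: str, collection: str, shard: str, node: str) -> str:
    """Build the Collections API call that recreates the replica on the given
    node, which triggers a full sync from the shard leader."""
    params = {"action": "ADDREPLICA", "collection": collection,
              "shard": shard, "node": node}
    return f"{base}/admin/collections?{urlencode(params)}"


base = "http://solr-1.search-solr-next.svc.cluster.local/solr"
print(delete_replica_url(base, "postcodes-006", "shard1", "core_node10"))
print(add_replica_url(base, "postcodes-006", "shard1",
                      "solr-1.search-solr-next.svc.cluster.local:80_solr"))
```

The order matters: DELETEREPLICA first, so the stale ZK registration is gone before ADDREPLICA creates a fresh core on the (now empty) node.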
> > I already have a custom operator that watches the StatefulSets and does
> > some admin-type stuff; I guess I'll shim the recovery logic in there for now.
> >
> > From: Jan Høydahl <jan....@cominvent.com>
> > Date: Tuesday, 6 May 2025 at 17:45
> > To: users@solr.apache.org <users@solr.apache.org>
> > Subject: Re: Recovery from disk failure
> >
> > Hi,
> >
> > I have seen the same happening myself, and I agree it is somewhat
> > unexpected. I believe there may be some reason behind the behavior, but
> > cannot think of any right now.
> >
> > So what would work for you right now would be to get rid of the dead
> > replica in ZooKeeper (try DELETEREPLICA), and then do an ADDREPLICA on the
> > new empty box, which will create a new core and start syncing.
> > Not sure if you are able to remove the replica in that state, but give it
> > a try.
> > So, until we decide to build a different default behavior in the "startup
> > on empty disk but ZK says we should have collection-A-shard-1-replica-2"
> > case, the best way would be to first move all replicas away from the node
> > whose disk will be upgraded, and then move them back again.
> >
> > Jan
> >
> > > On 6 May 2025, at 13:55, Karl Stoney <karl.sto...@autotrader.co.uk.INVALID> wrote:
> > >
> > > Hi,
> > > I run SolrCloud on GKE, and I'm trying to move my pods to a new disk
> > > type. In doing so the disk will be brand new. I've landed in a position
> > > that I'm unsure how to recover from, where the new node is not syncing data
> > > from the leader.
> > >
> > > To explain exactly what's happening, let's say I have two nodes:
> > >
> > > * solr-0
> > > * solr-1
> > >
> > > Both are active and fully replicated.
> > > I take solr-1 down, point it at the new disk (which is empty), and
> > > bring it back up.
> > > The server starts fine, and I can access solr-1 via the UI, but it never
> > > recovers; in the "Cloud -> Graph" UI, I can see the shard on solr-1 is
> > > down.
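[Editor's note] Jan's "move all replicas away, then move them back" suggestion can be scripted against the Collections API's MOVEREPLICA action. A hedged sketch, assuming the replica and node names from this thread; in practice you would enumerate the replicas on the draining node first.

```python
# Sketch of evacuating a replica before a disk swap via MOVEREPLICA.
from urllib.parse import urlencode


def move_replica_url(base: str, collection: str, replica: str, target_node: str) -> str:
    """Build the Collections API call that moves one replica to another node."""
    params = {"action": "MOVEREPLICA", "collection": collection,
              "replica": replica, "targetNode": target_node}
    return f"{base}/admin/collections?{urlencode(params)}"


base = "http://solr-0.search-solr-next.svc.cluster.local/solr"
# Move the replica off solr-1 before its disk is replaced, then move it back
# (second call, reversed target) once the node is up on the new disk.
print(move_replica_url(base, "postcodes-006", "core_node10",
                       "solr-0.search-solr-next.svc.cluster.local:80_solr"))
```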
> > > I can see it in the "Cloud -> Nodes" GUI as up, however its collections
> > > have a funny state, for example: "postcodes-006_s1r9_(down): undefined",
> > > vs solr-0 which shows "postcodes-006_s1r11: 847.3Mb".
> > >
> > > I was expecting the node to come up, see that its disk was empty, and
> > > resync its data from the leader, but instead it just sat there doing nothing.
> > >
> > > The fact that I'm moving to new disks is somewhat moot; more broadly this is
> > > showing me that if we lost data on a node for whatever reason, it doesn't
> > > "fix itself" - which I always (maybe blindly) assumed it would, because
> > > when I bring up brand-new nodes (with a different name) it does.
> > >
> > > Could anyone advise what I've done wrong here, and what the process
> > > should be to get a node to resync its data entirely?
> > >
> > > This is what the API shows:
> > >
> > > "shard1":{
> > >   "range":"80000000-7fffffff",
> > >   "replicas":{
> > >     "core_node10":{
> > >       "core":"postcodes-006_shard1_replica_n9",
> > >       "node_name":"solr-1.search-solr-next.svc.cluster.local:80_solr",
> > >       "type":"NRT",
> > >       "state":"down",
> > >       "force_set_state":"false",
> > >       "base_url":"http://solr-1.search-solr-next.svc.cluster.local/solr"
> > >     },
> > >     "core_node12":{
> > >       "core":"postcodes-006_shard1_replica_n11",
> > >       "node_name":"solr-0.search-solr-next.svc.cluster.local:80_solr",
> > >       "type":"NRT",
> > >       "state":"active",
> > >       "leader":"true",
> > >       "force_set_state":"false",
> > >       "base_url":"http://solr-0.search-solr-next.svc.cluster.local/solr",
> > >       "property.preferredleader":"true"
> > >     }
> > >   },
> > >   "state":"active",
> > >   "health":"ORANGE"
> > > }
> > >
> > > Unless expressly stated otherwise in this email, this e-mail is sent on
> > behalf of Auto Trader Limited. Registered Office: 1 Tony Wilson Place,
> > Manchester, Lancashire, M15 4FN (Registered in England No. 03909628). Auto
> > Trader Limited is part of the Auto Trader Group Plc group. This email and
> > any files transmitted with it are confidential and may be legally
> > privileged, and intended solely for the use of the individual or entity to
> > whom they are addressed. If you have received this email in error please
> > notify the sender. This email message has been swept for the presence of
> > computer viruses.
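[Editor's note] The operator Karl describes needs to spot exactly the situation shown in the cluster state above: a replica stuck in "down" while a leader is active. A small sketch of that check, using a dict that mirrors the API output quoted in the thread (trimmed to the relevant fields):

```python
# Walking the shard state above to find "down" replicas that need the
# DELETEREPLICA/ADDREPLICA treatment.
shard1 = {
    "replicas": {
        "core_node10": {
            "core": "postcodes-006_shard1_replica_n9",
            "node_name": "solr-1.search-solr-next.svc.cluster.local:80_solr",
            "state": "down",
        },
        "core_node12": {
            "core": "postcodes-006_shard1_replica_n11",
            "node_name": "solr-0.search-solr-next.svc.cluster.local:80_solr",
            "state": "active",
            "leader": "true",
        },
    }
}


def down_replicas(shard: dict) -> list[tuple[str, str]]:
    """Return (replica_name, node_name) pairs for replicas stuck in 'down'."""
    return [(name, r["node_name"])
            for name, r in shard["replicas"].items()
            if r.get("state") == "down"]


print(down_replicas(shard1))
# → [('core_node10', 'solr-1.search-solr-next.svc.cluster.local:80_solr')]
```

On the sample state this flags core_node10 on solr-1, the replica that never recovered after the disk swap.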