Re: Recovery from disk failure

Jan Høydahl Tue, 06 May 2025 09:44:29 -0700

Hi,

I have seen the same thing happen myself, and I agree it is somewhat 
unexpected. I believe there may be some reason behind the behavior, but I 
cannot think of any right now.


What should work for you right now is to remove the dead replica from 
ZooKeeper (try DELETEREPLICA), and then run ADDREPLICA against the new, empty 
box, which will create a new core and start syncing.
I am not sure you can remove the replica while it is in that state, but give 
it a try.
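
As a rough sketch of those two calls, the snippet below only builds the 
Collections API URLs and prints them; it sends nothing. The collection, shard, 
replica and node names are copied from the cluster state quoted further down. 
The helper name collections_api_url is made up for illustration, and whether 
DELETEREPLICA succeeds on a replica in this state is untested.

```python
from urllib.parse import urlencode

# Assumption: any live node can take Collections API calls; this base URL is
# the solr-0 base_url from the quoted cluster state.
SOLR = "http://solr-0.search-solr-next.svc.cluster.local:80/solr"


def collections_api_url(base_url, action, **params):
    """Build a Collections API URL (hypothetical helper; no request is made)."""
    query = urlencode({"action": action, **params})
    return f"{base_url}/admin/collections?{query}"


# Step 1: drop the dead replica (core_node10 on solr-1) from the cluster state.
delete_url = collections_api_url(
    SOLR,
    "DELETEREPLICA",
    collection="postcodes-006",
    shard="shard1",
    replica="core_node10",
)

# Step 2: ask for a fresh replica of the same shard on the now-empty node.
add_url = collections_api_url(
    SOLR,
    "ADDREPLICA",
    collection="postcodes-006",
    shard="shard1",
    node="solr-1.search-solr-next.svc.cluster.local:80_solr",
)

print(delete_url)
print(add_url)
```

Review the printed URLs, then issue them in that order with curl or a browser.
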
So, until we decide to build a different default behavior for the "startup on 
empty disk but zk says we should have collection-A-shard-1-replica-2" case, the 
best way is to first move all replicas away from the node whose disk will be 
upgraded, and then move them back again.

Jan

> On 6 May 2025, at 13:55, Karl Stoney 
> <karl.sto...@autotrader.co.uk.INVALID> wrote:
> 
> Hi,
> I run SolrCloud on GKE, and I’m trying to move my pods to a new disk type. 
> In doing so the disk will be brand new. I’ve landed in a position that I’m 
> unsure how to recover from, where the new node is not syncing data from the 
> leader.
> 
> To explain exactly what’s happening, let’s say I have two nodes:
> 
>  * solr-0
>  * solr-1
> 
> And both are active and fully replicated.
> I take solr-1 down, and point it at the new disk (which is empty), and bring 
> it back up.
> The server starts fine and I can access solr-1 via the UI, but it never 
> recovers. In the “Cloud -> Graph” UI, I can see the shard on solr-1 is down.
> 
> I can see it in the “Cloud -> Nodes” GUI as up; however, its collections have 
> a funny state, for example: "postcodes-006_s1r9_(down):  undefined", vs 
> solr-0 which shows "postcodes-006_s1r11:  847.3Mb".
> 
> I was expecting the node to come up, see its disk was empty, and resync 
> its data from the leader, but instead it just sits there doing nothing.
> 
> The fact I’m moving to new disks is somewhat moot, more broadly this is 
> showing me that if we lost data on a node for whatever reason, it doesn’t 
> “fix itself” - which I always (maybe blindly) assumed it would, because when 
> I bring up brand new nodes (different name) it does.
> 
> Could anyone advise what I’ve done wrong here, and what the process should be 
> to get a node to resync its data entirely?
> 
> This is what the API shows:
> 
> 
> "shard1":{
>   "range":"80000000-7fffffff",
>   "replicas":{
>     "core_node10":{
>       "core":"postcodes-006_shard1_replica_n9",
>       "node_name":"solr-1.search-solr-next.svc.cluster.local:80_solr",
>       "type":"NRT",
>       "state":"down",
>       "force_set_state":"false",
>       "base_url":"http://solr-1.search-solr-next.svc.cluster.local:80/solr"
>     },
>     "core_node12":{
>       "core":"postcodes-006_shard1_replica_n11",
>       "node_name":"solr-0.search-solr-next.svc.cluster.local:80_solr",
>       "type":"NRT",
>       "state":"active",
>       "leader":"true",
>       "force_set_state":"false",
>       "base_url":"http://solr-0.search-solr-next.svc.cluster.local:80/solr",
>       "property.preferredleader":"true"
>     }
>   },
>   "state":"active",
>   "health":"ORANGE"
> }
