Good point about the rack, Kyrill! That makes total sense to me. Deleting the system keyspace does not, though, since it contains all the essential information about the node. Maybe it only makes sense in conjunction with the replace_address_first_boot option. Some comments from the devs about this would be great.
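For reference, a rough sketch of what I mean by that option, using 10.10.10.223 (the node Kyrill replaced) as the example address; the config path is an assumption on my side, and on older versions the flag is -Dcassandra.replace_address rather than replace_address_first_boot:

  # On a brand-new, empty node (leave auto_bootstrap at its default of true):
  # tell Cassandra which dead node it is taking over, then start it.
  echo 'JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=10.10.10.223"' \
      | sudo tee -a /etc/cassandra/cassandra-env.sh
  sudo service cassandra start   # streams the old node's data and reuses its tokens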
Regards,
Jürgen

> On 03.02.2018 at 16:42, Kyrylo Lebediev <kyrylo_lebed...@epam.com> wrote:
>
> I've found a modified Carlos' article (more recent than the one I was referring to) and this one contains the same method as you described, Oleksandr:
> https://mrcalonso.com/2016/01/26/cassandra-instantaneous-in-place-node-replacement
>
> Thank you for your readiness to help!
>
> Kind Regards,
> Kyrill
>
> From: Kyrylo Lebediev <kyrylo_lebed...@epam.com>
> Sent: Saturday, February 3, 2018 12:23:15 PM
> To: User
> Subject: Re: Cassandra 2.1: replace running node without streaming
>
> Thank you Oleksandr,
> Just tested on 3.11.1 and it worked for me (you may see the logs below).
> Just realized that there is one important prerequisite for this method to work: the new node MUST be located in the same rack (in terms of C*) as the old one. Otherwise the correct replica placement order will be violated (I mean when replicas of the same token range should be placed in different racks).
>
> Anyway, even having a successful run of node replacement in the sandbox, I'm still in doubt.
> Just wondering why this procedure, which seems to be much easier than [add/remove node] or [replace a node], the documented ways for live node replacement, has never been included in the documentation.
> Does anybody in the ML know the reason for this?
>
> Also, for some reason in his article Carlos drops the files of the system keyspace (which contains the system.local table):
> "In the new node, delete all system tables except for the schema ones. This will ensure that the new Cassandra node will not have any corrupt or previous configuration assigned."
> sudo cd /var/lib/cassandra/data/system && sudo ls | grep -v schema | xargs -I {} sudo rm -rf {}
>
> http://engineering.mydrivesolutions.com/posts/cassandra_nodes_replacement/
> [Carlos, if you are here, might you please comment?]
>
> So still a mystery to me.....
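For concreteness, the in-place swap you tested would look roughly like this as commands; the data directory, service name and use of rsync are assumptions on my side, and the system keyspace is deliberately kept. The article's extra cleanup step is shown commented out, not least because its "sudo cd ..." form cannot actually run, cd being a shell builtin:

  # On the old node (10.10.10.223): flush and stop Cassandra cleanly.
  nodetool drain && sudo service cassandra stop

  # Copy the entire data directory, including the system keyspace, which holds
  # the node's saved tokens and host ID, to the new node (10.10.10.224).
  # The new node must use the same cluster_name and the same rack as the old one.
  sudo rsync -a /var/lib/cassandra/ 10.10.10.224:/var/lib/cassandra/

  # Cleanup step from Carlos' article (questionable, as discussed above):
  # sudo ls /var/lib/cassandra/data/system | grep -v schema \
  #     | xargs -I {} sudo rm -rf /var/lib/cassandra/data/system/{}

  # On the new node: set auto_bootstrap: false in cassandra.yaml and start it.
  # It should log "Using saved tokens ..." instead of allocating new ones.
  sudo service cassandra start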
>
> -----
> Logs for 3.11.1
> -----
>
> ====== Before:
> --  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
> UN  10.10.10.222  256.61 KiB  3       100.0%            bd504008-5ff0-4b6c-a3a6-a07049e61c31  rack1
> UN  10.10.10.223  225.65 KiB  3       100.0%            c562263f-4126-4935-b9f7-f4e7d0dc70b4  rack1  <<<<<<
> UN  10.10.10.221  187.39 KiB  3       100.0%            d312c083-8808-4c98-a3ab-72a7cd18b31f  rack1
>
> ======= After:
> --  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
> UN  10.10.10.222  245.84 KiB  3       100.0%            bd504008-5ff0-4b6c-a3a6-a07049e61c31  rack1
> UN  10.10.10.221  192.8 KiB   3       100.0%            d312c083-8808-4c98-a3ab-72a7cd18b31f  rack1
> UN  10.10.10.224  266.61 KiB  3       100.0%            c562263f-4126-4935-b9f7-f4e7d0dc70b4  rack1  <<<<<
>
>
> ====== Logs from another node (10.10.10.221):
> INFO [HANDSHAKE-/10.10.10.224] 2018-02-03 11:33:01,397 OutboundTcpConnection.java:560 - Handshaking version with /10.10.10.224
> INFO [GossipStage:1] 2018-02-03 11:33:01,431 Gossiper.java:1067 - Node /10.10.10.224 is now part of the cluster
> INFO [RequestResponseStage-1] 2018-02-03 11:33:02,190 Gossiper.java:1031 - InetAddress /10.10.10.224 is now UP
> INFO [RequestResponseStage-1] 2018-02-03 11:33:02,190 Gossiper.java:1031 - InetAddress /10.10.10.224 is now UP
> WARN [GossipStage:1] 2018-02-03 11:33:08,375 StorageService.java:2313 - Host ID collision for c562263f-4126-4935-b9f7-f4e7d0dc70b4 between /10.10.10.223 and /10.10.10.224; /10.10.10.224 is the new owner
> INFO [GossipTasks:1] 2018-02-03 11:33:08,806 Gossiper.java:810 - FatClient /10.10.10.223 has been silent for 30000ms, removing from gossip
>
> ====== Logs from new node:
> INFO [main] 2018-02-03 11:33:01,926 StorageService.java:1442 - JOINING: Finish joining ring
> INFO [GossipStage:1] 2018-02-03 11:33:02,659 Gossiper.java:1067 - Node /10.10.10.223 is now part of the cluster
> WARN [GossipStage:1] 2018-02-03 11:33:02,676 StorageService.java:2307 - Not updating host ID c562263f-4126-4935-b9f7-f4e7d0dc70b4 for /10.10.10.223 because it's mine
> INFO [GossipStage:1] 2018-02-03 11:33:02,683 StorageService.java:2365 - Nodes /10.10.10.223 and /10.10.10.224 have the same token -7774421781914237508. Ignoring /10.10.10.223
> INFO [GossipStage:1] 2018-02-03 11:33:02,686 StorageService.java:2365 - Nodes /10.10.10.223 and /10.10.10.224 have the same token 2257660731441815305. Ignoring /10.10.10.223
> INFO [GossipStage:1] 2018-02-03 11:33:02,692 StorageService.java:2365 - Nodes /10.10.10.223 and /10.10.10.224 have the same token 51879124242594885. Ignoring /10.10.10.223
> WARN [GossipTasks:1] 2018-02-03 11:33:03,985 Gossiper.java:789 - Gossip stage has 5 pending tasks; skipping status check (no nodes will be marked down)
> INFO [main] 2018-02-03 11:33:04,394 SecondaryIndexManager.java:509 - Executing pre-join tasks for: CFS(Keyspace='test', ColumnFamily='usr')
> WARN [GossipTasks:1] 2018-02-03 11:33:05,088 Gossiper.java:789 - Gossip stage has 7 pending tasks; skipping status check (no nodes will be marked down)
> INFO [GossipStage:1] 2018-02-03 11:33:05,718 Gossiper.java:1046 - InetAddress /10.10.10.223 is now DOWN
> INFO [main] 2018-02-03 11:33:06,872 StorageService.java:2268 - Node /10.10.10.224 state jump to NORMAL
> INFO [main] 2018-02-03 11:33:06,998 Gossiper.java:1655 - Waiting for gossip to settle...
> INFO [main] 2018-02-03 11:33:15,004 Gossiper.java:1686 - No gossip backlog; proceeding
> INFO [GossipTasks:1] 2018-02-03 11:33:20,114 Gossiper.java:1046 - InetAddress /10.10.10.222 is now DOWN   <<<<< have no idea why this appeared in logs
> INFO [main] 2018-02-03 11:33:20,566 NativeTransportService.java:70 - Netty using native Epoll event loop
> INFO [HANDSHAKE-/10.10.10.222] 2018-02-03 11:33:20,714 OutboundTcpConnection.java:560 - Handshaking version with /10.10.10.222
>
>
> Kind Regards,
> Kyrill
>
> From: Oleksandr Shulgin <oleksandr.shul...@zalando.de>
> Sent: Saturday, February 3, 2018 10:44:26 AM
> To: User
> Subject: Re: Cassandra 2.1: replace running node without streaming
>
> On 3 Feb 2018 08:49, "Jürgen Albersdorfer" <jalbersdor...@gmail.com> wrote:
> Cool, good to know. Do you know this is still true for 3.11.1?
>
> Well, I've never tried with that specific version, but this is pretty fundamental, so I would expect it to work the same way. Test in isolation if you want to be sure, though.
>
> I don't think this is documented anywhere, however, since I had the same doubts before seeing it work for the first time.
>
> --
> Alex
>
> On 03.02.2018 at 08:19, Oleksandr Shulgin <oleksandr.shul...@zalando.de> wrote:
>
>> On 3 Feb 2018 02:42, "Kyrylo Lebediev" <kyrylo_lebed...@epam.com> wrote:
>> Thanks, Oleksandr,
>> In my case I'll need to replace all nodes in the cluster (one by one), so streaming will introduce perceptible overhead.
>> My question is not about the data movement/copy itself, but more about all this token magic.
>>
>> Okay, let's say we stopped the old node and moved its data to the new node.
>> Once it's started with auto_bootstrap=false it will be added to the cluster like a usual node, just skipping the streaming stage, right?
>> For a cluster with vnodes enabled, during addition of a new node its token ranges are calculated automatically by C* on startup.
>>
>> So, how will C* know that this new node must be responsible for exactly the same token ranges as the old node was?
>> How would the rest of the nodes in the cluster ('peers') figure out that the old node should be replaced in the ring by the new one?
>> Do you know about some limitation for this process in case of C* 2.1.x with vnodes enabled?
>>
>> A node stores its tokens and host id in the system.local table. Next time it starts up, it will use the same tokens as previously, and the host id allows the rest of the cluster to see that it is the same node and ignore the IP address change. This happens regardless of the auto_bootstrap setting.
>>
>> Try "select * from system.local" to see what is recorded for the old node. When the new node starts up it should log "Using saved tokens" with the list of numbers. Other nodes should log something like "ignoring IP address change" for the affected node addresses.
>>
>> Be careful, though, to make sure that you put the data directory exactly where the new node expects to find it: otherwise it might just join as a brand new one, allocating new tokens. As a precaution it helps to ensure that the system user running the Cassandra process has no permission to create the data directory: this should stop the startup in case of misconfiguration.
>>
>> Cheers,
>> --
>> Alex
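To make Alex's checks concrete, this is roughly how I would verify it; the cqlsh connection settings and the log location are assumptions on my side:

  # On the old node, before stopping it: record its saved identity.
  cqlsh -e "SELECT host_id, tokens, rack, data_center FROM system.local;"

  # On the new node, after the first start with the copied data directory:
  # it should reuse that identity instead of allocating new tokens.
  grep -i "using saved tokens" /var/log/cassandra/system.log

  # On the other nodes: they should accept the new IP for the same host ID.
  grep -i "address change" /var/log/cassandra/system.log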