Hi,
After messing around with my Cassandra cluster recently, I think I need a 
basic understanding of how data streaming works behind the scenes.
Let's say we have a three-node cluster with RF = 3. Suppose node 3 dies for 
some reason and I want to replace it with a new node with the same (maybe 
minus one) range. During the bootstrap, how is the data streamed?
From what I observed, node 3 has replicas of its primary range on nodes 4 and 
5, so it streams the data from them and starts to compact it. Node 3 also holds 
a replica of node 2's primary range, so it streams data from nodes 2 and 4. 
Similarly, it holds a replica of node 1's primary range, so data is streamed 
from nodes 1 and 2. So during bootstrapping, it basically gets the data from 
all the other replicas (2 copies each). Does that mean it requires double the 
disk space to hold the data? And over time, will those SSTables be compacted 
and the redundant data removed? Is that true?
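
To make the layout I am describing concrete, here is a minimal sketch of the 
replica placement. It assumes a five-node ring (since nodes 4 and 5 show up in 
what I observed) and SimpleStrategy-style placement, where a range is stored on 
its owner plus the next RF - 1 nodes clockwise. The function names are my own 
and purely illustrative. It only shows which nodes hold copies of each range; 
it says nothing about whether bootstrap pulls from one of them or all of them.

```python
def replicas_for(owner, nodes, rf):
    """Nodes storing the primary range of `owner`: the owner itself plus
    the next rf - 1 nodes clockwise (assumed SimpleStrategy-like placement)."""
    i = nodes.index(owner)
    return [nodes[(i + k) % len(nodes)] for k in range(rf)]


def ranges_held_by(node, nodes, rf):
    """For every primary range that `node` stores a copy of, list the
    other nodes that also store that range."""
    held = {}
    for owner in nodes:
        reps = replicas_for(owner, nodes, rf)
        if node in reps:
            held[owner] = [r for r in reps if r != node]
    return held


nodes = [1, 2, 3, 4, 5]  # assumed ring order
sources = ranges_held_by(3, nodes, rf=3)
for owner, peers in sorted(sources.items()):
    print(f"primary range of node {owner}: other replicas on nodes {peers}")
```

For node 3 this lists nodes 1 and 2, nodes 2 and 4, and nodes 4 and 5, which 
matches the stream sources I saw.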

If we issue nodetool repair -pr on node 3, apart from data streaming from 
nodes 4 and 5 to node 3, we also see data streaming between nodes 4 and 5, 
since they hold replicas too. But I don't see any log entries regarding 
"merkle tree calculation" on nodes 4 and 5. I'm just wondering how they know 
what data to stream in order to repair nodes 4 and 5.
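
For reference, this is how I picture the Merkle tree comparison in general. It 
is a toy sketch, not Cassandra's actual code: the bucketing, the hash 
functions, and all the names below are my own assumptions. Each replica hashes 
its rows into a fixed number of leaves, and only the leaves whose hashes differ 
would need to be streamed.

```python
import hashlib


def bucket_of(key, num_leaves):
    """Assign a row key to a leaf (hypothetical bucketing, for illustration)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_leaves


def leaf_hashes(rows, num_leaves):
    """One digest per leaf, covering every row that falls into that leaf."""
    leaves = [hashlib.sha256() for _ in range(num_leaves)]
    for key in sorted(rows):
        leaves[bucket_of(key, num_leaves)].update(f"{key}={rows[key]};".encode())
    return [leaf.hexdigest() for leaf in leaves]


def root_hash(leaves):
    """Combine leaf digests pairwise up to one root (num_leaves must be a power of 2)."""
    level = leaves
    while len(level) > 1:
        level = [
            hashlib.sha256((level[i] + level[i + 1]).encode()).hexdigest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


def differing_leaves(a, b):
    """Indexes of the leaves whose digests disagree between two replicas."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


NUM_LEAVES = 8
replica_a = {"k1": "v1", "k2": "v2", "k3": "v3"}
replica_b = {"k1": "v1", "k2": "stale", "k3": "v3"}  # one out-of-date row

leaves_a = leaf_hashes(replica_a, NUM_LEAVES)
leaves_b = leaf_hashes(replica_b, NUM_LEAVES)
print("roots match:", root_hash(leaves_a) == root_hash(leaves_b))
print("leaves to stream:", differing_leaves(leaves_a, leaves_b))
```

If the roots match, nothing needs to be streamed; otherwise only the rows in 
the differing leaves do.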

Thanks.
-Wei
