How does Cassandra with vnodes exactly decide how many vnodes to move?
The num_tokens setting in the yaml file. What did you set this to?
256, same as on all other nodes.
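For what it's worth, a quick way to confirm that every node really has the same
setting is to check the yaml directly on each host (the path below is an
assumption, it varies by install):

    grep '^num_tokens' /etc/cassandra/cassandra.yaml
    # expect: num_tokens: 256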
Cheers
-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand
@aaronmorton
http://www.thelastpickle.com
On 14/04/2013, at 11:56 AM, Rustam Aliyev <rustam.li...@code.az> wrote:
Just a follow-up on this issue. Due to the cost of shuffle, we decided not to do
it. Recently, we added a new node and ended up with a not-so-well-balanced cluster:
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns    Host ID                               Rack
UN  10.0.1.8   52.28 GB  260     18.3%   d28df6a6-c888-4658-9be1-f9e286368dce  rack1
UN  10.0.1.11  55.21 GB  256     9.4%    7b0cf3c8-0c42-4443-9b0c-68f794299443  rack1
UN  10.0.1.2   49.03 GB  259     17.9%   2d308bc3-1fd7-4fa4-b33f-cbbbdc557b2f  rack1
UN  10.0.1.4   48.51 GB  255     18.4%   c253dcdf-3e93-495c-baf1-e4d2a033bce3  rack1
UN  10.0.1.1   67.14 GB  253     17.9%   4f77fd70-b134-486b-9c25-cfea96b6d412  rack1
UN  10.0.1.3   47.65 GB  253     18.0%   4d03690d-5363-42c1-85c2-5084596e09fc  rack1
It looks like the new node took an equal number of vnodes from each existing
node, which is good. However, it's not clear why it ended up owning only about
half as much as the other nodes (9.4% vs ~18%).
How does Cassandra with vnodes exactly decide how many vnodes to move?
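As far as I understand it (please correct me if this is wrong), with num_tokens
set and no initial_token the joining node simply picks 256 random tokens; it
doesn't compute an even share from the existing nodes, so the ownership it ends
up with depends on where those random tokens land between the existing ones.
One way to double-check the numbers is to look at effective ownership for a
specific keyspace (keyspace name below is just a placeholder, and I'm assuming
this nodetool version accepts a keyspace argument here):

    nodetool -h 10.0.1.11 status MyKeyspace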
Btw, while a node is JOINING, the nodetool status command does not show any
information about it. The joining node appears only once the join has finished
(on v1.2.3).
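While it is joining, the incoming streams can at least be watched from the new
node itself, e.g.:

    nodetool -h 10.0.1.11 netstats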
-- Rustam
On 08/04/2013 22:33, Rustam Aliyev wrote:
After 2 days of endless compactions and streaming I had to stop and cancel the
shuffle. One of the nodes even complained that there was no free disk space left
(its data grew from 30GB to 400GB). After all these problems, fewer than 40
tokens had been moved (out of 1280!).
Now, when the nodes start they report duplicate ranges. I wonder how bad that is
and how I can get rid of it?
INFO [GossipStage:1] 2013-04-09 02:16:37,920 StorageService.java (line 1386) Nodes /10.0.1.2 and /10.0.1.1 have the same token 99027485685976232531333625990885670910. Ignoring /10.0.1.2
INFO [GossipStage:1] 2013-04-09 02:16:37,921 StorageService.java (line 1386) Nodes /10.0.1.2 and /10.0.1.4 have the same token 43199909863009765869373729459111198718. Ignoring /10.0.1.2
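Just as a sanity check (not something I found documented anywhere), grepping the
full ring listing shows which node gossip finally settled on for one of those
tokens:

    nodetool -h 10.0.1.1 ring | grep 99027485685976232531333625990885670910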
Overall, I'm not sure how bad it is to leave the data unshuffled (I read the
DataStax blog post, but it's not clear on this). When adding a new node, wouldn't
it be assigned ranges randomly from all nodes anyway?
Some other notes inline below:
On 08/04/2013 15:00, Eric Evans wrote:
[ Rustam Aliyev ]
Hi,
After upgrading to vnodes I created and enabled the shuffle operation as
suggested. After running for a couple of hours I had to disable it because the
nodes were not keeping up with compactions. I repeated this process 3 times
(enable/disable).
I have 5 nodes and each of them had ~35GB. After the shuffle operations described
above some nodes are now reaching ~170GB. In the log files I can see the same
files transferred 2-4 times to the same host within the same shuffle session.
Worst of all, after all of this only 20 vnodes out of 1280 had been transferred.
If it continues at the same speed it will take about a month or two to complete
the shuffle.
As Edward says, you'll need to issue a cleanup post-shuffle if you expect
to see disk usage match your expectations.
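For anyone else following along, I take that to mean the normal per-node cleanup,
run once the shuffle has finished, one node at a time since it is
compaction-heavy:

    nodetool -h 10.0.1.1 cleanup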
I had a few questions to better understand shuffle:
1. Does disabling and re-enabling shuffle start the shuffle process from scratch,
or does it resume from the last point?
It resumes.
2. Will vnode relocations speed up as the shuffle proceeds, or will the rate
remain the same?
The shuffle proceeds synchronously, 1 range at a time; it's not going to speed up
as it progresses.
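If it helps to gauge how much is left, the shuffle tool can list the pending
relocations (assuming I remember its sub-commands correctly):

    cassandra-shuffle -h localhost ls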
3. Why do I see multiple transfers of the same file to the same host? e.g.:
INFO [Streaming to /10.0.1.8:6] 2013-04-07 14:27:10,038 StreamReplyVerbHandler.java (line 44) Successfully sent /u01/cassandra/data/Keyspace/Metadata/Keyspace-Metadata-ib-111-Data.db to /10.0.1.8
INFO [Streaming to /10.0.1.8:7] 2013-04-07 16:27:07,427 StreamReplyVerbHandler.java (line 44) Successfully sent /u01/cassandra/data/Keyspace/Metadata/Keyspace-Metadata-ib-111-Data.db to /10.0.1.8
I'm not sure, but perhaps that file contained data for two different
ranges?
Does that mean that if I have a huge file (e.g. 20GB) which contains a lot of
ranges (let's say 100), it will be transferred once per range (20GB * 100)?
4. When I enable/disable shuffle I receive warning messages such as the ones
below. Do I need to worry about them?
cassandra-shuffle -h localhost disable
Failed to enable shuffling on 10.0.1.1!
Failed to enable shuffling on 10.0.1.3!
Is that the verbatim output? Did it report failing to enable when you
tried to disable?
Yes, this is the verbatim output. It prints "Failed to enable shuffling" for the
disable command as well as for enable. Nodes .1.1 and .1.3 did not start
RELOCATING unless I ran the cassandra-shuffle enable command on them locally.
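Running it locally on each node can be scripted over ssh instead of logging in
by hand, roughly (assuming password-less ssh to the two stubborn nodes):

    for h in 10.0.1.1 10.0.1.3; do
        ssh $h 'cassandra-shuffle -h localhost enable'
    done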
As a rule of thumb though, you don't want a disable/enable to result in only a
subset of nodes shuffling. Are there no other errors? What do the logs say?
No errors in logs. Only INFO about streams and WARN about relocation.
I couldn't find many docs on shuffle; I have only read through JIRA and the
original proposal by Eric.