> How does Cassandra with vnodes exactly decide how many vnodes to move?

The num_tokens setting in the yaml file. What did you set this to?
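For reference, it lives in conf/cassandra.yaml; a minimal sketch (256 is just the commonly recommended default, not necessarily what your nodes are running with):

    # Number of tokens (vnodes) this node will claim on the ring.
    # Roughly equal values across nodes give roughly equal ownership.
    num_tokens: 256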
Cheers

-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 14/04/2013, at 11:56 AM, Rustam Aliyev <rustam.li...@code.az> wrote:

> Just a followup on this issue. Due to the cost of shuffle, we decided not to
> do it. Recently, we added a new node and ended up with a poorly balanced
> cluster:
>
> Datacenter: datacenter1
> =======================
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address    Load      Tokens  Owns   Host ID                               Rack
> UN  10.0.1.8   52.28 GB  260     18.3%  d28df6a6-c888-4658-9be1-f9e286368dce  rack1
> UN  10.0.1.11  55.21 GB  256      9.4%  7b0cf3c8-0c42-4443-9b0c-68f794299443  rack1
> UN  10.0.1.2   49.03 GB  259     17.9%  2d308bc3-1fd7-4fa4-b33f-cbbbdc557b2f  rack1
> UN  10.0.1.4   48.51 GB  255     18.4%  c253dcdf-3e93-495c-baf1-e4d2a033bce3  rack1
> UN  10.0.1.1   67.14 GB  253     17.9%  4f77fd70-b134-486b-9c25-cfea96b6d412  rack1
> UN  10.0.1.3   47.65 GB  253     18.0%  4d03690d-5363-42c1-85c2-5084596e09fc  rack1
>
> It looks like the new node took an equal number of vnodes from each of the
> other nodes - which is good. However, it's not clear why it ended up owning
> roughly half as much (9.4%) as the other nodes.
>
> How does Cassandra with vnodes exactly decide how many vnodes to move?
>
> Btw, during JOINING the nodetool status command does not show any information
> about the joining node. It appears only once the join has finished (on v1.2.3).
>
> -- Rustam
>
>
> On 08/04/2013 22:33, Rustam Aliyev wrote:
>> After 2 days of endless compactions and streaming I had to stop and cancel
>> the shuffle. One of the nodes even complained that there was no free disk
>> space (it grew from 30GB to 400GB). After all these problems the number of
>> moved tokens was less than 40 (out of 1280!).
>>
>> Now, when nodes start they report duplicate ranges. I wonder how bad that is
>> and how I can get rid of it?
>>
>>  INFO [GossipStage:1] 2013-04-09 02:16:37,920 StorageService.java (line 1386)
>>  Nodes /10.0.1.2 and /10.0.1.1 have the same token
>>  99027485685976232531333625990885670910. Ignoring /10.0.1.2
>>  INFO [GossipStage:1] 2013-04-09 02:16:37,921 StorageService.java (line 1386)
>>  Nodes /10.0.1.2 and /10.0.1.4 have the same token
>>  43199909863009765869373729459111198718. Ignoring /10.0.1.2
>>
>> Overall, I'm not sure how bad it is to leave the data unshuffled (I read the
>> DataStax blog post, it's not clear). When adding a new node, wouldn't it be
>> assigned ranges randomly from all nodes?
>>
>> Some other notes inline below:
>>
>> On 08/04/2013 15:00, Eric Evans wrote:
>>> [ Rustam Aliyev ]
>>>> Hi,
>>>>
>>>> After upgrading to vnodes I created and enabled the shuffle operation as
>>>> suggested. After running it for a couple of hours I had to disable it
>>>> because the nodes were not catching up with compactions. I repeated this
>>>> process 3 times (enable/disable).
>>>>
>>>> I have 5 nodes and each of them had ~35GB. After the shuffle operations
>>>> described above some nodes are now reaching ~170GB. In the log files I can
>>>> see the same files transferred 2-4 times to the same host within the same
>>>> shuffle session. Worst of all, after all of this only 20 vnodes out of
>>>> 1280 had been transferred. So if it continues at the same speed it will
>>>> take about a month or two to complete the shuffle.
>>> As Edward says, you'll need to issue a cleanup post-shuffle if you expect
>>> disk usage to match your expectations.
>>>
>>>> I had a few questions to better understand shuffle:
>>>>
>>>> 1. Does disabling and re-enabling shuffle start the shuffle process from
>>>> scratch, or does it resume from the last point?
>>> It resumes.
>>>
>>>> 2. Will vnode relocations speed up as the shuffle proceeds, or will they
>>>> continue at the same rate?
>>> The shuffle proceeds synchronously, 1 range at a time; it's not going to
>>> speed up as it progresses.
>>>
>>>> 3. Why do I see multiple transfers of the same file to the same host? e.g.:
>>>>
>>>> INFO [Streaming to /10.0.1.8:6] 2013-04-07 14:27:10,038
>>>> StreamReplyVerbHandler.java (line 44) Successfully sent
>>>> /u01/cassandra/data/Keyspace/Metadata/Keyspace-Metadata-ib-111-Data.db
>>>> to /10.0.1.8
>>>> INFO [Streaming to /10.0.1.8:7] 2013-04-07 16:27:07,427
>>>> StreamReplyVerbHandler.java (line 44) Successfully sent
>>>> /u01/cassandra/data/Keyspace/Metadata/Keyspace-Metadata-ib-111-Data.db
>>>> to /10.0.1.8
>>> I'm not sure, but perhaps that file contained data for two different
>>> ranges?
>> Does that mean that if I have a huge file (e.g. 20GB) which contains a lot
>> of ranges (let's say 100), it will be transferred once per range (20GB*100)?
>>>
>>>> 4. When I enable/disable shuffle I receive warning messages such as those
>>>> below. Do I need to worry about them?
>>>>
>>>> cassandra-shuffle -h localhost disable
>>>> Failed to enable shuffling on 10.0.1.1!
>>>> Failed to enable shuffling on 10.0.1.3!
>>> Is that the verbatim output? Did it report failing to enable when you
>>> tried to disable?
>> Yes, this is the verbatim output. It reports failure for enable as well as
>> disable. Nodes .1.1 and .1.3 were not RELOCATING unless I ran the
>> cassandra-shuffle enable command on them locally.
>>>
>>> As a rule of thumb though, you don't want a disable/enable to result in
>>> only a subset of nodes shuffling. Are there no other errors? What do the
>>> logs say?
>> No errors in the logs. Only INFO about streams and WARN about relocation.
>>>
>>>> I couldn't find many docs on shuffle, only read through the JIRA ticket
>>>> and Eric's original proposal.
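On the cleanup point above: once the shuffle (or any other range movement) has completed, the stale data is only reclaimed after running cleanup on every node. A rough sketch (the host is just an example, repeat for each node in turn):

    nodetool -h 10.0.1.1 cleanup

Cleanup rewrites the SSTables and drops the rows the node no longer owns, so it is itself compaction- and I/O-heavy.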