> How does Cassandra with vnodes exactly decide how many vnodes to move?

The num_tokens setting in the yaml file. What did you set this to?
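For reference, it lives in conf/cassandra.yaml; a minimal sketch (256 is just the commonly recommended default, not necessarily what your nodes are running with):

    # Number of tokens (vnodes) this node will claim on the ring.
    # Roughly equal values across nodes give roughly equal ownership.
    num_tokens: 256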
Cheers

-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 14/04/2013, at 11:56 AM, Rustam Aliyev <rustam.li...@code.az> wrote:

> Just a followup on this issue. Due to the cost of shuffle, we decided not to
> do it. Recently, we added a new node and ended up with a poorly balanced
> cluster:
>
> Datacenter: datacenter1
> =======================
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address    Load      Tokens  Owns   Host ID                               Rack
> UN  10.0.1.8   52.28 GB  260     18.3%  d28df6a6-c888-4658-9be1-f9e286368dce  rack1
> UN  10.0.1.11  55.21 GB  256      9.4%  7b0cf3c8-0c42-4443-9b0c-68f794299443  rack1
> UN  10.0.1.2   49.03 GB  259     17.9%  2d308bc3-1fd7-4fa4-b33f-cbbbdc557b2f  rack1
> UN  10.0.1.4   48.51 GB  255     18.4%  c253dcdf-3e93-495c-baf1-e4d2a033bce3  rack1
> UN  10.0.1.1   67.14 GB  253     17.9%  4f77fd70-b134-486b-9c25-cfea96b6d412  rack1
> UN  10.0.1.3   47.65 GB  253     18.0%  4d03690d-5363-42c1-85c2-5084596e09fc  rack1
>
> It looks like the new node took an equal number of vnodes from each of the
> other nodes - which is good. However, it's not clear why it ended up owning
> roughly half as much (9.4%) as the other nodes.
>
> How does Cassandra with vnodes exactly decide how many vnodes to move?
>
> Btw, during JOINING the nodetool status command does not show any information
> about the joining node. It appears only once the join has finished (on v1.2.3).
>
> -- Rustam
>
>
> On 08/04/2013 22:33, Rustam Aliyev wrote:
>> After 2 days of endless compactions and streaming I had to stop and cancel
>> the shuffle. One of the nodes even complained that there was no free disk
>> space (it grew from 30GB to 400GB). After all these problems the number of
>> moved tokens was less than 40 (out of 1280!).
>>
>> Now, when nodes start they report duplicate ranges. I wonder how bad that is
>> and how I can get rid of it?
>>
>>  INFO [GossipStage:1] 2013-04-09 02:16:37,920 StorageService.java (line 1386)
>>  Nodes /10.0.1.2 and /10.0.1.1 have the same token
>>  99027485685976232531333625990885670910. Ignoring /10.0.1.2
>>  INFO [GossipStage:1] 2013-04-09 02:16:37,921 StorageService.java (line 1386)
>>  Nodes /10.0.1.2 and /10.0.1.4 have the same token
>>  43199909863009765869373729459111198718. Ignoring /10.0.1.2
>>
>> Overall, I'm not sure how bad it is to leave the data unshuffled (I read the
>> DataStax blog post, it's not clear). When adding a new node, wouldn't it be
>> assigned ranges randomly from all nodes?
>>
>> Some other notes inline below:
>>
>> On 08/04/2013 15:00, Eric Evans wrote:
>>> [ Rustam Aliyev ]
>>>> Hi,
>>>>
>>>> After upgrading to vnodes I created and enabled the shuffle operation as
>>>> suggested. After running it for a couple of hours I had to disable it
>>>> because the nodes were not catching up with compactions. I repeated this
>>>> process 3 times (enable/disable).
>>>>
>>>> I have 5 nodes and each of them had ~35GB. After the shuffle operations
>>>> described above some nodes are now reaching ~170GB. In the log files I can
>>>> see the same files transferred 2-4 times to the same host within the same
>>>> shuffle session. Worst of all, after all of this only 20 vnodes out of
>>>> 1280 had been transferred. So if it continues at the same speed it will
>>>> take about a month or two to complete the shuffle.
>>> As Edward says, you'll need to issue a cleanup post-shuffle if you expect
>>> disk usage to match your expectations.
>>>
>>>> I had a few questions to better understand shuffle:
>>>>
>>>> 1. Does disabling and re-enabling shuffle start the shuffle process from
>>>> scratch, or does it resume from the last point?
>>> It resumes.
>>>
>>>> 2. Will vnode relocations speed up as the shuffle proceeds, or will they
>>>> continue at the same rate?
>>> The shuffle proceeds synchronously, 1 range at a time; it's not going to
>>> speed up as it progresses.
>>>
>>>> 3. Why do I see multiple transfers of the same file to the same host? e.g.:
>>>>
>>>> INFO [Streaming to /10.0.1.8:6] 2013-04-07 14:27:10,038
>>>> StreamReplyVerbHandler.java (line 44) Successfully sent
>>>> /u01/cassandra/data/Keyspace/Metadata/Keyspace-Metadata-ib-111-Data.db
>>>> to /10.0.1.8
>>>> INFO [Streaming to /10.0.1.8:7] 2013-04-07 16:27:07,427
>>>> StreamReplyVerbHandler.java (line 44) Successfully sent
>>>> /u01/cassandra/data/Keyspace/Metadata/Keyspace-Metadata-ib-111-Data.db
>>>> to /10.0.1.8
>>> I'm not sure, but perhaps that file contained data for two different
>>> ranges?
>> Does that mean that if I have a huge file (e.g. 20GB) which contains a lot
>> of ranges (let's say 100), it will be transferred once per range (20GB*100)?
>>>
>>>> 4. When I enable/disable shuffle I receive warning messages such as those
>>>> below. Do I need to worry about them?
>>>>
>>>> cassandra-shuffle -h localhost disable
>>>> Failed to enable shuffling on 10.0.1.1!
>>>> Failed to enable shuffling on 10.0.1.3!
>>> Is that the verbatim output? Did it report failing to enable when you
>>> tried to disable?
>> Yes, this is the verbatim output. It reports failure for enable as well as
>> disable. Nodes .1.1 and .1.3 were not RELOCATING unless I ran the
>> cassandra-shuffle enable command on them locally.
>>>
>>> As a rule of thumb though, you don't want a disable/enable to result in
>>> only a subset of nodes shuffling. Are there no other errors? What do the
>>> logs say?
>> No errors in the logs. Only INFO about streams and WARN about relocation.
>>>
>>>> I couldn't find many docs on shuffle, only read through the JIRA ticket
>>>> and Eric's original proposal.
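On the cleanup point above: once the shuffle (or any other range movement) has completed, the stale data is only reclaimed after running cleanup on every node. A rough sketch (the host is just an example, repeat for each node in turn):

    nodetool -h 10.0.1.1 cleanup

Cleanup rewrites the SSTables and drops the rows the node no longer owns, so it is itself compaction- and I/O-heavy.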