For what it is worth, I finally wrote a blog post about this --> http://thelastpickle.com/blog/2016/02/25/removing-a-disk-mapping-from-cassandra.html
If you are not done yet, every step is detailed in there.

C*heers,
-----------------------
Alain Rodriguez - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-02-19 10:04 GMT+01:00 Alain RODRIGUEZ <arodr...@gmail.com>:

>> Alain, thanks for sharing! I'm confused why you do so many repetitive
>> rsyncs. Just being cautious or is there another reason? Also, why do you
>> have --delete-before when you're copying data to a temp (assumed empty)
>> directory?
>
>> Since they are immutable I do a first sync while everything is up and
>> running to the new location, which runs really long. Meanwhile new ones
>> are created and I sync them again online, with far fewer files to copy
>> now. After that I shut down the node and my last rsync has to copy only a
>> few files, which is quite fast, so the downtime for that node is within
>> minutes.
>
> Jan's guess is right, except for the "immutable" thing. Compaction can
> make big files go away, replaced by bigger ones you'll have to stream
> again.
>
> Here is a detailed explanation of why I did it this way.
>
> More precisely, let's say we have 10 files of 100 GB on the disk to
> remove (let's call it 'old-dir').
>
> I run a first rsync to an empty folder indeed (let's call it 'tmp-dir'),
> on the disk that will remain after the operation. Let's say this takes
> about 10 hours. This can be run on all nodes in parallel, though.
>
> So I now have 10 files of 100 GB in tmp-dir. But meanwhile one compaction
> was triggered and I now have 6 files of 100 GB and 1 of 350 GB in
> old-dir.
>
> At this point I disable compaction and stop the running ones.
>
> My second rsync has to remove from tmp-dir the 4 files that were
> compacted away, and that's why I use '--delete-before'. As tmp-dir needs
> to mirror old-dir, this is fine. This new operation takes 3.5 hours, also
> runnable in parallel. (Keep in mind C* won't compact anything for those
> 3.5 hours; that's why I did not stop compaction before the first rsync.
> In my case the dataset was 2 TB.)
>
> At this point I have 950 GB in tmp-dir, but meanwhile clients continued
> to write to the disk, let's say 50 GB more.
>
> The 3rd rsync will take 0.5 hour: no compaction ran, so I just have to
> add the diff to tmp-dir. Still runnable in parallel.
>
> Then the script stops the node, so it should be run sequentially, and it
> performs 2 more rsyncs. The first one takes the diff between the end of
> the 3rd rsync and the moment you stop the node; it should be a few
> seconds, minutes maybe, depending on how fast you run the script after
> the 3rd rsync ended. The second rsync in the script is a 'useless' one. I
> just like to control things: I run it and expect it to report no diff. It
> is just a way to stop the script if for some reason data is still being
> appended to old-dir.
>
> Then I just move all the files from tmp-dir to new-dir (the proper data
> dir remaining after the operation). This is an instant operation, as the
> files are not really moved: they are already on that disk, so it is only
> a rename at the filesystem level.
>
> I finally unmount and rm -rf old-dir.
>
> So the full op takes 10 h + 3.5 h + 0.5 h + (number of nodes * 0.1 h),
> and each node is down for about 5-10 min.
>
> VS
>
> The straightforward way (stop node, move, start node): 10 h * number of
> nodes, as this needs to be sequential. Plus each node is down for 10
> hours, so you have to repair them, as that is longer than the hinted
> handoff window...
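
For illustration, here is a minimal bash sketch of the per-node sequence
described above. It is not the script from the repository linked later in
the thread; the paths, mount points, and service command are assumptions,
and the timings in the comments just echo the example numbers:

    #!/bin/bash
    # Illustrative only. Adjust paths to your own layout.
    OLD_DIR=/mnt/disk-to-remove/cassandra/data   # 'old-dir': data dir on the disk being removed
    TMP_DIR=/mnt/disk-to-keep/cassandra/tmp      # 'tmp-dir': empty dir on the disk that stays
    NEW_DIR=/mnt/disk-to-keep/cassandra/data     # 'new-dir': data dir that remains afterwards

    mkdir -p "$TMP_DIR"

    # 1st rsync, node up, runnable on all nodes in parallel (~10 h in the example).
    rsync -a "$OLD_DIR/" "$TMP_DIR/"

    # Prevent new automatic compactions and stop the ones currently running.
    nodetool disableautocompaction
    nodetool stop COMPACTION

    # 2nd and 3rd rsyncs, node still up; --delete-before drops sstables that
    # compaction removed meanwhile, so tmp-dir keeps mirroring old-dir.
    rsync -a --delete-before "$OLD_DIR/" "$TMP_DIR/"    # ~3.5 h
    rsync -a --delete-before "$OLD_DIR/" "$TMP_DIR/"    # ~0.5 h, catches new writes

    # Sequential part: the node is down only for these few steps.
    sudo service cassandra stop                         # service name is an assumption

    rsync -a --delete-before "$OLD_DIR/" "$TMP_DIR/"    # last small diff
    rsync -ai --delete-before "$OLD_DIR/" "$TMP_DIR/"   # control run: should print nothing

    # Move the synced files under new-dir; mv within one filesystem is a
    # rename, so this is effectively instant.
    (cd "$TMP_DIR" && find . -type f | while read -r f; do
        mkdir -p "$NEW_DIR/$(dirname "$f")"
        mv "$f" "$NEW_DIR/$f"
    done)

    # Remove old-dir from data_file_directories in cassandra.yaml, then:
    sudo service cassandra start

    # Finally retire the old disk and its now-empty mount point.
    sudo umount /mnt/disk-to-remove
    sudo rm -rf /mnt/disk-to-remove

The design point is that everything up to the node stop can be rerun safely
and in parallel across nodes, so only the last two rsyncs and the move have
to fit inside the short downtime window.
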
> Branton, I did not go through your process, but I guess you will be able
> to review it by yourself after reading the above (typically, repair is
> not needed if you use the strategy I describe above, as the node is down
> for only 5-10 minutes). Also, I am not sure how "rsync -azvuiP
> /var/data/cassandra/data2/ /var/data/cassandra/data/" will behave; my
> guess is that it is going to do a copy, so this might be very long. My
> script performs an instant move, and as the next command is
> 'rm -Rf /var/data/cassandra/data2' I see no reason to copy rather than
> move the files.
>
> Your solution would probably work, but with big constraints from an
> operational point of view (very long operation + repair needed).
>
> Hope this long email will be useful; maybe I should blog about this. Let
> me know if the process above makes sense or if some things might be
> improved.
>
> C*heers,
> -----------------
> Alain Rodriguez
> France
>
> The Last Pickle
> http://www.thelastpickle.com
>
> 2016-02-19 7:19 GMT+01:00 Branton Davis <branton.da...@spanning.com>:
>
>> Jan, thanks! That makes perfect sense to run a second time before
>> stopping cassandra. I'll add that in when I do the production cluster.
>>
>> On Fri, Feb 19, 2016 at 12:16 AM, Jan Kesten <j.kes...@enercast.de>
>> wrote:
>>
>>> Hi Branton,
>>>
>>> Two cents from me - I didn't look through the script, but for the
>>> rsyncs I do pretty much the same when moving them. Since they are
>>> immutable I do a first sync while everything is up and running to the
>>> new location, which runs really long. Meanwhile new ones are created
>>> and I sync them again online, with far fewer files to copy now. After
>>> that I shut down the node and my last rsync has to copy only a few
>>> files, which is quite fast, so the downtime for that node is within
>>> minutes.
>>>
>>> Jan
>>>
>>> Sent from my iPhone
>>>
>>> On 18.02.2016 at 22:12, Branton Davis <branton.da...@spanning.com>
>>> wrote:
>>>
>>> Alain, thanks for sharing! I'm confused why you do so many repetitive
>>> rsyncs. Just being cautious or is there another reason? Also, why do
>>> you have --delete-before when you're copying data to a temp (assumed
>>> empty) directory?
>>>
>>> On Thu, Feb 18, 2016 at 4:12 AM, Alain RODRIGUEZ <arodr...@gmail.com>
>>> wrote:
>>>
>>>> I did the process a few weeks ago and ended up writing a runbook and
>>>> a script. I have anonymised it and shared it, FWIW:
>>>>
>>>> https://github.com/arodrime/cassandra-tools/tree/master/remove_disk
>>>>
>>>> It is basic bash. I tried to have the shortest downtime possible,
>>>> which makes it a bit more complex, but it allows you to do a lot in
>>>> parallel and only do a fast operation sequentially, reducing the
>>>> overall operation time.
>>>>
>>>> This worked fine for me, yet I might have made some errors while
>>>> making it configurable through variables. Be sure to be around if you
>>>> decide to run this. Also, I automated this further by using knife
>>>> (Chef), as I hate to repeat ops; this is something you might want to
>>>> consider.
>>>>
>>>> Hope this is useful,
>>>>
>>>> C*heers,
>>>> -----------------
>>>> Alain Rodriguez
>>>> France
>>>>
>>>> The Last Pickle
>>>> http://www.thelastpickle.com
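
As a rough idea of what the knife (Chef) automation mentioned above could
look like - the role name, host names, and script paths are placeholders,
not taken from the repository:

    # Fan the long online phase out to every Cassandra node at once;
    # knife ssh runs the command on all nodes matching the Chef search query.
    knife ssh 'role:cassandra' 'sudo bash /opt/runbooks/remove_disk_online.sh'

    # The short offline phase (stop node, final rsyncs, move, restart) is
    # then run one node at a time, so only a single replica is ever down.
    for node in cass01 cass02 cass03; do
        ssh "$node" 'sudo bash /opt/runbooks/remove_disk_offline.sh'
    done
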
>>>> 2016-02-18 8:28 GMT+01:00 Anishek Agarwal <anis...@gmail.com>:
>>>>
>>>>> Hey Branton,
>>>>>
>>>>> Please do let us know if you face any problems doing this.
>>>>>
>>>>> Thanks
>>>>> anishek
>>>>>
>>>>> On Thu, Feb 18, 2016 at 3:33 AM, Branton Davis
>>>>> <branton.da...@spanning.com> wrote:
>>>>>
>>>>>> We're about to do the same thing. It shouldn't be necessary to shut
>>>>>> down the entire cluster, right?
>>>>>>
>>>>>> On Wed, Feb 17, 2016 at 12:45 PM, Robert Coli <rc...@eventbrite.com>
>>>>>> wrote:
>>>>>>
>>>>>>> On Tue, Feb 16, 2016 at 11:29 PM, Anishek Agarwal
>>>>>>> <anis...@gmail.com> wrote:
>>>>>>>
>>>>>>>> To accomplish this, can I just copy the data from disk1 to disk2
>>>>>>>> within the relevant cassandra home location folders, change the
>>>>>>>> cassandra.yaml configuration and restart the node? Before starting
>>>>>>>> I will shut down the cluster.
>>>>>>>
>>>>>>> Yes.
>>>>>>>
>>>>>>> =Rob
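
To make the stop / copy / reconfigure / restart variant that Rob confirms
above concrete, here is a minimal per-node sketch. The paths and the
service command are assumptions; data_file_directories is the actual
cassandra.yaml setting involved:

    # One node at a time: stop Cassandra, copy the data off the disk being
    # retired, drop that disk from the config, restart.
    sudo service cassandra stop                      # service name is an assumption

    sudo rsync -a /data/disk1/cassandra/data/ /data/disk2/cassandra/data/

    # In cassandra.yaml, remove the retired disk from data_file_directories:
    #   data_file_directories:
    #       - /data/disk2/cassandra/data
    # (it previously also listed /data/disk1/cassandra/data)

    sudo service cassandra start

As discussed earlier in the thread, if a node stays down longer than the
hinted handoff window while the copy runs, it should be repaired
afterwards, which is the main operational cost of this simpler approach.
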