These SSDs are definitely up to the task (3-5 DWPD over 5 years); however, I 
mostly act out of an abundance of caution and try to minimize unnecessary data 
movement so as not to exacerbate wear.
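
As a back-of-the-envelope check (assuming, purely for illustration, a 1.92 TB 
drive rated at 3 DWPD): 1.92 TB/day x 3 x 365 days x 5 years ≈ 10.5 PB of rated 
writes, so a rebalance that shuffles a few TB is only a small fraction of that.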

I definitely could; I just err on the side of conservative wear.

Reed

> On Aug 6, 2018, at 11:19 AM, Richard Hesketh <richard.hesk...@rd.bbc.co.uk> 
> wrote:
> 
> I would have thought that with the write endurance on modern SSDs,
> additional write wear from the occasional rebalance would honestly be
> negligible? If you're hitting them hard enough that you're actually
> worried about your write endurance, a rebalance or two is peanuts
> compared to your normal I/O. If you're not, then there's more than
> enough write endurance in an SSD to handle daily rebalances for years.
> 
> On 06/08/18 17:05, Reed Dier wrote:
>> This has been my modus operandi when replacing drives.
>> 
>> With only ~50 OSDs for each drive type/pool, rebalancing can be a lengthy 
>> process, and in the case of SSDs, shuffling data adds unnecessary write 
>> wear to the disks.
>> 
>> When migrating from filestore to bluestore, I would actually forklift an 
>> entire failure domain using the script below together with the noout, 
>> norebalance, and norecover flags.
>> 
>> This would keep crush from pushing data around until I had all of the drives 
>> replaced, and would then keep crush from trying to recover until I was ready.
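>> 
>> For completeness, the flag handling around that is just the standard CLI 
>> toggles, set before touching anything and unset once every drive is back in 
>> (a sketch):
>> 
>>     ceph osd set noout
>>     ceph osd set norebalance
>>     ceph osd set norecover
>>     # ... replace each drive in the failure domain with the script below ...
>>     ceph osd unset norecover
>>     ceph osd unset norebalance
>>     ceph osd unset noout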
>> 
>>> # $1 use $ID, i.e. the numeric osd id (e.g. 12)
>>> # $2 use $DATA, i.e. the data device name without the /dev/ prefix (e.g. sdx)
>>> # $3 use $NVME, i.e. the db device name without the /dev/ prefix (e.g. nvmeXnXpX)
>>> 
>>> sudo systemctl stop ceph-osd@$1.service
>>> sudo ceph-osd -i $1 --flush-journal
>>> sudo umount /var/lib/ceph/osd/ceph-$1
>>> sudo ceph-volume lvm zap /dev/$2
>>> ceph osd crush remove osd.$1
>>> ceph auth del osd.$1
>>> ceph osd rm osd.$1
>>> sudo ceph-volume lvm create --bluestore --data /dev/$2 --block.db /dev/$3
>> 
>> For a single drive, this would stop it, remove it from crush, and create a 
>> new one (letting it retake the old/existing osd id); then, once I unset the 
>> norebalance/norecover flags, it backfills to the replaced drive from the 
>> other copies and doesn't move any other data around.
>> That script is somewhat specific to the filestore-to-bluestore migration, as 
>> the flush-journal command is no longer used with bluestore.
>> 
>> Hope that's helpful.
>> 
>> Reed
>> 
>>> On Aug 6, 2018, at 9:30 AM, Richard Hesketh <richard.hesk...@rd.bbc.co.uk> 
>>> wrote:
>>> 
>>> Waiting for rebalancing is considered the safest way, since it ensures
>>> you retain your normal full number of replicas at all times. If you take
>>> the disk out before rebalancing is complete, you will be causing some
>>> PGs to lose a replica. That is a risk to your data redundancy, but it
>>> might be an acceptable one if you prefer to just get the disk replaced
>>> quickly.
>>> 
>>> Personally, I'd say that if you're running at 3+ replicas, briefly losing
>>> one isn't the end of the world; you'd still need two more simultaneous disk
>>> failures to actually lose data, though one failure would cause inactive PGs
>>> (because you are running with min_size >= 2, right?). If running pools at
>>> size = 2, I absolutely would not remove a disk without waiting for
>>> rebalancing unless that disk was actively failing so badly that it was
>>> making rebalancing impossible.
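>>> 
>>> (If you want to double-check that before pulling a disk, with a hypothetical
>>> pool name "mypool":
>>> 
>>>     ceph osd pool get mypool min_size
>>>     ceph osd pool set mypool min_size 2
>>> 
>>> is all it takes.)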
>>> 
>>> Rich
>>> 
>>> On 06/08/18 15:20, Josef Zelenka wrote:
>>>> Hi, our procedure is usually (assuming the cluster was healthy before the
>>>> failure, with 2 replicas as the crush rule):
>>>> 
>>>> 1. Stop the OSD process (to keep it from coming up and down and putting
>>>> load on the cluster)
>>>> 
>>>> 2. Wait for the "reweight" to come to 0 (happens after 5 min I think - it
>>>> can be set manually, but I let it happen by itself)
>>>> 
>>>> 3. Remove the OSD from the cluster (ceph auth del, ceph osd crush remove,
>>>> ceph osd rm - see the sketch after this list)
>>>> 
>>>> 4. Note down the journal partitions if needed
>>>> 
>>>> 5. Unmount the drive and replace the disk with a new one
>>>> 
>>>> 6. Ensure permissions are set to ceph:ceph in /dev
>>>> 
>>>> 7. Run mklabel gpt on the new drive
>>>> 
>>>> 8. Create the new OSD with ceph-disk prepare (this automatically adds it
>>>> to the crush map)
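>>>> 
>>>> For illustration, steps 3 and 8 with a placeholder OSD id (12), data disk
>>>> (/dev/sdx) and journal partition (/dev/nvme0n1p1) look roughly like this
>>>> (ceph-disk option syntax varies a bit between releases):
>>>> 
>>>>     ceph osd crush remove osd.12
>>>>     ceph auth del osd.12
>>>>     ceph osd rm osd.12
>>>>     # ... swap the disk, fix ceph:ceph perms, mklabel gpt ...
>>>>     ceph-disk prepare /dev/sdx /dev/nvme0n1p1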
>>>> 
>>>> 
>>>> Your procedure sounds reasonable to me; as far as I'm concerned, you
>>>> shouldn't have to wait for rebalancing after you remove the OSD. All this
>>>> might not be 100% by the Ceph book, but it works for us :)
>>>> 
>>>> Josef
>>>> 
>>>> 
>>>> On 06/08/18 16:15, Iztok Gregori wrote:
>>>>> Hi Everyone,
>>>>> 
>>>>> What is the best way to replace a failing OSD hard disk (SMART Health
>>>>> Status: HARDWARE IMPENDING FAILURE)?
>>>>> 
>>>>> Normally I will (commands sketched after this list):
>>>>> 
>>>>> 1. set the OSD as out
>>>>> 2. wait for rebalancing
>>>>> 3. stop the OSD on the osd-server (unmount if needed)
>>>>> 4. purge the OSD from CEPH
>>>>> 5. physically replace the disk with the new one
>>>>> 6. with ceph-deploy:
>>>>> 6a   zap the new disk (just in case)
>>>>> 6b   create the new OSD
>>>>> 7. add the new osd to the crush map.
>>>>> 8. wait for rebalancing.
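>>>>> 
>>>>> In command form (with a placeholder OSD id 123, host osd-host and disk
>>>>> /dev/sdx; ceph-deploy syntax depends on its version) that is roughly:
>>>>> 
>>>>>     ceph osd out 123                               # 1
>>>>>     # wait for rebalancing to finish               # 2
>>>>>     systemctl stop ceph-osd@123                    # 3
>>>>>     ceph osd purge 123 --yes-i-really-mean-it      # 4 (Luminous+)
>>>>>     # physically swap the disk                     # 5
>>>>>     ceph-deploy disk zap osd-host /dev/sdx         # 6a
>>>>>     ceph-deploy osd create --data /dev/sdx osd-host   # 6b
>>>>>     # 7/8: the new OSD adds itself to the crush map and backfills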
>>>>> 
>>>>> My questions are:
>>>>> 
>>>>> - Is my procedure reasonable?
>>>>> - What if I skip #2 and, instead of waiting for rebalancing, directly
>>>>> purge the OSD?
>>>>> - Is it better to reweight the OSD before taking it out?
>>>>> 
>>>>> I'm running a Luminous (12.2.2) cluster with 332 OSDs; the failure domain
>>>>> is host.
>>>>> 
>>>>> Thanks,
>>>>> Iztok
>>>>> 
>>>> 
>> 
> 
> 
