On Mon, Feb 26, 2018 at 3:23 AM Caspar Smit <caspars...@supernas.eu> wrote:
> 2018-02-24 7:10 GMT+01:00 David Turner <drakonst...@gmail.com>:
>
>> Caspar, it looks like your idea should work. Worst case scenario seems
>> like the OSDs wouldn't start, you'd put the old SSD back in and go back
>> to the idea of weighting them to 0, backfilling, then recreating the
>> OSDs. Definitely worth a try in my opinion, and I'd love to hear your
>> experience after.
>>
>
> Hi David,
>
> First of all, thank you for ALL your answers on this ML. You're really
> putting a lot of effort into answering the many questions asked here, and
> very often they contain invaluable information.
>
> To follow up on this post I went out and built a very small (Proxmox)
> cluster (3 OSDs per host) to test my suggestion of cloning the DB/WAL SSD.
> And it worked!
> Note: this was on Luminous v12.2.2 (all BlueStore, ceph-disk based OSDs).
>
> Here's what I did on one node:
>
> 1) ceph osd set noout
> 2) systemctl stop osd.0; systemctl stop osd.1; systemctl stop osd.2
> 3) ddrescue -f -n -vv <old SSD dev> <new SSD dev> /root/clone-db.log
> 4) removed the old SSD physically from the node
> 5) checked with "ceph -s" and already saw HEALTH_OK and all OSDs up/in
> 6) ceph osd unset noout
>
> I assume that once the ddrescue step is finished a 'partprobe' or
> something similar is triggered, and udev finds the DB partitions on the
> new SSD and starts the OSDs again (kind of what happens during hotplug).
> So it is probably better to clone the SSD in another (non-Ceph) system to
> avoid triggering any udev events.
>
> I also tested a reboot after this and everything still worked.
>
> The old SSD was 120GB and the new one is 256GB (cloning took around 4
> minutes). The delta of data was very low because it was a test cluster.
>
> All in all, the OSDs in question were 'down' for only 5 minutes, so I
> stayed within the default mon_osd_down_out_interval of 10 minutes and
> didn't actually need to set noout :)

I kicked off a brief discussion about this with some of the BlueStore guys
and they're aware of the problem with migrating across SSDs, but so far
it's just a Trello card:
https://trello.com/c/9cxTgG50/324-bluestore-add-remove-resize-wal-db

They do confirm you should be okay with dd'ing things across, assuming the
symlinks get set up correctly, as David noted.

I've got some other bad news, though: BlueStore has internal metadata about
the size of the block device it's using, so if you copy it onto a larger
block device, it will not actually make use of the additional space. :(
-Greg
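A rough way to see what Greg is describing is ceph-bluestore-tool show-label:
BlueStore writes a small label at the start of each of its devices that
records, among other things, the size it expects the device to be. This is a
sketch only, not from the thread itself; the OSD id and device paths are
examples, and whether a ceph-disk DB partition carries its own label may vary
by release.

  # Sketch: inspect the labels of one of the cloned OSDs (id 0 is an example)
  systemctl stop ceph-osd@0                      # safest to inspect with the OSD down
  ceph-bluestore-tool show-label \
      --dev /var/lib/ceph/osd/ceph-0/block       # JSON label, includes a "size" field
  ceph-bluestore-tool show-label \
      --dev /var/lib/ceph/osd/ceph-0/block.db    # label of the DB partition, if present
  blockdev --getsize64 /dev/sdX2                 # compare with the real partition size
  systemctl start ceph-osd@0

The Trello card above is where add/remove/resize support is tracked; newer
ceph-bluestore-tool builds have grown a bluefs-bdev-expand subcommand, but I
would not count on it being present or complete in 12.2.2.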
>
> Kind regards,
> Caspar
>
>
>> Nico, it is not possible to change the WAL or DB size, location, etc.
>> after OSD creation. If you want to change the configuration of the OSD
>> after creation, you have to remove it from the cluster and recreate it.
>> There is no functionality similar to how you could move, recreate, etc.
>> FileStore OSD journals. I think this might be on the radar as a feature,
>> but I don't know for certain. I definitely consider it to be a
>> regression of BlueStore.
>>
>>
>> On Fri, Feb 23, 2018, 9:13 AM Nico Schottelius <
>> nico.schottel...@ungleich.ch> wrote:
>>
>>>
>>> A very interesting question, and I would add the follow-up question:
>>>
>>> Is there an easy way to add an external DB/WAL device to an existing
>>> OSD?
>>>
>>> I suspect that it might be something along the lines of:
>>>
>>> - stop the OSD
>>> - create a link in .../ceph/osd/ceph-XX/block.db to the target device
>>> - (maybe run some kind of osd mkfs?)
>>> - start the OSD
>>>
>>> Has anyone done this so far, or recommendations on how to do it?
>>>
>>> Which also makes me wonder: what is actually the format of the WAL and
>>> BlockDB in BlueStore? Is there any documentation available about it?
>>>
>>> Best,
>>>
>>> Nico
>>>
>>>
>>> Caspar Smit <caspars...@supernas.eu> writes:
>>>
>>> > Hi All,
>>> >
>>> > What would be the proper way to preventively replace a DB/WAL SSD
>>> > (when it is nearing its DWPD/TBW limit and has not failed yet)?
>>> >
>>> > It hosts DB partitions for 5 OSDs.
>>> >
>>> > Maybe something like:
>>> >
>>> > 1) 'ceph osd reweight' the 5 OSDs to 0
>>> > 2) let backfilling complete
>>> > 3) destroy/remove the 5 OSDs
>>> > 4) replace the SSD
>>> > 5) create 5 new OSDs with separate DB partitions on the new SSD
>>> >
>>> > When these 5 OSDs are big HDDs (8TB), a LOT of data has to be moved,
>>> > so I thought maybe the following would work:
>>> >
>>> > 1) ceph osd set noout
>>> > 2) stop the 5 OSDs (systemctl stop)
>>> > 3) 'dd' the old SSD to a new SSD of the same or bigger size
>>> > 4) remove the old SSD
>>> > 5) start the 5 OSDs (systemctl start)
>>> > 6) let backfilling/recovery complete (only the delta of data between
>>> > the OSD stop and now)
>>> > 7) ceph osd unset noout
>>> >
>>> > Would this be a viable method to replace a DB SSD? Any udev/serial
>>> > nr/uuid stuff preventing this from working?
>>> >
>>> > Or is there another 'less hacky' way to replace a DB SSD without
>>> > moving too much data?
>>> >
>>> > Kind regards,
>>> > Caspar
>>>
>>>
>>> --
>>> Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
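As for the "udev/serial nr/uuid stuff" in Caspar's original question: a
bit-for-bit clone (dd or ddrescue) copies the GPT as well, so the partition
unique GUIDs, and therefore the /dev/disk/by-partuuid links that ceph-disk's
block.db symlinks point at, simply move to the new SSD. Just don't leave the
old SSD attached at the same time, or those GUIDs become ambiguous. A quick
sanity check before starting the OSDs, as a sketch only (a ceph-disk layout
and OSD ids 0-2 are assumptions):

  # Sketch: verify the DB symlinks resolve to partitions on the new SSD
  for id in 0 1 2; do
      readlink -f /var/lib/ceph/osd/ceph-$id/block.db   # should resolve to the new SSD
  done
  ls -l /dev/disk/by-partuuid/      # by-partuuid links are recreated by udev and
                                    # now point at the cloned partitions
  lsblk -o NAME,SIZE,PARTUUID       # confirm which physical disk owns those GUIDs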