Gotcha.  As a side note, that setting is only used by ceph-disk, as
ceph-volume does not create partitions for the WAL or DB.  You need to
create those partitions manually if you are using anything other than a whole
block device when creating OSDs with ceph-volume.
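
For anyone who needs to do that by hand, a rough sketch (device names and
sizes here are just examples, adjust to your hardware):

  # carve a DB partition out of the SSD; sgdisk picks the next free number
  sgdisk --new=0:0:+40G --change-name=0:'ceph block.db' /dev/nvme0n1
  partprobe /dev/nvme0n1

  # hand the data device and the pre-made DB partition to ceph-volume
  ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1

ceph-volume just uses whatever --block.db points at; it will not create or
resize the partition for you.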

On Tue, Feb 27, 2018 at 8:20 AM Caspar Smit <caspars...@supernas.eu> wrote:

> David,
>
> Yes I know, I use 20GB partitions for 2TB disks as the journal. It was just
> to inform other people that Ceph's default of 1GB is pretty low.
> Now that I read my own sentence it indeed looks as if I were using 1GB
> partitions, sorry for the confusion.
>
> Caspar
>
> 2018-02-27 14:11 GMT+01:00 David Turner <drakonst...@gmail.com>:
>
>> If you're only using a 1GB DB partition, there is a very real possibility
>> it's already 100% full. The safe estimate for DB size seems to be 10GB per
>> 1TB, so for a 4TB OSD a 40GB DB should work for most use cases (except loads
>> and loads of small files). There are a few threads that mention how to check
>> how much of your DB partition is in use. Once it's full, it spills over to
>> the HDD.
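>>
>> For reference, one rough way to check (run on the OSD node; the counters
>> are in the bluefs section of the OSD's perf counters):
>>
>>   ceph daemon osd.<id> perf dump | python -m json.tool | \
>>     grep -E '(db|slow)_(used|total)_bytes'
>>
>> Compare db_used_bytes against db_total_bytes; a non-zero slow_used_bytes
>> means the DB has already spilled over onto the slow (HDD) device.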
>>
>>
>> On Tue, Feb 27, 2018, 6:19 AM Caspar Smit <caspars...@supernas.eu> wrote:
>>
>>> 2018-02-26 23:01 GMT+01:00 Gregory Farnum <gfar...@redhat.com>:
>>>
>>>> On Mon, Feb 26, 2018 at 3:23 AM Caspar Smit <caspars...@supernas.eu>
>>>> wrote:
>>>>
>>>>> 2018-02-24 7:10 GMT+01:00 David Turner <drakonst...@gmail.com>:
>>>>>
>>>>>> Caspar, it looks like your idea should work. Worst case scenario, it
>>>>>> seems like the OSD wouldn't start; you'd put the old SSD back in and go
>>>>>> back to the idea of weighting them to 0, backfilling, then recreating the
>>>>>> OSDs. Definitely worth a try in my opinion, and I'd love to hear about
>>>>>> your experience after.
>>>>>>
>>>>>>
>>>>> Hi David,
>>>>>
>>>>> First of all, thank you for ALL your answers on this ML; you're really
>>>>> putting a lot of effort into answering the many questions asked here, and
>>>>> very often your answers contain invaluable information.
>>>>>
>>>>>
>>>>> To follow up on this post, I went out and built a very small (Proxmox)
>>>>> cluster (3 OSDs per host) to test my suggestion of cloning the DB/WAL SSD.
>>>>> And it worked!
>>>>> Note: this was on Luminous v12.2.2 (all BlueStore, ceph-disk-based OSDs).
>>>>>
>>>>> Here's what I did on one node:
>>>>>
>>>>> 1) ceph osd set noout
>>>>> 2) systemctl stop ceph-osd@0; systemctl stop ceph-osd@1; systemctl stop ceph-osd@2
>>>>> 3) ddrescue -f -n -vv <old SSD dev> <new SSD dev> /root/clone-db.log
>>>>> 4) removed the old SSD physically from the node
>>>>> 5) checked with "ceph -s" and already saw HEALTH_OK and all OSDs up/in
>>>>> 6) ceph osd unset noout
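>>>>>
>>>>> (Not part of the steps above, but a quick sanity check after the clone is
>>>>> to confirm that each OSD's block.db symlink still resolves to a partition
>>>>> on the new SSD, e.g.:
>>>>>
>>>>>   ls -l /var/lib/ceph/osd/ceph-0/block.db
>>>>>   readlink -f /var/lib/ceph/osd/ceph-0/block.db
>>>>>
>>>>> and the same for block.wal if you have one.)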
>>>>>
>>>>> I assume that once the ddrescue step is finished a 'partprobe' or
>>>>> something similar is triggered, and udev finds the DB partitions on the
>>>>> new SSD and starts the OSDs again (kind of what happens during hotplug).
>>>>> So it is probably better to clone the SSD in another (non-Ceph) system so
>>>>> as not to trigger any udev events.
>>>>>
>>>>> I also tested a reboot after this and everything still worked.
>>>>>
>>>>>
>>>>> The old SSD was 120GB and the new one is 256GB (cloning took around 4
>>>>> minutes).
>>>>> The delta of data was very low because it was a test cluster.
>>>>>
>>>>> All in all, the OSDs in question were 'down' for only 5 minutes (so I
>>>>> stayed within the default mon_osd_down_out_interval of 10 minutes and
>>>>> didn't actually need to set noout :)
>>>>>
>>>>
>>>> I kicked off a brief discussion about this with some of the BlueStore
>>>> guys and they're aware of the problem with migrating across SSDs, but so
>>>> far it's just a Trello card:
>>>> https://trello.com/c/9cxTgG50/324-bluestore-add-remove-resize-wal-db
>>>> They do confirm you should be okay with dd'ing things across, assuming
>>>> symlinks get set up correctly as David noted.
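>>>>
>>>> (To expand on the symlink point: ceph-disk's block/block.db symlinks go via
>>>> /dev/disk/by-partuuid/, and a dd/ddrescue clone copies the GPT, including
>>>> the partition GUIDs, verbatim, so those links should keep resolving on the
>>>> new SSD. Something like
>>>>
>>>>   blkid -s PARTUUID -o value /dev/<new-ssd-partition>
>>>>
>>>> run against the clone can confirm the PARTUUIDs match the old device.)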
>>>>
>>>>
>>> Great that it is on the radar to be addressed. This method does feel hacky.
>>>
>>>
>>>> I've got some other bad news, though: BlueStore has internal metadata
>>>> about the size of the block device it's using, so if you copy it onto a
>>>> larger block device, it will not actually make use of the additional space.
>>>> :(
>>>> -Greg
>>>>
>>>
>>> Yes, I was well aware of that, no problem. The reason is that the smaller
>>> SSD sizes are simply not being made anymore or have been discontinued by
>>> the manufacturer.
>>> It would be nice, though, if the DB could be resized in the future; the
>>> default 1GB DB size seems very small to me.
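>>>
>>> (For what it's worth, the size BlueStore recorded for a device can be read
>>> back from its on-disk label with something like
>>>
>>>   ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-0/block.db
>>>
>>> the JSON it prints includes a "size" field, so after a clone you can confirm
>>> it still reflects the original partition size.)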
>>>
>>> Caspar
>>>
>>>
>>>>
>>>>
>>>>>
>>>>> Kind regards,
>>>>> Caspar
>>>>>
>>>>>
>>>>>
>>>>>> Nico, it is not possible to change the WAL or DB size, location, etc.
>>>>>> after OSD creation. If you want to change the configuration of the OSD
>>>>>> after creation, you have to remove it from the cluster and recreate it.
>>>>>> There is no equivalent of the way you could move, recreate, etc. filestore
>>>>>> OSD journals. I think this might be on the radar as a feature, but I don't
>>>>>> know for certain. I definitely consider it to be a regression in
>>>>>> BlueStore.
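>>>>>>
>>>>>> Roughly, the remove-and-recreate path for a single OSD on Luminous looks
>>>>>> something like the sketch below (IDs and device names are placeholders,
>>>>>> adapt as needed):
>>>>>>
>>>>>>   ceph osd out <id>              # then wait for backfill to finish
>>>>>>   systemctl stop ceph-osd@<id>
>>>>>>   ceph osd purge <id> --yes-i-really-mean-it
>>>>>>   ceph-volume lvm zap /dev/sdX
>>>>>>   ceph-volume lvm create --bluestore --data /dev/sdX --block.db <new DB partition>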
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Feb 23, 2018, 9:13 AM Nico Schottelius <
>>>>>> nico.schottel...@ungleich.ch> wrote:
>>>>>>
>>>>>>>
>>>>>>> A very interesting question, and I would add a follow-up question:
>>>>>>>
>>>>>>> Is there an easy way to add an external DB/WAL device to an existing
>>>>>>> OSD?
>>>>>>>
>>>>>>> I suspect that it might be something along the lines of:
>>>>>>>
>>>>>>> - stop osd
>>>>>>> - create a link in ...ceph/osd/ceph-XX/block.db to the target device
>>>>>>> - (maybe run some kind of osd mkfs ?)
>>>>>>> - start osd
>>>>>>>
>>>>>>> Has anyone done this so far, or does anyone have recommendations on
>>>>>>> how to do it?
>>>>>>>
>>>>>>> Which also makes me wonder: what is actually the format of the WAL and
>>>>>>> block DB in BlueStore? Is there any documentation available about it?
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Nico
>>>>>>>
>>>>>>>
>>>>>>> Caspar Smit <caspars...@supernas.eu> writes:
>>>>>>>
>>>>>>> > Hi All,
>>>>>>> >
>>>>>>> > What would be the proper way to preventively replace a DB/WAL SSD
>>>>>>> > (when it is nearing its DWPD/TBW limit and has not failed yet)?
>>>>>>> >
>>>>>>> > It hosts DB partitions for 5 OSDs.
>>>>>>> >
>>>>>>> > Maybe something like:
>>>>>>> >
>>>>>>> > 1) 'ceph osd reweight' the 5 OSDs to 0
>>>>>>> > 2) let backfilling complete
>>>>>>> > 3) destroy/remove the 5 OSDs
>>>>>>> > 4) replace the SSD
>>>>>>> > 5) create 5 new OSDs with separate DB partitions on the new SSD
>>>>>>> >
>>>>>>> > When these 5 OSDs are big HDDs (8TB), a LOT of data has to be moved,
>>>>>>> > so I thought maybe the following would work:
>>>>>>> >
>>>>>>> > 1) ceph osd set noout
>>>>>>> > 2) stop the 5 OSDs (systemctl stop)
>>>>>>> > 3) 'dd' the old SSD to a new SSD of the same or bigger size
>>>>>>> > 4) remove the old SSD
>>>>>>> > 5) start the 5 OSDs (systemctl start)
>>>>>>> > 6) let backfilling/recovery complete (only the delta of data between
>>>>>>> >    the OSD stop and now)
>>>>>>> > 7) ceph osd unset noout
>>>>>>> >
>>>>>>> > Would this be a viable method to replace a DB SSD? Is there any
>>>>>>> > udev/serial nr/UUID stuff preventing this from working?
>>>>>>> >
>>>>>>> > Or is there another 'less hacky' way to replace a DB SSD without
>>>>>>> > moving too much data?
>>>>>>> >
>>>>>>> > Kind regards,
>>>>>>> > Caspar
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Modern, affordable, Swiss Virtual Machines. Visit
>>>>>>> www.datacenterlight.ch
>>>>>>>
>>>>>
>>>>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
