Gotcha. As a side note, that setting is only used by ceph-disk, since ceph-volume does not create partitions for the WAL or DB. When creating OSDs with ceph-volume, you need to create those partitions manually if you're using anything other than a whole block device.
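For example, a minimal sketch of doing that by hand; /dev/nvme0n1 (DB SSD), /dev/sdb (data disk) and the 40GB size are just placeholder assumptions, adjust for your hardware:

  # carve a 40GB partition out of the SSD for the DB
  sgdisk --new=1:0:+40G /dev/nvme0n1

  # create the OSD, pointing --block.db at that partition
  ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1

An LVM logical volume on the SSD works for --block.db as well, if you prefer LVs over raw GPT partitions.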
On Tue, Feb 27, 2018 at 8:20 AM Caspar Smit <caspars...@supernas.eu> wrote:

> David,
>
> Yes, I know, I use 20GB partitions for 2TB disks as journal. It was just to inform other people that Ceph's default of 1GB is pretty low.
> Now that I read my own sentence it indeed looks as if I was using 1GB partitions, sorry for the confusion.
>
> Caspar
>
> 2018-02-27 14:11 GMT+01:00 David Turner <drakonst...@gmail.com>:
>
>> If you're only using a 1GB DB partition, there is a very real possibility it's already 100% full. The safe estimate for DB size seems to be 10GB per 1TB, so for a 4TB OSD a 40GB DB should work for most use cases (except loads and loads of small files). There are a few threads that mention how to check how much of your DB partition is in use. Once it's full, it spills over to the HDD.
>>
>> On Tue, Feb 27, 2018, 6:19 AM Caspar Smit <caspars...@supernas.eu> wrote:
>>
>>> 2018-02-26 23:01 GMT+01:00 Gregory Farnum <gfar...@redhat.com>:
>>>
>>>> On Mon, Feb 26, 2018 at 3:23 AM Caspar Smit <caspars...@supernas.eu> wrote:
>>>>
>>>>> 2018-02-24 7:10 GMT+01:00 David Turner <drakonst...@gmail.com>:
>>>>>
>>>>>> Caspar, it looks like your idea should work. Worst case scenario seems to be that the OSD wouldn't start, you'd put the old SSD back in and go back to the idea of weighting them to 0, backfilling, then recreating the OSDs. Definitely worth a try in my opinion, and I'd love to hear about your experience afterwards.
>>>>>
>>>>> Hi David,
>>>>>
>>>>> First of all, thank you for ALL your answers on this ML, you're really putting a lot of effort into answering many questions asked here and very often they contain invaluable information.
>>>>>
>>>>> To follow up on this post I went out and built a very small (proxmox) cluster (3 OSDs per host) to test my suggestion of cloning the DB/WAL SSD.
>>>>> And it worked!
>>>>> Note: this was on Luminous v12.2.2 (all BlueStore, ceph-disk based OSDs).
>>>>>
>>>>> Here's what I did on 1 node:
>>>>>
>>>>> 1) ceph osd set noout
>>>>> 2) systemctl stop osd.0; systemctl stop osd.1; systemctl stop osd.2
>>>>> 3) ddrescue -f -n -vv <old SSD dev> <new SSD dev> /root/clone-db.log
>>>>> 4) removed the old SSD physically from the node
>>>>> 5) checked with "ceph -s" and already saw HEALTH_OK and all OSDs up/in
>>>>> 6) ceph osd unset noout
>>>>>
>>>>> I assume that once the ddrescue step is finished a 'partprobe' or something similar is triggered and udev finds the DB partitions on the new SSD and starts the OSDs again (kind of what happens during hotplug), so it is probably better to clone the SSD in another (non-Ceph) system to avoid triggering any udev events.
>>>>>
>>>>> I also tested a reboot after this and everything still worked.
>>>>>
>>>>> The old SSD was 120GB and the new one is 256GB (cloning took around 4 minutes). The delta of data was very low because it was a test cluster.
>>>>>
>>>>> All in all, the OSDs in question were 'down' for only 5 minutes (so I stayed within the default 10-minute mon_osd_down_out_interval and didn't actually need to set noout :)
>>>>
>>>> I kicked off a brief discussion about this with some of the BlueStore guys and they're aware of the problem with migrating across SSDs, but so far it's just a Trello card:
>>>> https://trello.com/c/9cxTgG50/324-bluestore-add-remove-resize-wal-db
>>>> They do confirm you should be okay with dd'ing things across, assuming symlinks get set up correctly as David noted.
>>>
>>> Great that it is on the radar to be addressed. This method feels hacky.
>>>
>>>> I've got some other bad news, though: BlueStore has internal metadata about the size of the block device it's using, so if you copy it onto a larger block device, it will not actually make use of the additional space. :(
>>>> -Greg
>>>
>>> Yes, I was well aware of that, no problem. The reason is that the smaller SSD sizes are simply not being made anymore or have been discontinued by the manufacturer.
>>> It would be nice though if the DB could be resized in the future; the default 1GB DB size seems very small to me.
>>>
>>> Caspar
>>>
>>>>>
>>>>> Kind regards,
>>>>> Caspar
>>>>>
>>>>>> Nico, it is not possible to change the WAL or DB size, location, etc. after OSD creation. If you want to change the configuration of the OSD after creation, you have to remove it from the cluster and recreate it. There is no equivalent of the way you could move, recreate, etc. FileStore OSD journals. I think this might be on the radar as a feature, but I don't know for certain. I definitely consider it to be a regression in BlueStore.
>>>>>>
>>>>>> On Fri, Feb 23, 2018, 9:13 AM Nico Schottelius <nico.schottel...@ungleich.ch> wrote:
>>>>>>
>>>>>>> A very interesting question, and I would add the follow-up question:
>>>>>>>
>>>>>>> Is there an easy way to add external DB/WAL devices to an existing OSD?
>>>>>>>
>>>>>>> I suspect that it might be something along the lines of:
>>>>>>>
>>>>>>> - stop osd
>>>>>>> - create a link in ...ceph/osd/ceph-XX/block.db to the target device
>>>>>>> - (maybe run some kind of osd mkfs?)
>>>>>>> - start osd
>>>>>>>
>>>>>>> Has anyone done this so far, or does anyone have recommendations on how to do it?
>>>>>>>
>>>>>>> Which also makes me wonder: what is actually the format of the WAL and block DB in BlueStore? Is there any documentation available about it?
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Nico
>>>>>>>
>>>>>>> Caspar Smit <caspars...@supernas.eu> writes:
>>>>>>>
>>>>>>> > Hi All,
>>>>>>> >
>>>>>>> > What would be the proper way to preventively replace a DB/WAL SSD (when it is nearing its DWPD/TBW limit and has not failed yet)?
>>>>>>> >
>>>>>>> > It hosts DB partitions for 5 OSDs.
>>>>>>> >
>>>>>>> > Maybe something like:
>>>>>>> >
>>>>>>> > 1) ceph osd reweight the 5 OSDs to 0
>>>>>>> > 2) let backfilling complete
>>>>>>> > 3) destroy/remove the 5 OSDs
>>>>>>> > 4) replace the SSD
>>>>>>> > 5) create 5 new OSDs with separate DB partitions on the new SSD
>>>>>>> >
>>>>>>> > When these 5 OSDs are big HDDs (8TB), a LOT of data has to be moved, so I thought maybe the following would work:
>>>>>>> >
>>>>>>> > 1) ceph osd set noout
>>>>>>> > 2) stop the 5 OSDs (systemctl stop)
>>>>>>> > 3) 'dd' the old SSD to a new SSD of the same or bigger size
>>>>>>> > 4) remove the old SSD
>>>>>>> > 5) start the 5 OSDs (systemctl start)
>>>>>>> > 6) let backfilling/recovery complete (only the delta of data between OSD stop and now)
>>>>>>> > 7) ceph osd unset noout
>>>>>>> >
>>>>>>> > Would this be a viable method to replace a DB SSD? Any udev/serial nr/uuid stuff preventing this from working?
>>>>>>> >
>>>>>>> > Or is there another 'less hacky' way to replace a DB SSD without moving too much data?
>>>>>>> >
>>>>>>> > Kind regards,
>>>>>>> > Caspar
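As a side note on the "how full is my DB" question that came up earlier in the thread, a rough sketch of checking it on the OSD host via the admin socket (osd.0 here is just an example id, adjust as needed):

  # dump the BlueFS counters for this OSD
  ceph daemon osd.0 perf dump bluefs

Compare db_used_bytes against db_total_bytes; if slow_used_bytes is non-zero, the DB has already spilled over onto the slow (HDD) device.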
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com