David,

Yes, I know; I use 20GB partitions as DB/journal for my 2TB disks. I only meant to point out to other people that Ceph's default of 1GB is pretty low. Now that I reread my own sentence, it does indeed look as if I were using 1GB partitions myself, sorry for the confusion.
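For anyone who wants to check how full an existing DB partition actually is: the OSD admin socket exposes the BlueFS counters. Roughly something like this (run on the OSD host; osd.0 is just an example id, and jq is assumed to be installed):

  # show DB capacity/usage and any spillover onto the slow (HDD) device, in bytes
  ceph daemon osd.0 perf dump | jq '.bluefs | {db_total_bytes, db_used_bytes, slow_used_bytes}'

A slow_used_bytes value above zero would mean the DB has already spilled over onto the HDD.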
Caspar

2018-02-27 14:11 GMT+01:00 David Turner <drakonst...@gmail.com>:

> If you're only using a 1GB DB partition, there is a very real possibility it's already 100% full. The safe estimate for DB size seems to be 10GB/1TB, so for a 4TB OSD a 40GB DB should work for most use cases (except loads and loads of small files). There are a few threads that mention how to check how much of your DB partition is in use. Once it's full, it spills over to the HDD.
>
> On Tue, Feb 27, 2018, 6:19 AM Caspar Smit <caspars...@supernas.eu> wrote:
>
>> 2018-02-26 23:01 GMT+01:00 Gregory Farnum <gfar...@redhat.com>:
>>
>>> On Mon, Feb 26, 2018 at 3:23 AM Caspar Smit <caspars...@supernas.eu> wrote:
>>>
>>>> 2018-02-24 7:10 GMT+01:00 David Turner <drakonst...@gmail.com>:
>>>>
>>>>> Caspar, it looks like your idea should work. Worst case scenario seems to be that the OSD wouldn't start; you'd put the old SSD back in and go back to the idea of weighting them to 0, backfilling, then recreating the OSDs. Definitely worth a try in my opinion, and I'd love to hear about your experience after.
>>>>
>>>> Hi David,
>>>>
>>>> First of all, thank you for ALL your answers on this ML; you're really putting a lot of effort into answering the many questions asked here, and they very often contain invaluable information.
>>>>
>>>> To follow up on this post, I went out and built a very small (proxmox) cluster (3 OSDs per host) to test my suggestion of cloning the DB/WAL SSD. And it worked!
>>>> Note: this was on Luminous v12.2.2 (all BlueStore, ceph-disk based OSDs).
>>>>
>>>> Here's what I did on 1 node:
>>>>
>>>> 1) ceph osd set noout
>>>> 2) systemctl stop osd.0; systemctl stop osd.1; systemctl stop osd.2
>>>> 3) ddrescue -f -n -vv <old SSD dev> <new SSD dev> /root/clone-db.log
>>>> 4) removed the old SSD physically from the node
>>>> 5) checked with "ceph -s" and already saw HEALTH_OK and all OSDs up/in
>>>> 6) ceph osd unset noout
>>>>
>>>> I assume that once the ddrescue step finishes, a 'partprobe' or something similar is triggered and udev finds the DB partitions on the new SSD and starts the OSDs again (kind of what happens during hotplug). So it is probably better to clone the SSD in another (non-Ceph) system to avoid triggering any udev events.
>>>>
>>>> I also tested a reboot after this and everything still worked.
>>>>
>>>> The old SSD was 120GB and the new one is 256GB (cloning took around 4 minutes). The delta of data was very low because it was a test cluster.
>>>>
>>>> All in all, the OSDs in question were 'down' for only 5 minutes, so I stayed within the mon_osd_down_out_interval default of 10 minutes and didn't actually need to set noout :)
>>>
>>> I kicked off a brief discussion about this with some of the BlueStore guys and they're aware of the problem with migrating across SSDs, but so far it's just a Trello card: https://trello.com/c/9cxTgG50/324-bluestore-add-remove-resize-wal-db
>>> They do confirm you should be okay with dd'ing things across, assuming symlinks get set up correctly as David noted.
>>
>> Great that it is on the radar to be addressed. This method feels hacky.
>>
>>> I've got some other bad news, though: BlueStore has internal metadata about the size of the block device it's using, so if you copy it onto a larger block device, it will not actually make use of the additional space. :(
>>> -Greg
>>
>> Yes, I was well aware of that, no problem. The reason was that the smaller SSD sizes are simply not being made anymore or have been discontinued by the manufacturer. It would be nice, though, if the DB could be resized in the future; the default 1GB DB size seems very small to me.
>>
>> Caspar
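Following David's 10GB-per-1TB estimate above, this is roughly how a suitably sized DB partition could be created when (re)building the OSDs with ceph-disk, as used in this thread. A sketch only: the device names are hypothetical and 40GB is just the example figure for a 4TB OSD.

  # ceph.conf on the OSD host, before creating the OSDs:
  [global]
  # 40 GiB, in bytes
  bluestore_block_db_size = 42949672960

  # let ceph-disk carve a DB partition of that size on the shared SSD
  # (/dev/sdf = data HDD, /dev/sdb = shared DB SSD; both names hypothetical)
  ceph-disk prepare --bluestore --block.db /dev/sdb /dev/sdf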
>>>>
>>>> Kind regards,
>>>> Caspar
>>>>
>>>>> Nico, it is not possible to change the WAL or DB size, location, etc. after OSD creation. If you want to change the configuration of the OSD after creation, you have to remove it from the cluster and recreate it. There is no functionality similar to how you could move, recreate, etc. FileStore OSD journals. I think this might be on the radar as a feature, but I don't know for certain. I definitely consider it to be a regression of BlueStore.
>>>>>
>>>>> On Fri, Feb 23, 2018, 9:13 AM Nico Schottelius <nico.schottel...@ungleich.ch> wrote:
>>>>>
>>>>>> A very interesting question, and I would add the follow-up question:
>>>>>>
>>>>>> Is there an easy way to add an external DB/WAL device to an existing OSD?
>>>>>>
>>>>>> I suspect that it might be something along the lines of:
>>>>>>
>>>>>> - stop osd
>>>>>> - create a link in ...ceph/osd/ceph-XX/block.db to the target device
>>>>>> - (maybe run some kind of osd mkfs ?)
>>>>>> - start osd
>>>>>>
>>>>>> Has anyone done this so far, or does anyone have recommendations on how to do it?
>>>>>>
>>>>>> Which also makes me wonder: what is actually the format of the WAL and BlockDB in BlueStore? Is there any documentation available about it?
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Nico
>>>>>>
>>>>>> Caspar Smit <caspars...@supernas.eu> writes:
>>>>>>
>>>>>> > Hi All,
>>>>>> >
>>>>>> > What would be the proper way to preventively replace a DB/WAL SSD (when it is nearing its DWPD/TBW limit and has not failed yet)?
>>>>>> >
>>>>>> > It hosts DB partitions for 5 OSDs.
>>>>>> >
>>>>>> > Maybe something like:
>>>>>> >
>>>>>> > 1) ceph osd reweight 0 the 5 OSDs
>>>>>> > 2) let backfilling complete
>>>>>> > 3) destroy/remove the 5 OSDs
>>>>>> > 4) replace the SSD
>>>>>> > 5) create 5 new OSDs with separate DB partitions on the new SSD
>>>>>> >
>>>>>> > When these 5 OSDs are big HDDs (8TB), a LOT of data has to be moved, so I thought maybe the following would work:
>>>>>> >
>>>>>> > 1) ceph osd set noout
>>>>>> > 2) stop the 5 OSDs (systemctl stop)
>>>>>> > 3) 'dd' the old SSD to a new SSD of the same or bigger size
>>>>>> > 4) remove the old SSD
>>>>>> > 5) start the 5 OSDs (systemctl start)
>>>>>> > 6) let backfilling/recovery complete (only the delta of data between OSD stop and now)
>>>>>> > 7) ceph osd unset noout
>>>>>> >
>>>>>> > Would this be a viable method to replace a DB SSD? Is there any udev/serial nr/uuid stuff preventing this from working?
>>>>>> >
>>>>>> > Or is there another 'less hacky' way to replace a DB SSD without moving too much data?
>>>>>> >
>>>>>> > Kind regards,
>>>>>> > Caspar
>>>>>>
>>>>>> --
>>>>>> Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
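To summarize the 'dd/clone' variant from this thread as one block (a sketch only: the OSD ids and device paths are examples, and the systemd units are assumed to be the standard ceph-osd@<id> ones used by ceph-disk based setups):

  ceph osd set noout
  # stop the OSDs whose DB lives on the old SSD (example ids 0-4)
  systemctl stop ceph-osd@0 ceph-osd@1 ceph-osd@2 ceph-osd@3 ceph-osd@4
  # clone the old DB SSD onto the new, equal-or-larger SSD
  # (/dev/sdX = old SSD, /dev/sdY = new SSD; hypothetical names)
  ddrescue -f -n -vv /dev/sdX /dev/sdY /root/clone-db.log
  # physically remove the old SSD, then start the OSDs again
  # (udev may already have started them after the clone)
  systemctl start ceph-osd@0 ceph-osd@1 ceph-osd@2 ceph-osd@3 ceph-osd@4
  ceph -s
  ceph osd unset noout

Keep Greg's note above in mind: the extra space on a larger SSD will not be used, and cloning in another (non-Ceph) machine avoids udev kicking in halfway through.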
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com