> On May 27, 2025, at 12:19 PM, Ryan Rempel <rgrem...@cmu.ca> wrote:
>
> I'm expanding a small Ceph cluster from 4 nodes to 5 nodes. The new node is a
> bit more sophisticated than the others, since it has some SSD storage that
> I'd like to use for DB+WAL (which I haven't done before; it has just been
> rotational disks).
>
> I'm using cephadm for orchestration, and normally add OSDs via "ceph orch
> daemon add osd". I prefer to add the OSDs in this "manual" way (rather than
> "ceph orch apply" with a spec) mainly because my infrastructure is not
> uniform (for better or worse, I'm working with hardware that becomes
> available in different ways over time, as I gradually upgrade things and add
> things).
>
> Looking at this page:
>
> https://docs.ceph.com/en/squid/cephadm/services/osd/
>
> ... it isn't entirely clear to me whether it's possible to specify a separate
> DB device when using the "ceph orch daemon add osd" procedure. There is a
> description of how to do it with a service spec, but how you would specify
> the DB device for "ceph orch daemon add osd" does not appear to be described.
>
> So, my first question is whether it's possible to specify a separate DB via
> "ceph orch daemon add osd"?
I believe it is, though I don't have the syntax to hand.

> If not, I'll need to explore the service spec approach. I suppose I can use
> the "unmanaged: true" option in the spec (to keep it as "manual" as possible).

I do suggest leaving OSD services unmanaged except when you're actively using
them, so that, e.g., when you zap an OSD for replacement, the old/bad drive
isn't automatically redeployed. An OSD service won't mess with existing OSDs,
so you don't have to worry about applying a spec that differs from the
existing OSDs. Use the --dry-run flag before applying a new spec to ensure
that the effect is what you want.

> The remaining puzzle is how to use the SSD as DB+WAL for more than one OSD.
> At this point, the SSD is the raw device -- I haven't done anything with it
> manually in LVM or whatever. In the service spec description above, I see
> that there is a "db_slots" key. So, I suppose that I could specify the
> "whole" SSD and provide for the number of slots?

That's the idea.

> However, I don't necessarily want every slot to be the same size (because of
> my unfortunately heterogeneous hardware).

You can have multiple OSD specs; in the docs, scroll down to the advanced OSD
spec section for examples. You can constrain each spec to particular hosts or
devices; there's a lot of flexibility. The slots carved from a given device
will all be the same size, though.

Here's an example of a cluster being retrofitted. First, a couple of new nodes
were added with SATA/SAS SSDs for HDD WAL+DB offload; then additional HDDs and
offload SSDs were added to the existing nodes, along with a couple of NVMe
SSDs to be used only for CephFS metadata, not for offload.

# This describes the prior strategy. Note the use of the `size` and
# `rotational` attributes to prevent the smaller SSDs from being used
# as OSDs.
# This host_pattern started out as * and was whittled down
# as each host was migrated.
---
service_type: osd
service_id: cost_capacity
service_name: osd.cost_capacity
placement:
  host_pattern: host24
spec:
  data_devices:
    rotational: 1
    size: '18T:'
  filter_logic: AND
  objectstore: bluestore
---
# This spec matches only the 1.9TB NVMe SSDs to be
# used as OSDs for the CephFS metadata pool.
# Here again we use `rotational` and `size` to constrain application,
# since the WAL+DB offload SSDs are 2TB.
service_type: osd
service_id: dashboard-admin-1705602677615
service_name: osd.dashboard-admin-1705602677615
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 0
    size: 490G:1200G
  filter_logic: AND
  objectstore: bluestore
---
# Here hybrid OSDs are deployed on specific
# device names, which isn't ideal in general because
# the names may change.
service_type: osd
service_id: osd.hybrid
service_name: osd.osd.hybrid
unmanaged: true
placement:
  hosts:
    - host1701
spec:
  block_db_size: 384075772723
  data_devices:
    paths:
      - /dev/sdc
      - /dev/sdd
      - /dev/sde
      - /dev/sdf
      - /dev/sdg
  db_devices:
    paths:
      - /dev/sdac
  db_slots: 5
  filter_logic: AND
  objectstore: bluestore
---

> So, I also see that there is a "block_db_size" and "block_wal_size". But it's
> unclear how this relates to "db_slots" -- which one would determine how the
> SSD is sliced up?

In general, ignore the WAL and it'll default to riding along with the DB. In
theory block_db_size and db_slots might be either/or, or perhaps there might
be an SSD that serves both for offload and for other purposes, though I
wouldn't recommend that.

> I'd actually be happy to pre-slice the SSD (e.g. with LVM) and then directly
> specify which SSD slice is the DB+WAL for which OSD, if that's a feasible
> approach.

That works. Here's a one-off script I've used to do this. I know, no error
checking. The args are the IDs of five existing HDD OSDs, followed by the
name of the offload block device (e.g. sdx); 20% of the device is used for
each offload.
#!/bin/bash
ceph osd set noscrub
ceph osd set nodeep-scrub
sleep 15

VG=$(uname -n)-$6-db
vgcreate $VG /dev/$6
for i in $1 $2 $3 $4 $5 ; do
    lvcreate -l 20%VG -n ceph-osd$i-db $VG
done
vgdisplay $VG

for i in $1 $2 $3 $4 $5 ; do
    ceph osd add-noout $i
done

CFSID=$(ceph fsid)
for i in $1 $2 $3 $4 $5
do
    systemctl stop ceph-$CFSID@osd.$i
    date
    echo ceph-volume lvm new-db --osd-id $i --osd-fsid $(ceph osd find $i | jq -r .osd_fsid) --target $VG/ceph-osd$i-db \; exit | cephadm shell --name osd.$i
    date
    echo ceph-volume lvm migrate --osd-id $i --osd-fsid $(ceph osd find $i | jq -r .osd_fsid) --target $VG/ceph-osd$i-db --from data \; exit | cephadm shell --name osd.$i
    date
    systemctl start ceph-$CFSID@osd.$i
done

> Though, I'd still be interested in knowing whether I need to set something
> for "block_db_size" and "block_wal_size", or whether it's enough to just
> actually make a certain size of LVM volume available for DB+WAL.

If you pre-create, then you don't need the size params.

> Normally I'd just experiment, but that might be disruptive to the working
> cluster. I guess I could at least turn off rebalancing while I try things out?

Yes, turn off rebalancing so that you can validate the results, in case you
have to zap and start over. And use --dry-run a lot, and leave the OSD
service(s) unmanaged except when you're actively using them.

> The other documentation I'm now reading is the documentation for ceph-volume,
> which appears to be related:
>
> https://docs.ceph.com/en/squid/ceph-volume/lvm/batch/
>
> It mentions, for instance, things like db_slots and block_db_size. The
> implication is that db_slots is an alternative to block_db_size -- that you
> wouldn't specify both, for instance.

In most cases I would agree.

> I'm also reading the ceph-volume docs for "prepare". I suppose if I find that
> more suitable, it might be possible to "prepare" an OSD with ceph-volume and
> then "adopt" it with cephadm?

There might be snags with that approach.
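Incidentally, the relationship between db_slots, block_db_size, and the
script's 20% LVs is just arithmetic. A sketch (Python; the function name and
device sizes are hypothetical, not a Ceph API, and it assumes an explicit
block_db_size would win if both were given):

```python
def db_size_per_osd(device_bytes, db_slots=None, block_db_size=None):
    """Per-OSD DB size carved from a single offload device (illustrative only)."""
    if block_db_size is not None:
        # An explicit size means each OSD gets exactly this much.
        return block_db_size
    if db_slots:
        # Otherwise the device is divided evenly into db_slots slices.
        return device_bytes // db_slots
    raise ValueError("specify db_slots or block_db_size")

ssd = 2_000_000_000_000  # hypothetical 2 TB offload SSD

# db_slots=5 mirrors the script above (five LVs at 20% of the VG each):
print(db_size_per_osd(ssd, db_slots=5))                      # 400000000000
# block_db_size as in the hybrid spec example:
print(db_size_per_osd(ssd, block_db_size=384_075_772_723))   # 384075772723
```

Either way you end up with equal-sized slices per device, which is why
differently-sized slots need either multiple specs or pre-created LVs.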
Adoption, I think, is intended for legacy OSDs.

> Well, just writing the email has given me a bit more clarity about things to
> try, but I'd certainly be happy for any guidance.
>
> Ryan Rempel
>
> Director of Information Technology
>
> Canadian Mennonite University
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io