> On May 27, 2025, at 12:19 PM, Ryan Rempel <rgrem...@cmu.ca> wrote:
> 
> I'm expanding a small Ceph cluster from 4 nodes to 5 nodes. The new node is a 
> bit more sophisticated than the others, since it has some SSD storage that 
> I'd like to use for DB+WAL (which I haven't done before, it has just been 
> rotational disks).
> 
> I'm using cephadm for orchestration, and normally add osds via "ceph orch 
> daemon add osd". I prefer to add the osds in this "manual" way (rather than 
> "ceph orch apply" with a spec) mainly because my infrastructure is not 
> uniform (for better or worse, I'm working with hardware that becomes 
> available in different ways over time, as I gradually upgrade things and add 
> things).
> 
> Looking at this page:
> 
> https://docs.ceph.com/en/squid/cephadm/services/osd/
> 
> ... it isn't entirely clear to me whether it's possible to specify a separate 
> DB device when using the "ceph orch daemon add osd" procedure. There is a 
> description of how to do it with a service spec, but how you would specify 
> the DB device for "ceph orch daemon add osd" does not appear to be described.
> 
> So, my first question is whether it's possible to specify a separate DB via 
> "ceph orch daemon add osd"?

I believe it is, though I don’t have the syntax to hand.
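If memory serves, the raw form accepts the same device filters as a spec, as key=value pairs after the hostname; something like this (hostname and device paths are placeholders, so check `ceph orch daemon add osd -h` before running it):

```shell
# Hedged sketch -- hostname and device paths are hypothetical.
# Data on the HDD, DB (and implicitly WAL) on the SSD:
ceph orch daemon add osd newhost:data_devices=/dev/sdb,db_devices=/dev/sdc
```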


> If not, I'll need to explore the service spec approach. I suppose I can use 
> the "unmanaged: true" option in the spec (to keep it as "manual" as possible).

I do suggest leaving OSD services unmanaged except when you’re actively using 
them, so that e.g. when you zap an OSD for replacement, the old / bad drive 
isn’t automatically redeployed.

An OSD service won’t mess with existing OSDs, so you don’t have to worry about 
applying a spec that differs from them.  Use the --dry-run flag before 
applying a new spec to ensure that the effect is what you want.
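For example, assuming the spec is saved as osd-spec.yaml (filename made up):

```shell
# Preview what cephadm would deploy, without touching anything:
ceph orch apply -i osd-spec.yaml --dry-run

# Recent releases can also flip a deployed service's managed state directly
# (service name hypothetical):
ceph orch set-unmanaged osd.hdd_with_ssd_db
ceph orch set-managed osd.hdd_with_ssd_db
```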

> 
> The remaining puzzle is how to use the SSD as DB+WAL for more than one OSD. 
> At this point, the SSD is the raw device — I haven't done anything with it 
> manually in LVM or whatever. In the service spec description above, I see 
> that there is a "db_slots" key. So, I suppose that I could specify the 
> "whole" SSD and provide for the number of slots?

That’s the idea.
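A minimal sketch of that kind of spec, assuming the new node is the only match (hostname and service_id are made up):

```yaml
service_type: osd
service_id: hdd_with_ssd_db
placement:
  host_pattern: newhost        # assumption: your new node's hostname
spec:
  data_devices:
    rotational: 1              # the HDDs become OSDs
  db_devices:
    rotational: 0              # the SSD holds the DBs
  db_slots: 4                  # SSD carved into 4 equal DB+WAL slots
  filter_logic: AND
  objectstore: bluestore
```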

> However, I don't necessarily want every slot to be the same size (because of 
> my unfortunately heterogeneous hardware).

You can have multiple OSD specs.  In the docs, scroll down to the advanced OSD 
spec section for examples.  You can constrain each spec to particular hosts 
and devices.

There’s a lot of flexibility.  The slots on a given OSD host will be the same 
size, though, right?

Here’s an example of a cluster being retrofitted.  First a couple of new nodes 
were added with SATA/SAS SSDs for HDD WAL+DB offload, then additional HDDs and 
offload SSDs were added to the existing nodes along with a couple of NVMe SSDs 
to be used only for CephFS metadata, not for offload.

# This describes the prior strategy.  Note the use of the `size` and
# `rotational` attributes to prevent the smaller SSDs from being used
# as OSDs.  This host_pattern started out as '*' and was whittled down
# as each host was migrated.
---
service_type: osd
service_id: cost_capacity
service_name: osd.cost_capacity
placement:
  host_pattern: host24 
spec:
  data_devices:
    rotational: 1
    size: '18T:'
  filter_logic: AND
  objectstore: bluestore
---
# This spec matches only the 1.9TB NVMe SSDs to be
# used as OSDs for the CephFS metadata pool
# Here again we use `rotational` and `size` to constrain application
# since the WAL+DB offload SSDs are 2TB

service_type: osd
service_id: dashboard-admin-1705602677615
service_name: osd.dashboard-admin-1705602677615
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 0
    size: 490G:1200G
  filter_logic: AND
  objectstore: bluestore
---
# Here hybrid OSDs are deployed on specific
# device names, which isn’t ideal in general because
# the names may change

service_type: osd
service_id: osd.hybrid
service_name: osd.osd.hybrid
unmanaged: true
placement:
  hosts:
  - host1701
spec:
  block_db_size: 384075772723
  data_devices:
    paths:
    - /dev/sdc
    - /dev/sdd
    - /dev/sde
    - /dev/sdf
    - /dev/sdg
  db_devices:
    paths:
    - /dev/sdac
  db_slots: 5
  filter_logic: AND
  objectstore: bluestore
---


> So, I also see that there is a "block_db_size" and "block_wal_size". But it's 
> unclear how this relates to "db_slots" —  which one would determine how the 
> SSD is sliced up?

In general, ignore the WAL settings: if you specify only a DB device, the WAL 
rides along with the DB by default.  In theory block_db_size and db_slots are 
either/or.  You could conceivably combine them on an SSD that serves both for 
offload and other purposes, though I wouldn’t recommend that.

> I'd actually be happy to pre-slice the SSD (e.g. with  LVM) and then directly 
> specify which SSD slice is the DB+WAL for which OSD, if that's a feasible 
> approach.

That works.  Here’s a one-off script I’ve used to do this.  I know, no error 
checking.  The args are the IDs of five existing HDD OSDs followed by the 
offload device’s name (without /dev/, e.g. sdx).  20% of the device is used 
for each OSD’s offload.


#!/bin/bash
# Args: five OSD IDs, then the offload device name (e.g. sdx).

ceph osd set noscrub
ceph osd set nodeep-scrub
sleep 15

VG=$(uname -n)-$6-db
vgcreate $VG /dev/$6

# One LV per OSD, each 20% of the VG
for i in $1 $2 $3 $4 $5 ; do lvcreate -l 20%VG -n ceph-osd$i-db $VG ; done

vgdisplay $VG

for i in $1 $2 $3 $4 $5 ; do ceph osd add-noout $i ; done

CFSID=$(ceph fsid)

for i in $1 $2 $3 $4 $5
do
    systemctl stop ceph-$CFSID@osd.$i
    date
    echo ceph-volume lvm new-db --osd-id $i --osd-fsid $(ceph osd find $i | jq -r .osd_fsid) --target $VG/ceph-osd$i-db \; exit | cephadm shell --name osd.$i
    date
    echo ceph-volume lvm migrate --osd-id $i --osd-fsid $(ceph osd find $i | jq -r .osd_fsid) --target $VG/ceph-osd$i-db --from data \; exit | cephadm shell --name osd.$i
    date
    systemctl start ceph-$CFSID@osd.$i
done

# Once the OSDs are verified healthy: `ceph osd rm-noout <id>` for each,
# then unset noscrub and nodeep-scrub.


> Though, I'd still be interested in knowing whether I need to set something 
> for "block_db_size" and "block_wal_size", or whether it's enough to just 
> actually make a certain size of LVM volume available for DB+WAL.

If you pre-create the LVs, then you don’t need the size params.
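For reference, the bare ceph-volume form for a brand-new OSD with a pre-created DB LV looks roughly like this (VG/LV names are made up; run it inside `cephadm shell`):

```shell
# Data on the whole HDD, DB+WAL on a pre-sliced LV; no size params needed
# because the LV's own size is the DB's size.
ceph-volume lvm prepare --bluestore \
  --data /dev/sdd \
  --block.db ssd-vg/osd-sdd-db
```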

> 
> Normally I'd just experiment, but that might be disruptive to the working 
> cluster. I guess I could at least turn off rebalancing while I try things out?

Yes, turn off rebalancing so that you can validate the results, in case you 
have to zap and start over.  And use --dry-run a lot, and leave the OSD 
service(s) unmanaged except when you’re actively using them.
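The flags I’d set while experimenting (all reversible):

```shell
# Pause data movement and scrubbing while you experiment:
ceph osd set norebalance
ceph osd set nobackfill
ceph osd set norecover

# ... try things out, validating with --dry-run and `ceph orch ps` ...

# Then let the cluster resume:
ceph osd unset norecover
ceph osd unset nobackfill
ceph osd unset norebalance
```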

> 
> The other documentation I'm now reading is the documentation for ceph-volume, 
> which appears to be related:
> 
> https://docs.ceph.com/en/squid/ceph-volume/lvm/batch/
> 
> It mentions, for instance, things like db_slots and block_db_size. The 
> implication is that db_slots is an alternative to block_db_size — that you 
> wouldn't specify both, for instance.

In most cases I would agree.

> 
> I'm also reading the ceph-volume docs for "prepare". I suppose if I find that 
> more suitable, it might be possible to "prepare" an OSD with ceph-volume and 
> then "adopt" it with cephadm?

There might be snags with that approach.  Adoption, I think, is intended for 
legacy (pre-cephadm) OSDs.

> 
> Well, just writing the email has given me a bit more clarity about things to 
> try, but I'd certainly be happy for any guidance.
> 
> 
> Ryan Rempel
> 
> Director of Information Technology
> 
> Canadian Mennonite University
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io