> I think one of the scariest things about your setup is that there are only 4
> nodes (I'm assuming that means Ceph hosts carrying OSDs). I've been bouncing
> around different configurations lately between some of my deployment issues
> and cranky old hardware and I presently am down to 4 hosts with 1-2 OSDs per
> host. If even one of those hosts goes down, Ceph gets unhappy. If 2 are
> offline at once, Ceph goes into self-defense mode. I'd hate to think of 116
> OSDs at risk on a single host.
My sense is that from a cluster perspective it’s not so much a function of
the absolute number of OSDs that go down as the percentage of the cluster
that a host represents. If a cluster comprises 20x hosts each with 116 OSDs,
one going down is only 5% of the whole. One of the concerns is maintaining
enough space to recover that many OSDs’ worth of data, if
mon_osd_down_out_subtree_limit is not used to forestall most whole-host
recovery (config sketch below).

> I got curious about when LVM comes online, and I believe that the vgchange
> command that activates the LVs is actually in the initrd file before systemd
> comes up if a system was configured for LVM support. That's necessary, in
> fact, since the live root partition can be and often is an LV itself.
>
> As for systemd dependencies, that's something I've been doing a lot of
> tuning on myself, as things like my backup system won't work if certain
> volumes aren't mounted, so I've had to add "RequiresVolume" dependencies,
> plus some daemons require other daemons. So it's an interesting dance.
>
> At this point I think that the best way to ensure that all LVs are online
> would be to add overrides under /etc/systemd/system/ceph.service (probably
> needs the fsid in the service name, too). Include a beforeStartup command
> that scans the proc ps list and loops until the vgscan process no longer
> shows up (command completed).

I was thinking ExecStartPre=/bin/sleep 60 or so as an override to keep it
simple, but feel free to get surgical (drop-in sketch below). With of course
Ansible or other automation to persist the override for new/updated/changed
hosts.

> But I really would reconsider both your host and OSD count. Larger OSDs and
> more hosts would give better reliability and performance.

Indeed. If such a chassis is picked due to perceived cost savings over all
else, there is the cost of not doing the job, but moreover having only 4
hosts prevents the use of a reasonably wide EC profile, which probably costs
more in CapEx than having a larger number of more conventional chassis.
I think we haven’t seen the OP’s drive size, but I’ll bet that they’re
already at least 20TB HDDs, with the usual SATA bottleneck. Ultradense
toploaders can also exhibit HBA and backplane saturation.

I think you may have meant “Fewer OSDs per host and more hosts”, Sir
Enchanter.

> Tim
>
> On 4/11/25 03:53, Alex from North wrote:
>> Hello Tim! First of all, thanks for the detailed answer!
>> Yes, probably a setup of 4 nodes with 116 OSDs each looks a bit
>> overloaded, but what if I have 10 nodes? Yes, the nodes themselves are
>> still heavy, but in a row it seems to be not that dramatic, no?
>>
>> However, in the docs I see that it is quite common for systemd to fail on
>> boot, and they even show a way to escape:
>>
>> ```
>> It is common to have failures when a system is coming up online. The
>> devices are sometimes not fully available and this unpredictable behavior
>> may cause an OSD to not be ready to be used.
>>
>> There are two configurable environment variables used to set the retry
>> behavior:
>>
>> CEPH_VOLUME_SYSTEMD_TRIES: Defaults to 30
>>
>> CEPH_VOLUME_SYSTEMD_INTERVAL: Defaults to 5
>> ```
>>
>> But where should I set these vars? If I set them as ENV vars in root's
>> bashrc it doesn't seem to work, as Ceph starts at boot time, when root env
>> vars are not active yet...
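Since mon_osd_down_out_subtree_limit came up: a minimal sketch, assuming a
release with the centralized config store. The default limit is "rack", so a
whole host going down still gets its OSDs marked out and rebalanced; raising
the limit to "host" makes the cluster wait for an operator decision instead:

```
# Don't automatically mark OSDs out when a host-sized (or larger) CRUSH
# subtree goes down; an operator decides whether to rebalance.
ceph config set mon mon_osd_down_out_subtree_limit host

# Confirm what the mons are actually running with
ceph config get mon mon_osd_down_out_subtree_limit
```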
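On the override: a drop-in is the usual route rather than editing the unit,
so package updates don't clobber it. A rough sketch, with the caveat that the
drop-in path below assumes plain ceph-osd@ packaging (under cephadm the units
are named ceph-<fsid>@osd.<id>.service, so the directory name changes), the
binary paths may differ per distro, and the timeouts are arbitrary starting
points:

```
# /etc/systemd/system/ceph-osd@.service.d/wait-for-lvm.conf  (illustrative path)
[Service]
# Blunt version: give a slow backplane a fixed head start.
#ExecStartPre=/bin/sleep 60

# Less blunt: wait for udev to finish processing block devices, then make
# sure every volume group is activated. The leading "-" tells systemd to
# carry on even if vgchange returns non-zero.
ExecStartPre=/usr/bin/udevadm settle --timeout=120
ExecStartPre=-/usr/sbin/vgchange -ay
```

After creating the file, systemctl daemon-reload picks it up, and the same
file is easy to template out from Ansible as you suggest.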
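And on the CEPH_VOLUME_SYSTEMD_TRIES / CEPH_VOLUME_SYSTEMD_INTERVAL question
in the quoted message: those are read by the ceph-volume systemd helper
during OSD activation, not by a login shell, so root's bashrc never comes
into play. One way to set them is a drop-in on the ceph-volume@ unit
template; a sketch, with values picked arbitrarily to be more patient than
the 30 x 5 s defaults (this applies to ceph-volume/ceph-osd packaging -- as
far as I know cephadm-managed OSDs don't activate through this unit):

```
# /etc/systemd/system/ceph-volume@.service.d/retries.conf
[Service]
# Retry OSD activation up to 60 times, 10 seconds apart, instead of the
# default 30 x 5 s, for devices that are slow to appear at boot.
Environment=CEPH_VOLUME_SYSTEMD_TRIES=60
Environment=CEPH_VOLUME_SYSTEMD_INTERVAL=10
```

Then systemctl daemon-reload so the drop-in is in effect before the next
boot.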
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io