Hi Alex,
I think one of the scariest things about your setup is that there are
only 4 nodes (I'm assuming that means Ceph hosts carrying OSDs). I've
been bouncing around different configurations lately, between deployment
issues and cranky old hardware, and I'm presently down to 4 hosts with
1-2 OSDs per host. If even one of those hosts goes down, Ceph
gets unhappy. If 2 are offline at once, Ceph goes into self-defense
mode. I'd hate to think of 116 OSDs at risk on a single host.
I got curious about when LVM comes online, and I believe the vgchange
command that activates the LVs is actually run from the initrd, before
systemd comes up, if the system was built with LVM support. That's
necessary, in fact, since the live root filesystem can be, and often is,
an LV itself.
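If you want to double-check that on a given host, you can inspect the
initrd contents. A minimal sketch, assuming a dracut-based distro (on
Debian/Ubuntu the equivalent tool is lsinitramfs):
```
# List LVM-related pieces baked into the current initrd (dracut-based systems).
# On Debian/Ubuntu: lsinitramfs /boot/initrd.img-$(uname -r) | grep -i lvm
lsinitrd /boot/initramfs-$(uname -r).img | grep -i lvm
```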
As for systemd dependencies, that's something I've been doing a lot of
tuning on myself, as things like my backup system won't work if certain
volumes aren't mounted, so I've had to add "RequiresMountsFor="
dependencies, plus some daemons require other daemons. So it's an
interesting dance.
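For what it's worth, here's a minimal sketch of how I express the
"volume must be mounted" part; the unit name and mount point are just
placeholders, not anything from your setup:
```
# /etc/systemd/system/backup.service.d/override.conf  (unit name and path are placeholders)
[Unit]
# Pull in the mount unit for the volume and order this service after it.
RequiresMountsFor=/srv/backup
```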
At this point I think that the best way to ensure that all LVs are
online would be to add a drop-in override under /etc/systemd/system/ for
the OSD service (the unit name probably needs the fsid in it, too).
Include an ExecStartPre= command that scans the process list and loops
until vgscan no longer shows up (i.e. the command has completed).
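Roughly like this, as a sketch only; the drop-in path and unit name are
assumptions (on a cephadm host the unit is more like
ceph-<fsid>@osd.N.service), and the commented Environment= lines are
where the retry variables from the docs you quoted would have to live,
since a unit's environment is what counts at boot, not root's bashrc (I
believe those variables are read by the ceph-volume@ units, so that may
be the better unit to override for them):
```
# /etc/systemd/system/ceph-osd@.service.d/wait-for-lvm.conf  (path and unit are assumptions)
[Service]
# Hold the OSD until any boot-time vgscan run has finished.
ExecStartPre=/bin/sh -c 'while pgrep -x vgscan >/dev/null; do sleep 1; done'

# If you use the ceph-volume retry knobs, they also belong in a unit's
# environment, e.g. (likely in a drop-in for ceph-volume@.service rather than here):
# Environment=CEPH_VOLUME_SYSTEMD_TRIES=30
# Environment=CEPH_VOLUME_SYSTEMD_INTERVAL=5
```
After adding a drop-in you need a systemctl daemon-reload before it
takes effect.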
But I really would reconsider both your host and OSD count. Larger OSDs
and more hosts would give better reliability and performance.
Tim
On 4/11/25 03:53, Alex from North wrote:
Hello Tim! First of all, thanks for the detailed answer!
Yes, a setup of 4 nodes with 116 OSDs each probably looks a bit overloaded,
but what if I have 10 nodes? Yes, the nodes themselves are still heavy, but
taken together it doesn't seem that dramatic, no?
However, in the docs I see that it is quite common for systemd to fail on
boot, and they even show a way to work around it.
```
It is common to have failures when a system is coming up online. The devices
are sometimes not fully available and this unpredictable behavior may cause an
OSD to not be ready to be used.
There are two configurable environment variables used to set the retry behavior:
CEPH_VOLUME_SYSTEMD_TRIES: Defaults to 30
CEPH_VOLUME_SYSTEMD_INTERVAL: Defaults to 5
```
But where should I set these vars? If I set them as env vars in root's
bashrc, it doesn't seem to work, as ceph starts at boot time when root's
env vars are not active yet...
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io