Hi,
this is kind of a follow-up to a two-year-old thread [0], and I wanted
to raise some awareness for the corresponding tracker [1].
Back then we managed to limit the impact on mon store performance
with some paxos configs, but now the OSDs are impacted as well
whenever new ones are created:
Each newly created OSD process grows to around 140 GB of RAM usage
within a few minutes, easily triggering the OOM killer on hosts if
multiple OSDs are created at once. The resident RAM usage drops back
to the memory target once the OSD has successfully booted. The reason
is the purged_snaps that are loaded during OSD boot
(snap_mapper.record_purged_snaps purged_snaps); two years ago the
customer had more than 42 million purged_snap entries. I don't know
how many there are today since I don't have access myself, but I'll
try to get a current number.
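
For anyone who wants to check their own cluster, here's a minimal
sketch of how such entries can be counted. It assumes the tool is run
against a stopped mon (or a copy of its store directory), since
ceph-monstore-tool needs exclusive access to the store; the mon path
is just an example:

  # Count purged_snap keys in the mon store.
  # Run against a stopped mon or a copy of its store directory;
  # the path below is an example.
  ceph-monstore-tool /var/lib/ceph/mon/ceph-a dump-keys | grep purged_snap | wc -l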
Anyway, the only way to safely create OSDs is one by one, maybe two
at a time depending on the host's RAM capacity. Automated
(unattended) OSD deployment is currently not possible, so OSD
creation has to be serialized, along the lines of the sketch below.
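
To illustrate the workaround, here is a minimal sketch of how OSD
creation could be serialized on a cephadm-managed cluster. It assumes
ceph orch is in use, HOST and DEVICES are placeholders, and the "all
OSDs up" condition is a simplification of a proper health check:

  #!/bin/bash
  # Serialize OSD creation: add one device, then wait until the new
  # OSD has booted before touching the next one.
  HOST=ceph-node1                      # placeholder
  DEVICES="/dev/sdb /dev/sdc /dev/sdd" # placeholders

  for dev in $DEVICES; do
      before=$(ceph osd ls | wc -l)
      ceph orch daemon add osd "$HOST:$dev"
      # "ceph orch daemon add" returns before the OSD is up, so poll
      # until the OSD count has grown and every OSD reports up, i.e.
      # the new daemon is past the purged_snaps loading phase and
      # should have dropped back to its memory target.
      while true; do
          read -r num up <<< "$(ceph osd stat -f json | jq -r '"\(.num_osds) \(.num_up_osds)"')"
          [ "$num" -gt "$before" ] && [ "$up" -eq "$num" ] && break
          sleep 30
      done
  done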
Unfortunately, the new tool [2] doesn't seem to work as expected; at
least in my test cluster it didn't have any impact on the number of
purged_snaps in the mon store. That's why we haven't tried it on the
customer cluster(s) yet.
How do other operators/admins/users deal with this kind of scenario?
Having many snapshots can't be a corner case, but I can't remember
having read anything like this on the list(s). I'd appreciate any
comments, although I'm aware that everybody is probably busy
travelling to Cephalocon. ;-)
Thanks!
Eugen
[0] https://lists.ceph.io/hyperkitty/list/[email protected]/thread/ZEMGKBLMEREBZB7SWOLDA6QZX3S7FLL3/#ZEMGKBLMEREBZB7SWOLDA6QZX3S7FLL3
[1] https://tracker.ceph.com/issues/64519
[2] https://github.com/ceph/ceph/pull/57548