Hi,

this is kind of a follow-up to a two-year-old thread [0], and I wanted to raise some awareness for the corresponding tracker [1].

Back then we managed to limit the impact on mon store performance with some paxos configs, but now newly created OSDs are impacted as well:

Each newly created OSD process grows to around 140 GB of RAM within a few minutes, easily triggering the OOM killer on hosts if multiple OSDs are created at once. The resident RAM usage drops back to the memory target once the OSD has successfully booted. The reason is the purged_snaps that are loaded during OSD boot (snap_mapper.record_purged_snaps purged_snaps); two years ago the customer had more than 42 million purged_snap entries. I don't know how many there are today since I don't have access myself, but I'll try to get a current number (a sketch of how one might count them is below).

Anyway, the only way to safely create OSDs is one by one, maybe two at a time depending on the host's RAM capacity (roughly what the second sketch below does), so an automated (unattended) OSD deployment is currently not possible. Unfortunately, the new tool [2] doesn't seem to work as expected; at least in my test cluster it didn't have any impact on the number of purged_snaps in the mon store. That's why we haven't tried it on the customer cluster(s) yet.
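For reference, here is a minimal sketch of how one might count the purged_snap entries in a mon store. It assumes (not confirmed anywhere in this thread) that the keys live under the "osd_snap" prefix with names starting with "purged_snap_", that ceph-kvstore-tool is available, and it must be run against a stopped mon or a copy of its store.db; the store path is just an example:

#!/usr/bin/env python3
"""Count purged_snap keys in a mon store (rough sketch).

Assumptions: purged snaps are stored under the 'osd_snap' prefix
with key names starting 'purged_snap_'. Run only against a
*stopped* mon or a copy of its store.db.
"""
import subprocess
import sys

# Example path; pass your own store.db as the first argument.
STORE = sys.argv[1] if len(sys.argv) > 1 else "/var/lib/ceph/mon/ceph-a/store.db"

# 'list <prefix>' prints one "<prefix>\t<key>" pair per line.
out = subprocess.run(
    ["ceph-kvstore-tool", "rocksdb", STORE, "list", "osd_snap"],
    check=True, capture_output=True, text=True,
).stdout

count = sum(1 for line in out.splitlines() if "purged_snap_" in line)
print(f"purged_snap keys: {count}")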
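And in case it helps others hitting the same thing, this is roughly what "one by one" could look like in practice. It's a sketch assuming a cephadm cluster with the ceph CLI available; the host/device inventory and the timeout are placeholders, and the up-check just parses 'ceph osd tree' JSON:

#!/usr/bin/env python3
"""Create OSDs strictly one at a time (sketch, not a supported tool).

Each OSD must finish booting (status 'up') before the next one is
created, so at most one process at a time goes through the huge
purged_snaps load.
"""
import json
import subprocess
import time

# Placeholder inventory: adjust to your cluster.
DEVICES = [("host1", "/dev/sdb"), ("host1", "/dev/sdc"), ("host2", "/dev/sdb")]
BOOT_TIMEOUT = 1800  # seconds; booting can take a while with many purged_snaps


def up_osd_ids():
    """Return the set of OSD ids currently reported 'up' by the cluster."""
    out = subprocess.run(["ceph", "osd", "tree", "-f", "json"],
                         check=True, capture_output=True, text=True).stdout
    return {n["id"] for n in json.loads(out)["nodes"]
            if n.get("type") == "osd" and n.get("status") == "up"}


for host, dev in DEVICES:
    before = up_osd_ids()
    subprocess.run(["ceph", "orch", "daemon", "add", "osd", f"{host}:{dev}"],
                   check=True)
    deadline = time.time() + BOOT_TIMEOUT
    # Wait until this new OSD has booted before adding the next one.
    while time.time() < deadline:
        if up_osd_ids() - before:
            break
        time.sleep(10)
    else:
        raise RuntimeError(f"OSD on {host}:{dev} did not come up in time")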

How do other operators/admins/users deal with this kind of scenario? Having many snapshots can't be a corner case, but I can't remember reading anything like this on the list(s). I'd appreciate any comments, although I'm aware that everybody is probably busy travelling to Cephalocon. ;-)

Thanks!
Eugen

[0] https://lists.ceph.io/hyperkitty/list/[email protected]/thread/ZEMGKBLMEREBZB7SWOLDA6QZX3S7FLL3/#ZEMGKBLMEREBZB7SWOLDA6QZX3S7FLL3
[1] https://tracker.ceph.com/issues/64519
[2] https://github.com/ceph/ceph/pull/57548