Hello,
As an update: we were able to clear the queue by repeering all PGs
which had outstanding entries in their snaptrim queues. After this
process completed and we confirmed that no PGs remained with non-zero
length queues, we re-enabled our snapshot schedule. Several days have
now passed and
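
For anyone hitting the same backlog, the repeer pass described above can be
sketched roughly like this. It's only a sketch: it assumes jq is installed
and that your Ceph release reports snap_trimq_len at the top level of
"ceph pg dump pgs --format json" (some releases nest the stats under
.pg_map, so adjust the jq path to taste):

```shell
# Repeer every PG that still has entries in its snaptrim queue.
ceph pg dump pgs --format json 2>/dev/null \
  | jq -r '.pg_stats[] | select(.snap_trimq_len > 0) | .pgid' \
  | while read -r pgid; do
      echo "repeering ${pgid}"
      ceph pg repeer "${pgid}"
    done
```

Re-run it until the dump shows no PGs with a non-zero queue, then it
should be safe to re-enable the snapshot schedule.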

Hi,
Yes, restarting an OSD also works to re-peer and "kick" the
snaptrimming process.
(In the ticket we first noticed this because snap trimming restarted
after an unrelated OSD crashed/restarted).
Please feel free to add your experience to that ticket.
> monitoring snaptrimq
This is from our lo
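
For a rough cluster-wide gauge of the backlog, something like the
following works (again assuming jq, and the snap_trimq_len field that
recent Ceph releases expose in the pg dump):

```shell
# Total snaptrim queue length across all PGs; should trend to 0
# once trimming is healthy. Adjust the jq path if your release
# nests the stats under .pg_map.
ceph pg dump pgs --format json 2>/dev/null \
  | jq '[.pg_stats[].snap_trimq_len] | add'
```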

Dan,
Thank you for replying. Since I posted I did some more digging. It
really seemed as if snaptrim simply wasn't being processed. The output
of "ceph health detail" showed that PG 3.9b had the longest queue. I
examined this PG and saw that its primary was osd.8, so I manually
restarted that daemon.
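
The lookup-and-restart step can be sketched as follows. The PG id 3.9b
comes from the health output above; the ceph-osd@ unit name assumes a
package-based (non-cephadm) deployment, where cephadm setups use a
different systemd unit:

```shell
# Find the acting primary of the slow PG...
primary=$(ceph pg map 3.9b --format json | jq -r '.acting_primary')
echo "primary of 3.9b is osd.${primary}"

# ...then, on the host carrying that OSD, restart the daemon to
# force a repeer and kick snaptrimming:
sudo systemctl restart "ceph-osd@${primary}"
```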

Hi David,
We observed the same here: https://tracker.ceph.com/issues/52026
You can poke the trimming by repeering the PGs.
Also, depending on your hardware, the defaults for osd_snap_trim_sleep
might be far too conservative.
We use osd_snap_trim_sleep = 0.1 on our mixed hdd block / ssd block.db OSDs.
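
On releases with the central config store, that value can be applied
cluster-wide along these lines (note that a non-zero osd_snap_trim_sleep
overrides the per-device-class _hdd/_ssd/_hybrid defaults, so only set
it if that's what you want):

```shell
ceph config set osd osd_snap_trim_sleep 0.1
ceph config get osd osd_snap_trim_sleep
```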