I think there's some more investigation required to get to the bottom of this. Do you by any chance have a snap schedule enabled which would create snapshots automatically? Do you see that many snapshots in the 'rbd -p <pool> ls --long' output?
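If you just want a per-image snapshot count rather than the full listing, something along these lines should do as a rough sketch (assuming the pool is called 'volumes'; adjust to your pool names):

  # count snapshots per image; tail strips the header line of 'rbd snap ls'
  for img in $(rbd -p volumes ls); do
      printf '%s: ' "$img"
      rbd snap ls "volumes/$img" | tail -n +2 | wc -l
  done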

By the way, having millions of purged snapshots can have quite a heavy impact (https://lists.ceph.io/hyperkitty/list/[email protected]/thread/YRY2CGWSFHTEMXYPYL2CUGK6XOQDG3Z2/).

I'm still not sure what to think about the 100% used output, though.
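One way to cross-check is to compare the plain and machine-readable views of the pool stats; just a sketch, the exact JSON field names may differ slightly between releases:

  ceph df detail
  # per-pool numbers in JSON, e.g. percent_used and max_avail
  ceph df -f json | jq '.pools[] | {name: .name, percent_used: .stats.percent_used, max_avail: .stats.max_avail}'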


Quoting Eugen Block <[email protected]>:

I don’t have much time right now to look deeper, but I agree, the 100% used column is something to look into. That might either be the root cause of the snaptrims not happening, or at least play some role in it.

Quoting Lukasz Gomulka <[email protected]>:

Hello!
Thank you for your answer.
Right now we have the problem on a cluster with NVMe OSDs, so the DB is naturally also on NVMe. Last time we had the issue on HDD+NVMe (NVMe for DB/WAL). We are now on the mClock scheduler. Generally we are able to snaptrim, but it costs CPU resources; we can snaptrim thousands of snapshots per day (a rough way to watch the drain rate is sketched below). The main problem and question still remain: why did we get 2.5M snaps to remove without any apparent reason, what should we do, and how can we prevent it? If the 2.5M snaps appeared within seconds, there is a chance they could also be removed within a few seconds, e.g. if it is some DB corruption that is easy to clean up.
Current cluster:

version - 17.2.7
mclock scheduler
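For reference, a rough way to watch how fast the trimming actually progresses is to look at the PG states (just a sketch):

  # PGs currently trimming vs. still waiting to trim (output includes a header line)
  ceph pg ls snaptrim | wc -l
  ceph pg ls snaptrim_wait | wc -l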

I have also attached the outputs of:

ceph -s
ceph osd df tree
ceph osd pool ls detail
ceph df


I have to add that in the JSON output there is no entry under removed_snaps, but in the normal output we can see this under removed_snaps_queue:

pool 1 'volumes' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 8192 pgp_num 8192 autoscale_mode off last_change 667587 lfor 0/0/1211 flags hashpspool,selfmanaged_snaps stripe_width 0 target_size_ratio 0.5 application rbd
    removed_snaps_queue [58ae2~3,58af4~3,58cd5~5,58cdd~3
   ... million+ ...
c59ae~3,4c59b6~3,4c59bc~3]

Similarly malformed output is in the attached `ceph df`. The pool of course still has available space, but the command shows `0%`.
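To get an approximate feel for how much is still queued, a very crude sketch is to count the interval entries across all pools (each entry looks like 58ae2~3 and can cover more than one snap id):

  ceph osd pool ls detail | tr ',[]' '\n' | grep -c '~'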

Best Regards,
Lukasz Lucki Gomulka

________________________________
From: Eugen Block <[email protected]>
Sent: 03 November 2025 11:53:06
To: [email protected]
Subject: [ceph-users] Re: Snaptrim flood

Hi,

are you using the mclock scheduler (default in Quincy)? Until Reef 18.2.4
there was a default value set for osd_snap_trim_cost (1M bytes) which
blocked snaptrims [0]. This was fixed in [1] and backported to Reef
(a quick check of the effective value is sketched below). But it's
unlikely that this was your issue on Octopus, as mclock only became
the default in Quincy, IIRC. Since Quincy is also EOL, I'd recommend
updating further, if possible.
Were you able to avoid OSD flapping with the nodown flag (ceph osd set
nodown)? This can help to keep the cluster more stable in such
situations.
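A quick way to check the effective snap trim cost and to handle the
nodown flag, just as a sketch:

  # value from the config db/defaults vs. what a running OSD actually uses
  ceph config get osd osd_snap_trim_cost
  ceph config show osd.0 osd_snap_trim_cost    # pick any OSD id

  # keep OSDs from being marked down while they are busy, clear it afterwards
  ceph osd set nodown
  ceph osd unset nodown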
Can you add some more details about your setup like:

ceph -s
ceph osd df tree
ceph osd pool ls detail
ceph df

Are you using HDD OSDs or HDDs with dedicated DB/WAL? How many
snapshots are you generating?
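A quick way to answer that per OSD is the daemon metadata (rough
sketch; field names can vary a bit between releases):

  # rotational flags and whether a dedicated DB device is in use
  ceph osd metadata <osd-id> | grep -E 'rotational|dedicated|bluefs_db'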

Regards,
Eugen

[0] https://tracker.ceph.com/issues/67702
[1] https://tracker.ceph.com/issues/63604
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]

