Hi Venky,

Thanks for your answer. No, we are not using snapshots.
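For completeness, a quick way to double-check that: CephFS exposes snapshots through the hidden ".snap" directory of each folder, so an empty listing means no snapshots are pinning deleted objects in that subtree (the mount path below is just a placeholder for our directory under quota):

    # list CephFS snapshots for the subtree; no output means no snapshots
    ls /mnt/cephfs/<area_under_quota>/.snap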
Regards,
Massimo

On Wed, Oct 29, 2025 at 3:05 PM Venky Shankar <[email protected]> wrote:
> Hi Massimo,
>
> On Tue, Oct 28, 2025 at 5:30 PM Massimo Sgaravatto
> <[email protected]> wrote:
> >
> > Dear all
> >
> > We have a portion of a CephFS file system that maps to a Ceph pool
> > called cephfs_data_ssd.
> >
> > If I perform a "du -sh" on this portion of the file system, I see that
> > the value matches the "STORED" field of the "ceph df" output for the
> > cephfs_data_ssd pool.
> >
> > So far so good.
> >
> > I set a quota of 4.5 TB for this file system area.
> >
> > During the weekend, this pool (and other pools of the same device
> > class) became nearfull.
> >
> > A "ceph df" showed that the problem was indeed in the cephfs_data_ssd
> > pool, with a reported usage of 7 TiB of data (21 TiB in replica 3):
> >
> > POOL             ID  PGS  STORED   OBJECTS  USED    %USED  MAX AVAIL
> > cephfs_data_ssd  62   32  7.1 TiB    2.02M  21 TiB  89.06    898 GiB
> >
> > This sounds strange to me because I set a quota of 4.5 TB in that
> > area, and because a "du -sh" of the relevant directory showed a usage
> > of 600 GB.
> >
> > When I lowered the disk quota from 4.5 TB to 600 GB, the jobs writing
> > in that area failed (because of disk quota exceeded) and after a while
> > the space was released.
> >
> > The only explanation I can think of is that, as far as I understand,
> > CephFS can take a while to release the space for deleted files
> > (https://docs.ceph.com/en/reef/dev/delayed-delete/).
> >
> > This would also be consistent with the fact that it looks like some
> > jobs were performing a lot of writes and deletions (they kept writing
> > a ~5 GB checkpoint file, and deleting the previous one after each
> > iteration).
>
> That's likely what is causing the high pool usage -- the files are
> logically gone (du doesn't see them), but the objects are still lying
> in the data pool, consuming space, and aren't getting deleted by the
> purge queue in the MDS for some reason. Do you use snapshots?
>
> > How can I understand from the log files if this was indeed the
> > problem?
> >
> > Or do you have some other possible explanations for this problem?
> >
> > And, most important, how can I prevent scenarios such as this one?
> >
> > Thanks, Massimo
>
> --
> Cheers,
> Venky
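For the archives, in case someone else hits this: assuming the delayed-delete explanation is correct, the backlog of deleted-but-not-yet-purged files should be visible in the MDS perf counters. Something along these lines (the MDS name is a placeholder, and the counter names may differ slightly between releases):

    # stray / purge-queue statistics from the active MDS
    ceph tell mds.<name> perf dump | jq '.mds_cache.num_strays, .purge_queue'

A large and only slowly draining num_strays while the pool stays full would point at the purge queue rather than at live data.

As a possible mitigation (untested on our side; the option names are from the MDS configuration reference, the values are only illustrative), the purge rate is throttled by a few MDS options that can be raised so that space is reclaimed faster after mass deletions:

    # raise the MDS purge throttles (illustrative values)
    ceph config set mds mds_max_purge_files 256
    ceph config set mds mds_max_purge_ops 16384
    ceph config set mds mds_max_purge_ops_per_pg 1.0

As far as I understand, the quota alone cannot prevent this situation: quotas are enforced against the live (du-visible) data, while the nearfull condition here was caused by already-deleted objects still waiting in the purge queue.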
