Hi Venky,
Thanks for your answer.

No, we are not using snapshots.
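
For context, the 4.5 TB quota discussed below is a standard CephFS directory quota, which is managed through extended attributes on the directory. A minimal sketch, assuming a hypothetical mount point /mnt/cephfs/area:

```shell
# Set a 4.5 TB quota on a CephFS directory (the path is hypothetical).
setfattr -n ceph.quota.max_bytes -v 4500000000000 /mnt/cephfs/area

# Read the quota back to verify it.
getfattr -n ceph.quota.max_bytes /mnt/cephfs/area
```

Note that quotas track logical usage (what "du" sees), so space still held by deleted-but-not-yet-purged files does not count against the quota -- which is consistent with the discrepancy described below.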

Regards, Massimo

On Wed, Oct 29, 2025 at 3:05 PM Venky Shankar <[email protected]> wrote:

> Hi Massimo,
>
> On Tue, Oct 28, 2025 at 5:30 PM Massimo Sgaravatto
> <[email protected]> wrote:
> >
> > Dear all
> >
> > We have a portion of a Cephfs file system that maps to a ceph pool called
> > cephfs_data_ssd.
> >
> >
> > If I perform a "du -sh" on this portion of the file system, I see that
> the
> > value matches the "STORED" field of the "ceph df" output for the
> > cephfs_data_ssd pool.
> >
> > So far so good.
> >
> > I set a quota of 4.5 TB for this file system area.
> >
> >
> > During the weekend, this pool (and other pools of the same device class)
> > became nearfull.
> >
> > A "ceph df" showed that the problem was indeed in the cephfs_data_ssd
> > pool, with a reported usage of 7 TiB of data (21 TiB in replica 3):
> >
> > POOL            ID PGS STORED  OBJECTS USED   %USED MAX AVAIL
> > cephfs_data_ssd 62 32  7.1 TiB 2.02M   21 TiB 89.06 898 GiB
> >
> >
> > This sounds strange to me because I set a quota of 4.5 TB in that area,
> and
> > because a "du -sh" of the relevant directory showed a usage of 600 GB.
> >
> >
> > When I lowered the disk quota from 4.5 TB to 600 GB, the jobs writing in
> > that area failed (with "Disk quota exceeded" errors), and after a while
> > the space was released.
> >
> >
> > The only explanation I can think of is that, as far as I understand,
> > cephfs can take a while to release the space for deleted files
> > (https://docs.ceph.com/en/reef/dev/delayed-delete/).
> >
> >
> > This would also be consistent with the fact that some jobs appear to
> > have been performing a lot of writes and deletions (they kept writing a
> > ~5 GB checkpoint file and deleting the previous one after each iteration).
>
> That's likely what is causing the high pool usage -- the files are
> logically gone (du doesn't see them), but their objects are still in
> the data pool consuming space, not yet deleted by the MDS purge queue
> for some reason. Do you use snapshots?
>
> >
> >
> >
> > How can I understand from the log files if this was indeed the problem?
> >
> > Or do you have some other possible explanations for this problem?
> >
> > And, most importantly, how can I prevent scenarios such as this one?
> >
> > Thanks, Massimo
> > _______________________________________________
> > ceph-users mailing list -- [email protected]
> > To unsubscribe send an email to [email protected]
> >
>
>
> --
> Cheers,
> Venky
>
>
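
For anyone diagnosing the same symptom: the MDS exposes purge-queue counters in its perf counters, which show whether deletions are backing up. A minimal sketch, assuming a hypothetical active MDS daemon named mds.a:

```shell
# Dump only the purge_queue section of the MDS perf counters.
# "mds.a" is a placeholder for your active MDS daemon name.
ceph tell mds.a perf dump purge_queue

# Relevant counters include:
#   pq_executing        - purges currently in flight
#   pq_item_in_journal  - deleted items still queued in the purge journal
# A large, persistent pq_item_in_journal means freed space is lagging
# well behind file deletions.
```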
