Dear all,

We have a portion of a CephFS file system that maps to a Ceph pool called cephfs_data_ssd.
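For context, the directory is tied to that pool via the usual file layout vxattr, and the quota and recursive size can be checked the same way; /cephfs/our_area below is just a placeholder path:

  # data pool the directory layout points to
  getfattr -n ceph.dir.layout.pool /cephfs/our_area
  # quota set on the directory, in bytes (0 = no quota)
  getfattr -n ceph.quota.max_bytes /cephfs/our_area
  # recursive byte count tracked by the MDS (what "du -sh" should roughly agree with)
  getfattr -n ceph.dir.rbytes /cephfs/our_area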
If I perform a "du -sh" on this portion of the file system, the value matches the STORED field reported by "ceph df" for the cephfs_data_ssd pool. So far so good.

I set a quota of 4.5 TB on this file system area. During the weekend, this pool (and other pools of the same device class) became nearfull. A "ceph df" showed that the problem was indeed in the cephfs_data_ssd pool, with a reported usage of 7 TiB of data (21 TiB with replica 3):

  POOL             ID  PGS  STORED   OBJECTS  USED    %USED  MAX AVAIL
  cephfs_data_ssd  62   32  7.1 TiB    2.02M  21 TiB  89.06    898 GiB

This sounds strange to me, because I had set a quota of 4.5 TB on that area, and because a "du -sh" of the relevant directory showed a usage of only 600 GB. When I lowered the quota from 4.5 TB to 600 GB, the jobs writing in that area failed (disk quota exceeded) and after a while the space was released.

The only explanation I can think of is that, as far as I understand, CephFS can take a while to release the space of deleted files (https://docs.ceph.com/en/reef/dev/delayed-delete/). This would also be consistent with the fact that some jobs were performing a lot of writes and deletions (they kept writing a ~5 GB checkpoint file and deleting the previous one after each iteration).

How can I tell from the log files whether this was indeed the problem? Or do you have some other possible explanations? And, most importantly, how can I prevent scenarios such as this one?

Thanks,
Massimo
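P.S. In case it is useful for the discussion, this is roughly what I am looking at now to see whether the purge of deleted files is lagging behind; mds.ceph-mds01 is just an example daemon name:

  # stray and purge-queue counters of the active MDS (run on the MDS host)
  ceph daemon mds.ceph-mds01 perf dump | grep -Ei 'stray|pq_'
  # current throttles on how fast deleted files are purged
  ceph config get mds mds_max_purge_files
  ceph config get mds mds_max_purge_ops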
