Hi, I just noticed an interesting behavior that I didn't find documented anywhere. Normally, CephFS stores backtraces in both the default data pool and the actual data pool a file's layout points to (if different):
$ for pool in cephfs2_data cephfs2_data_hr3 cephfs2_data_hec5.4; do
    echo "# Pool: $pool"
    rados -p $pool listxattr 100032d5676.00000000
  done
# Pool: cephfs2_data
parent
# Pool: cephfs2_data_hr3
error getting xattr set cephfs2_data_hr3/100032d5676.00000000: (2) No such file or directory
# Pool: cephfs2_data_hec5.4
layout
parent

However, I found that some files have backtrace objects in another pool:

$ for pool in cephfs2_data cephfs2_data_hr3 cephfs2_data_hec5.4; do
    echo "# Pool: $pool"
    rados -p $pool listxattr 100032c464a.00000000
  done
# Pool: cephfs2_data
parent
# Pool: cephfs2_data_hr3
parent
# Pool: cephfs2_data_hec5.4
layout
parent

Apparently, this happens when you create a file in a directory that is assigned to a non-default pool, and then change the pool before writing any data into the object. In other words, any pool that a file is *ever* assigned to after being created empty ends up with a backtrace object. This is visible in the backtrace object itself:

$ rados -p cephfs2_data getxattr 100032c464a.00000000 parent \
    | head -c -1 \
    | ceph-dencoder type inode_backtrace_t import - decode dump_json
{
    "ino": 1099564861002,
    "ancestors": [
        <snip>
    ],
    "pool": 14,
    "old_pools": [
        11,
        12
    ]
}

Here pool 11 is the default pool, pool 14 is the desired pool, and pool 12 is the pool of the directory layout for the directory the file was created in (before being reassigned prior to writing data).

I found this rather surprising, as I'm migrating a large amount of data to a new pool by setting file layouts before writing out data to the new temporary file [1], and my temporary directory happened to be assigned to a non-default pool. So now I have a bunch of garbage objects in that pool (which happens to be an HDD pool, so this hurts performance). I would have expected the directory's assigned layout pool to be irrelevant if I override it after opening a new file for writing, but apparently this is not the case.
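As an aside, if anyone wants to go from the decoded "ino" field back to the object name to inspect: CephFS names a file's data objects as the inode number in hex, a dot, then the object index as eight hex digits, which is how 1099564861002 above corresponds to 100032c464a.00000000. A small sketch (the helper name is mine, not a Ceph API):

```python
# Sketch: map a CephFS inode number to the RADOS object name of one of its
# data objects. The backtrace "parent" xattr lives on object index 0.
def cephfs_object_name(ino: int, index: int = 0) -> str:
    # "<inode in lowercase hex>.<object index as 8 hex digits>"
    return f"{ino:x}.{index:08x}"

# The "ino" from the ceph-dencoder dump maps back to the object queried above:
print(cephfs_object_name(1099564861002))  # 100032c464a.00000000
```

This is handy for turning inode numbers from MDS logs or dumps into rados commands against each pool listed in "pool"/"old_pools".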
I guess this is so existing clients can find the real pool if they have the old pool cached or something? I'm not entirely sure why this behavior exists.

It would be nice if there were some way to clean this up without having to rewrite all the files again (it's terabytes of data). I don't mind if I have to take the FS down for a bit to run a cleanup tool, but doing manual surgery on the backtrace objects and inode omap values seems pretty difficult and error-prone :/

[1] https://gist.github.com/marcan/26cc3ac7241f866dca38916215dd10ff

- Hector
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io