Hi,

I just noticed an interesting behavior that I didn't find documented
anywhere. Normally, CephFS stores backtraces in both the default data
pool and the actual data pool a file's layout points to (if they differ):

$ for pool in cephfs2_data cephfs2_data_hr3 cephfs2_data_hec5.4; do
        echo "# Pool: $pool"
        rados -p $pool listxattr 100032d5676.00000000
done
# Pool: cephfs2_data
parent
# Pool: cephfs2_data_hr3
error getting xattr set cephfs2_data_hr3/100032d5676.00000000: (2) No
such file or directory
# Pool: cephfs2_data_hec5.4
layout
parent

However, I found that some files have backtrace objects in another pool:

$ for pool in cephfs2_data cephfs2_data_hr3 cephfs2_data_hec5.4; do
        echo "# Pool: $pool"
        rados -p $pool listxattr 100032c464a.00000000
done
# Pool: cephfs2_data
parent
# Pool: cephfs2_data_hr3
parent
# Pool: cephfs2_data_hec5.4
layout
parent

Apparently, this happens when you create a file in a directory that is
assigned to a non-default pool, and then change the file's pool before
writing any data to it. In other words, any pool that a file is *ever*
assigned to after being created empty ends up with a backtrace object.
This is visible in the backtrace object itself:

$ rados -p cephfs2_data getxattr 100032c464a.00000000 parent \
        | head -c -1 \
        | ceph-dencoder type inode_backtrace_t import - decode dump_json

{
    "ino": 1099564861002,
    "ancestors": [
<snip>
    ],
    "pool": 14,
    "old_pools": [
        11,
        12
    ]
}

Here pool 11 is the default pool, pool 14 is the desired pool, and pool
12 is the pool of the directory layout for the directory the file was
created in (before being reassigned prior to writing data).
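As a sanity check, the stray pools can be read straight out of that
decoded JSON. A minimal sketch, using the sample values from the dump
above (ancestors elided) and python3 standing in for jq:

```shell
# List pools that may hold a stale backtrace object: every entry in
# "old_pools" other than the current "pool".
json='{"ino": 1099564861002, "pool": 14, "old_pools": [11, 12]}'
echo "$json" | python3 -c '
import json, sys
bt = json.load(sys.stdin)
stale = sorted(p for p in bt["old_pools"] if p != bt["pool"])
print(" ".join(map(str, stale)))
'
# prints: 11 12
```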

I found this rather surprising. I'm migrating a large amount of data to
a new pool by setting file layouts before writing the data out to new
temporary files [1], and my temporary directory happened to be assigned
to a non-default pool, so now I have a bunch of garbage backtrace
objects in that pool (which happens to be an HDD pool, so this hurts
performance). I would have expected the directory's assigned layout
pool to be irrelevant if I override it right after opening a new file
for writing, but apparently that is not the case.
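For the record, the sequence that bit me can be sketched roughly like
this (hypothetical mount point and file names; pool names are the ones
from this post; needs a mounted CephFS, so untested as written):

```shell
# Directory assigned to a non-default data pool.
mkdir /mnt/cephfs/tmpdir
setfattr -n ceph.dir.layout.pool -v cephfs2_data_hr3 /mnt/cephfs/tmpdir

# Create an empty file; it inherits the directory's layout.
touch /mnt/cephfs/tmpdir/newfile

# Retarget the (still empty) file to the desired pool...
setfattr -n ceph.file.layout.pool -v cephfs2_data_hec5.4 \
        /mnt/cephfs/tmpdir/newfile

# ...and only then write data. The backtrace object now also lands in
# cephfs2_data_hr3, and that pool gets recorded in old_pools.
echo data > /mnt/cephfs/tmpdir/newfile
```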

I guess this is so existing clients can find the real pool if they have
the old pool cached or something? I'm not entirely sure why this
behavior exists.

It would be nice if there were some way to clean this up without having
to rewrite all the files again (it's terabytes of data). I don't mind if
I have to take the FS down for a bit to run a cleanup tool, but doing
manual surgery on the backtrace objects and inode omap values seems
pretty difficult and error-prone :/
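In case it helps anyone else poking at this, here's roughly how I'd
imagine at least *enumerating* the strays first. This is an untested
sketch with assumed names (POOL/POOL_ID are placeholders; the pool id
comes from "ceph osd pool ls detail"), and it only reports candidates;
actually deleting backtrace objects while old_pools still references
them is exactly the surgery I'd rather not improvise:

```shell
# Untested sketch: find header objects (*.00000000) in a suspect pool
# whose decoded backtrace says the file's real pool is elsewhere.
POOL=cephfs2_data_hr3
POOL_ID=12   # assumed numeric id of $POOL

rados -p "$POOL" ls | grep '\.00000000$' | while read -r obj; do
        real=$(rados -p "$POOL" getxattr "$obj" parent \
                | head -c -1 \
                | ceph-dencoder type inode_backtrace_t import - decode dump_json \
                | python3 -c 'import json,sys; print(json.load(sys.stdin)["pool"])')
        if [ "$real" != "$POOL_ID" ]; then
                echo "stray backtrace candidate: $obj (real pool $real)"
        fi
done
```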

[1] https://gist.github.com/marcan/26cc3ac7241f866dca38916215dd10ff

- Hector
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io