Hi all, we've been seeing persistent problems when creating hard links on 
CephFS: link() returns EXDEV in a way that makes no sense given typical 
POSIX behaviour and the Ceph documentation. Here's a typical strace of the problem:

        78    13:47:26.572435 link("/data/db/hdb/data/2023.08.06/table1.0/column1", "/data/db/hdb/data/2023.08.06/table1.1/column1") = -1 EXDEV (Invalid cross-device link)
        78    13:47:26.577661 write(1, "{\"time\":\"2025-03-03T13:47:26.577z\",\"component\":\"MSVC\",\"level\":\"INFO\",\"message\":\"[eoi-78] Retrying in 500 milliseconds\",\"service\":\"eoi\"}\n", 136) = 136
        78    13:47:26.577762 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=500000000}, NULL) = 0
        78    13:47:27.078037 link("/data/db/hdb/data/2023.08.06/table1.0/column1", "/data/db/hdb/data/2023.08.06/table1.1/column1") = 0

We try creating a link, get EXDEV, wait 500 milliseconds, then try the same 
operation again and it succeeds. The link and its target are both on the same 
cephfs mount (/data/db/hdb in this case), so the normal POSIX 'linking between 
filesystems' explanation doesn't apply.  I've looked through the ceph client 
and server code and from what I've seen EXDEV is only returned in a couple of 
other situations: linking between snapshots, and linking across quotas. Neither 
snapshots nor quotas were in use here, and if they were the culprit it seems 
unlikely the automatic retry would have worked. Web searches on EXDEV errors in 
ceph have also proven to be a dead end. My best guess, although it's not a very 
good one, is that stale MDS cache data is somehow involved -- in one case the 
issue reportedly got much worse after increasing (!) the MDS memory limit.
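
For anyone who wants to try reproducing this, here is a minimal Python sketch of the retry pattern visible in the strace, with one extra diagnostic: on each EXDEV it prints whether the source and the destination directory report the same st_dev. This is not our application's code. The names link_with_retry and same_device, and the attempts/delay defaults, are made up for illustration; only os.link, os.stat and errno.EXDEV are real.

```python
import errno
import os
import time


def same_device(src, dst):
    """Return True if src and the directory that will hold dst report the same st_dev."""
    dst_dir = os.path.dirname(os.path.abspath(dst))
    return os.stat(src).st_dev == os.stat(dst_dir).st_dev


def link_with_retry(src, dst, attempts=3, delay=0.5, link=os.link, sleep=time.sleep):
    """Hard-link src to dst, retrying only when the call fails with EXDEV.

    Returns the number of EXDEV failures seen before the link succeeded.
    The link and sleep parameters exist only so the loop can be exercised
    without a real CephFS mount (hypothetical test hooks).
    """
    failures = 0
    while True:
        try:
            link(src, dst)
            return failures
        except OSError as exc:
            # Any error other than EXDEV is passed straight to the caller.
            if exc.errno != errno.EXDEV:
                raise
            failures += 1
            if failures >= attempts:
                raise
            # Record what the client thinks about the two paths at failure time.
            print(f"EXDEV on link({src!r}, {dst!r}); "
                  f"same st_dev: {same_device(src, dst)}")
            sleep(delay)
```

If the st_dev check prints True at the moment of failure, that would at least confirm the client sees both paths on the same filesystem when EXDEV comes back.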

This error has been occurring for a particular client for upwards of 9 months 
and has proven stubbornly resistant to reproduction elsewhere (we are working 
on migrating them to a more recent ceph version to see if the error remains), 
so our technical investigations haven't got particularly far. I was hoping 
someone here on ceph-users would have seen similar EXDEV errors in the wild or 
in development and have some insight into what could be causing them.

Regards, Domhnall