Summary:
When creating a directory (mkdir), Lustre does not “sync” by default when there
is a single MDT. With multiple MDTs, when the child directory is created on a
different MDT than the parent (cross-MDT mkdir), Lustre does an osd_sync,
which we suspect is for atomicity. Our experiments show that if we disable the
osd_sync in the cross-MDT case, we don’t lose atomicity, and the system recovers
as long as at least one of the 3 hosts involved remains available (similar to
the single-MDT case). So we are wondering whether this osd_sync is needed in the
cross-MDT case, as the call to sync degrades performance.

Issue:
In a Lustre Distributed Namespace Environment (DNE) featuring multiple Metadata 
Targets (MDTs), the process of creating remote directories is notably slower 
compared to a single MDT file system utilizing the osd-zfs backend.

This performance issue can be consistently replicated using a single client, 
specifically by creating approximately 1000 child directories with the command 
lfs mkdir -i 1 . The parent directory is part of MDT-0, while the child 
directories are created on MDT-1, following a pattern such as /parent/child-0, 
/parent/child-1, etc.

  *   Creating 1000 child directories on the parent's MDT (MDT0) takes ~0.9 sec.
  *   Creating 1000 child directories on a remote MDT (parent directory on MDT0,
and child directories on MDT1) takes ~12 sec.

We also tested using mdtest with mpirun, involving two clients and 50
iterations. Directories are created in a round-robin fashion to utilize both
MDTs, using the command "mpirun -mca routed direct -map-by node -np 16
mdtest -n 625 -i 50 -u -d /lfs/mdtest".


                     Directory Operations/Sec
Operation            With Single MDT   With 2 MDTs   Performance degradation (%)
------------------   ---------------   -----------   ---------------------------
Directory creation   17260.653         856.898       95.04
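
For reference, the degradation percentage is the relative drop from the
single-MDT rate to the 2-MDT rate. A minimal C sketch of that arithmetic
follows; the function names are illustrative only and are not part of Lustre
or mdtest.

```c
#include <stdio.h>

/* Relative change from `base` to `value`, as a percentage of `base`.
 * Illustrative helper only; not part of Lustre or mdtest. */
double pct_change(double base, double value)
{
	return (value - base) / base * 100.0;
}

/* Print the drop of the 2-MDT rate relative to the single-MDT rate,
 * using the two mdtest figures from the table above. */
void print_degradation(void)
{
	double single_mdt = 17260.653;	/* directory creates/sec, single MDT */
	double two_mdts = 856.898;	/* directory creates/sec, 2 MDTs */

	printf("degradation: %.2f%%\n", -pct_change(single_mdt, two_mdts));
}
```

Calling print_degradation() prints "degradation: 95.04%", matching the table.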

Probable Cause:
The creation of a child directory on the same MDT as the parent does not force
an osd_sync.
The creation of a child directory on a different MDT than the parent triggers 
an osd_sync of the parent directory.

Directory creation first checks for, and cancels, any parent directory lock
acquired by an earlier operation. If that lock was taken as part of a previous
remote directory creation, it was granted in protected write (PW) mode, so
cancelling it requires a flush of the underlying directory. The cancellation
goes further than that, however: it forces a synchronization of the entire
underlying parent Metadata Target (MDT) device.

The conditions for enforcing the synchronization path are as follows:

  *   LDLM_CB_CANCELING and BLOCKING_SYNC_ON_CANCEL
  *   l_granted_mode is one of (LCK_EX | LCK_PW | LCK_GROUP)
  *   OBD_CONNECT_MDS_MDS bit set in l_export

Corresponding code links:

  *   The conditions above are checked at
https://github.com/lustre/lustre-release/blob/b2_15/lustre/target/tgt_handler.c#L1336-L1342
  *   The path that invokes the synchronization is at
https://github.com/lustre/lustre-release/blob/master/lustre/target/tgt_handler.c#L1381-L1394,
  provided that the locks are not taken with the LDLM_STRIPE option.
  *   The sync of the entire device is enforced at
https://github.com/lustre/lustre-release/blob/b2_15/lustre/target/tgt_handler.c#L1288
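
To make the combined condition easier to reason about, here is a small
standalone C model of it, as we read it from tgt_blocking_ast(). This is only a
sketch: the struct and function names are hypothetical, the enum values are
placeholders rather than the real Lustre definitions, and only the identifier
names mirror the Lustre source (SYNC_LOCK_CANCEL_NEVER is assumed here as the
"no sync" policy).

```c
#include <stdbool.h>

/* Placeholder values; NOT the real Lustre definitions. */
enum lock_mode { LCK_EX = 1 << 0, LCK_PW = 1 << 1, LCK_PR = 1 << 2,
		 LCK_GROUP = 1 << 3 };
enum res_type { LDLM_EXTENT, LDLM_IBITS };
enum sync_policy { SYNC_LOCK_CANCEL_NEVER, SYNC_LOCK_CANCEL_BLOCKING,
		   SYNC_LOCK_CANCEL_ALWAYS };

/* Hypothetical flattened view of the inputs the blocking AST looks at. */
struct cancel_ctx {
	bool canceling;			/* flag == LDLM_CB_CANCELING */
	unsigned int granted_mode;	/* lock->l_granted_mode */
	enum sync_policy policy;	/* tgt->lut_sync_lock_cancel */
	bool cbpending;			/* ldlm_is_cbpending(lock) */
	bool mds_mds_export;		/* OBD_CONNECT_MDS_MDS set on l_export */
	enum res_type res_type;		/* lock->l_resource->lr_type */
};

/* Returns true when cancelling the lock would take the sync path. */
bool sync_on_cancel(const struct cancel_ctx *c)
{
	return c->canceling &&
	       (c->granted_mode & (LCK_EX | LCK_PW | LCK_GROUP)) != 0 &&
	       (c->policy == SYNC_LOCK_CANCEL_ALWAYS ||
		(c->policy == SYNC_LOCK_CANCEL_BLOCKING && c->cbpending)) &&
	       (c->mds_mds_export || c->res_type == LDLM_EXTENT);
}
```

In this model, a PW lock held through an MDS-to-MDS export takes the sync path
regardless of the resource type, which is the case a remote mkdir hits.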

Experiment:
I did an experiment where I skipped the osd_sync on directory create, and saw 
the following results:

With a single client, creating approximately 1000 child directories with the
command lfs mkdir -i 1 . (the same test as above):

  *   Creating 1000 child directories on the parent's MDT (MDT0) takes ~1.6 sec.
  *   Creating 1000 child directories on a remote MDT (child directories on
MDT1) takes ~3.8 sec.

Results of the same mdtest test with mpirun:


                     Directory Operations/Sec on DNE filesystem
Operation            Default    Without osd_sync   Performance improvement (%)
------------------   --------   ----------------   ---------------------------
Directory creation   856.898    5659.511           560.46

We conducted crash testing with osd_sync disabled, specifically targeting 
remote directory creation, and observed the following outcomes:

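         Crash
Client   MDT0   MDT1   Filesystem State
------   ----   ----   ------------------------------------------------------
Yes      No     No     Recovered, healthy and could verify the directory tree
No       Yes    No     Recovered, healthy and could verify the directory tree
No       No     Yes    Recovered, healthy and could verify the directory tree
Yes      Yes    No     Recovered, healthy and could verify the directory tree
Yes      No     Yes    Recovered, healthy and could verify the directory tree
No       Yes    Yes    Recovered, healthy and could verify the directory tree
Yes      Yes    Yes    Lost some directory entries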

We are trying to understand the implications of disabling osd_sync. The POSIX
spec for mkdir does not explicitly require it to be synchronously durable
(https://pubs.opengroup.org/onlinepubs/9699919799/functions/mkdir.html) and
leaves it to the user to call fsync.
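
For illustration, an application that does need the new entry to be durable
could follow the generic POSIX pattern of calling fsync on the parent directory
after mkdir. The sketch below shows only that pattern; mkdir_durable is a
hypothetical helper name, and we have not verified how Lustre handles fsync on
a parent directory whose child lives on another MDT.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create `path`, then fsync its parent directory `parent` so that the new
 * directory entry is flushed before returning.
 * Returns 0 on success, -1 on error with errno set by the failing call.
 * Note: if open() or fsync() fails, the directory has already been created. */
int mkdir_durable(const char *parent, const char *path, mode_t mode)
{
	int fd, rc, saved;

	if (mkdir(path, mode) != 0)
		return -1;

	fd = open(parent, O_RDONLY);
	if (fd < 0)
		return -1;

	rc = fsync(fd);
	saved = errno;		/* keep fsync's errno across close() */
	close(fd);
	errno = saved;
	return rc;
}
```

A caller would use it as mkdir_durable("/parent", "/parent/child-0", 0755).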

We do, however, need mkdir to be atomic and not leave partial directory
artifacts on one MDT but not the other. This is the part where we would like to
understand from you whether we are breaking the concurrency behavior here.

Proposed Change:

diff --git a/lustre/target/tgt_handler.c b/lustre/target/tgt_handler.c
index 33b9863bdc..80948b5f7a 100644
--- a/lustre/target/tgt_handler.c
+++ b/lustre/target/tgt_handler.c
@@ -1333,12 +1333,17 @@ static int tgt_blocking_ast(struct ldlm_lock *lock, struct ldlm_lock_desc *desc,
                RETURN(-EINVAL);
        }

+       //
+       //Proposed Change:
+       //Skip the tgt_sync if the corresponding operation is across OSDs and the inode is being updated under an IBITS lock
+       //
        if (flag == LDLM_CB_CANCELING &&
            (lock->l_granted_mode & (LCK_EX | LCK_PW | LCK_GROUP)) &&
            (tgt->lut_sync_lock_cancel == SYNC_LOCK_CANCEL_ALWAYS ||
             (tgt->lut_sync_lock_cancel == SYNC_LOCK_CANCEL_BLOCKING &&
              ldlm_is_cbpending(lock))) &&
-           ((exp_connect_flags(lock->l_export) & OBD_CONNECT_MDS_MDS) ||
+           (((exp_connect_flags(lock->l_export) & OBD_CONNECT_MDS_MDS) &&
+              lock->l_resource->lr_type != LDLM_IBITS) ||
             lock->l_resource->lr_type == LDLM_EXTENT)) {
                __u64 start = 0;
                __u64 end = OBD_OBJECT_EOF;
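
The patch touches only the last clause of the condition. The following
standalone C sketch isolates that clause before and after the change, so the
affected case can be checked in isolation. The function names and enum values
are hypothetical stand-ins, not Lustre definitions.

```c
#include <stdbool.h>

/* Placeholder resource types; values are NOT the real Lustre definitions. */
enum res_type { RES_PLAIN, RES_EXTENT, RES_FLOCK, RES_IBITS };

/* Last clause as it stands today: any MDS-to-MDS export, or an extent lock. */
bool clause_before(bool mds_mds_export, enum res_type type)
{
	return mds_mds_export || type == RES_EXTENT;
}

/* Last clause with the proposed change: MDS-to-MDS exports no longer
 * qualify when the lock is an IBITS lock. */
bool clause_after(bool mds_mds_export, enum res_type type)
{
	return (mds_mds_export && type != RES_IBITS) || type == RES_EXTENT;
}
```

The two functions differ for exactly one input: an IBITS lock held through an
MDS-to-MDS export, where the clause goes from true to false. Every other
combination, including all extent locks, is unchanged.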
