Hello,

   I'm tracking down a very vexing issue.  Somehow users are creating files that 
cause any attempt to examine or manipulate them to hang: they can't be removed, 
they can't be examined, and even an ls command will hang.

An strace of an ls run against some of these files produced the 
following:

getxattr("/volatile/halld/home/haoli/RunPeriod-2017-01/analysis/ver36_Mar27/log/030408/stdout.030408_124.out",
 "system.posix_acl_default", NULL, 0) = -1 ENODATA (No data available)

lstat("/volatile/halld/home/haoli/RunPeriod-2017-01/analysis/ver36_Mar27/log/030408/stderr.030408_118.err",
 {st_mode=S_IFREG|0644, st_size=16979, ...}) = 0

getxattr("/volatile/halld/home/haoli/RunPeriod-2017-01/analysis/ver36_Mar27/log/030408/stderr.030408_118.err",
 "system.posix_acl_access", NULL, 0) = -1 ENODATA (No data available)

getxattr("/volatile/halld/home/haoli/RunPeriod-2017-01/analysis/ver36_Mar27/log/030408/stderr.030408_118.err",
 "system.posix_acl_default", NULL, 0) = -1 ENODATA (No data available)

lstat("/volatile/halld/home/haoli/RunPeriod-2017-01/analysis/ver36_Mar27/log/030408/stdout.030408_000.out",
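
The trace just stops at that last lstat.  Since the client logs below keep 
pointing at a single OST (lustre19-OST0028), one thing I can do is map an 
affected file to its OST objects; a minimal sketch (assuming lfs getstripe, 
which should only need the layout from the MDS, doesn't hang as well):

# Path taken from the strace above; any of the stuck files would do.
F=/volatile/halld/home/haoli/RunPeriod-2017-01/analysis/ver36_Mar27/log/030408/stderr.030408_118.err

# Stripe layout: OST index (obdidx) and object IDs for the file.
lfs getstripe -v "$F"

# Match the obdidx against the OST named in the client logs (lustre19-OST0028).
lfs df | grep OST0028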

On one client that was being used to debug this problem (running CentOS 6.5 
and Lustre 2.5.42) I see the following:
Lustre: 2753:0:(client.c:1920:ptlrpc_expire_one_request()) @@@ Request sent has 
timed out for slow reply: [sent 1585677903/real 1585677903]  
req@ffff880238fdd000 x1657754295856608/t0(0) 
o101->[email protected]@o2ib:28/4 lens 328/400 
e 1 to 1 dl 1585678504 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Lustre: lustre19-OST0028-osc-ffff88105fecd000: Connection to lustre19-OST0028 
(at 172.17.0.99@o2ib) was lost; in progress operations using this service will 
wait for recovery to complete
Lustre: lustre19-OST0028-osc-ffff88105fecd000: Connection restored to 
lustre19-OST0028 (at 172.17.0.99@o2ib)
Lustre: 2733:0:(client.c:304:ptlrpc_at_adj_net_latency()) Reported service time 
465 > total measured time 96
LustreError: 2733:0:(layout.c:2005:__req_capsule_get()) @@@ Wrong buffer for 
field `dlm_rep' (1 of 1) in format `LDLM_ENQUEUE_LVB': 0 vs. 112 (server)
  req@ffff8802149c5800 x1657754295862208/t0(0) 
o101->[email protected]@o2ib:28/4 lens 328/192 
e 0 to 0 dl 1585679147 ref 1 fl Interpret:R/2/0 rc 0/0
LustreError: 2733:0:(layout.c:2005:__req_capsule_get()) Skipped 1 previous 
similar message
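
The __req_capsule_get "Wrong buffer for field dlm_rep" error shows up on this 
old 2.5.42 client, so one thing I want to rule out is a version/interop 
mismatch; a trivial sketch of what I'm checking on each node:

# Exact Lustre version on each client and server:
lctl get_param version

# LNet NIDs, to match nodes against the NIDs in the log messages:
lctl list_nids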

On newer systems (RHEL 7.7 running Lustre 2.10.8-1) the problem also occurs, 
and I see the following in dmesg:
[5437818.727792] Lustre: lustre19-OST0028-osc-ffff886229244000: Connection 
restored to 172.17.0.99@o2ib (at 172.17.0.99@o2ib)
[5438419.769959] Lustre: 2747:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ 
Request sent has timed out for slow reply: [sent 1585678345/real 1585678345]  
req@ffff884834ec0600 x1657002677904272/t0(0) 
o101->[email protected]@o2ib:28/4 lens 344/400 
e 24 to 1 dl 1585678946 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
[5438419.769973] Lustre: lustre19-OST0028-osc-ffff886229244000: Connection to 
lustre19-OST0028 (at 172.17.0.99@o2ib) was lost; in progress operations using 
this service will wait for recovery to complete
[5438419.770435] Lustre: lustre19-OST0028-osc-ffff886229244000: Connection 
restored to 172.17.0.99@o2ib (at 172.17.0.99@o2ib)

df and lfs df commands on these systems aren't hanging.
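
From an affected client I can also look at the import state for the OSC that 
keeps showing up in the logs; a rough sketch (target name taken from the 
messages above):

# Import status and recent state history for the lustre19-OST0028 OSC:
lctl get_param osc.lustre19-OST0028-osc-*.import
lctl get_param osc.lustre19-OST0028-osc-*.state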

On the OSS (we are running Lustre 2.12.1-1 and CentOS 7.6) I'm seeing the 
following:
[Mar31 14:26] LNet: Service thread pid 309349 was inactive for 1121.43s. The 
thread might be hung, or it might only be slow and will resume later. Dumping 
the stack trace for debugging purposes:
[  +0.000007] Pid: 309349, comm: ll_ost00_088 3.10.0-957.10.1.el7_lustre.x86_64 
#1 SMP Tue Apr 30 22:18:15 UTC 2019
[  +0.000002] Call Trace:
[  +0.000014]  [<ffffffffc130d560>] ptlrpc_set_wait+0x500/0x8d0 [ptlrpc]
[  +0.000074]  [<ffffffffc12cafe5>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
[  +0.000039]  [<ffffffffc12ec54b>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
[  +0.000045]  [<ffffffffc16a63fb>] ofd_intent_policy+0x69b/0x920 [ofd]
[  +0.000015]  [<ffffffffc12cbd26>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
[  +0.000038]  [<ffffffffc12f4587>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
[  +0.000046]  [<ffffffffc137c502>] tgt_enqueue+0x62/0x210 [ptlrpc]
[  +0.000072]  [<ffffffffc1380e5a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[  +0.000065]  [<ffffffffc132680b>] ptlrpc_server_handle_request+0x24b/0xab0 
[ptlrpc]
[  +0.000056]  [<ffffffffc132a13c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[  +0.000054]  [<ffffffff8dac1c71>] kthread+0xd1/0xe0
[  +0.000007]  [<ffffffff8e175c1d>] ret_from_fork_nospec_begin+0x7/0x21
[  +0.000009]  [<ffffffffffffffff>] 0xffffffffffffffff
[  +0.000036] LustreError: dumping log to /tmp/lustre-log.1585679216.309349
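
If I'm reading that stack right, the ll_ost thread is stuck in ptlrpc_set_wait 
underneath ldlm_glimpse_locks/ofd_intent_policy, i.e. the OST appears to be 
waiting on a glimpse callback to whichever client holds a lock on the object. 
On the OSS, a sketch of how I plan to look at the lock and export state for 
that OST to try to find the unresponsive client (parameter names as I 
understand them on 2.12):

# DLM lock count for the OST's namespace on the OSS:
lctl get_param ldlm.namespaces.filter-lustre19-OST0028_UUID.lock_count

# Per-client exports for that OST (one entry per client NID):
lctl get_param obdfilter.lustre19-OST0028.exports.*.uuid

# Decode the debug log the server dumped above:
lctl debug_file /tmp/lustre-log.1585679216.309349 /tmp/lustre-log.txt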

Thank you,

Kurt J. Strosahl
System Administrator: Lustre, HPC
Scientific Computing Group, Thomas Jefferson National Accelerator Facility