On Thu, Jul 11, 2024 at 5:24 PM Oleg Drokin <gr...@whamcloud.com> wrote:
> does it ever resume or does it stop-stop? If you have a hard stop after
> which the thing is killed - how long is it?
> Are the writes synchronous? Can you collect lustre debug logs from one
> of the clients with +vfstrace+cache+rpctrace+inode debug mask, maybe
> when the hang happens?

small update on this. we attempted to take a trace today and managed to whittle the process down to two nodes. here are the steps we took:

launch job (2 nodes allocated through slurm)
tail the job log, seems to be starting up

    pdsh -w node[1-2] -l root 'lctl set_param debug_mb=512'
    pdsh -w node[1-2] -l root 'lctl set_param debug +vfstrace+cache+rpctrace+inode'
    pdsh -w node[1-2] -l root 'lctl debug clear'
    pdsh -w node[1-2] -l root 'lctl debug mark'

job runs along for a few minutes, but eventually the log stops while the wall clock moves along
at this point we pull the debug log

    pdsh -w node[1-2] -l root 'lctl debug_kernel /lustre/temp/lustre_debug.${HOSTNAME}.`date +%s`'

this is where things get a little weird.

node2 seemed to dump out 2.5 million lines of logfile (~400 MB) and return.
node1 does not: it dumps out 28k worth of the log and then just hangs.

node1 is still up and responding normally as far as i can tell. there are no errors in dmesg and the filesystem still responds to normal commands.

even though the node seems okay, the job was definitely stalled, so at this point we cancelled the job. i had to leave for the day, but i left the node in the broken state. i'll see if maybe something gets put in the logs or the kernel debug dump completes overnight, but that seems unlikely.

i know this is pretty far into left field and hard to debug at this point, but any suggestions?
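
for anyone who wants to repeat the capture, the steps above could be wrapped roughly like this. it is only a sketch: the pdsh/lctl command lines are copied as written in this message and not re-checked against the lctl documentation, and the names NODES, DRY_RUN, run, arm_debug and dump_debug are made up for illustration. by default it only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch of the capture sequence from this message.
# NODES and DRY_RUN are hypothetical knobs, not part of pdsh or lctl.
NODES="${NODES:-node[1-2]}"

# Print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Step 1: enlarge the debug buffer, set the mask, clear and mark the log.
# Command strings are reproduced verbatim from the message above.
arm_debug() {
    run pdsh -w "$NODES" -l root 'lctl set_param debug_mb=512'
    run pdsh -w "$NODES" -l root 'lctl set_param debug +vfstrace+cache+rpctrace+inode'
    run pdsh -w "$NODES" -l root 'lctl debug clear'
    run pdsh -w "$NODES" -l root 'lctl debug mark'
}

# Step 2: once the job log stops moving, dump the kernel debug log per node.
# The single quotes keep ${HOSTNAME} and `date` unexpanded locally, so they
# are evaluated on each remote node.
dump_debug() {
    run pdsh -w "$NODES" -l root 'lctl debug_kernel /lustre/temp/lustre_debug.${HOSTNAME}.`date +%s`'
}
```

call arm_debug right after the job starts and dump_debug once the log stalls; set DRY_RUN=0 to actually run the commands.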