On Thu, Jul 11, 2024 at 5:24 PM Oleg Drokin <gr...@whamcloud.com> wrote:
> does it ever resume or does it stop-stop? If you have a hard stop after
> which the thing is killed - how long is it?
> Are the writes synchronous? can you collect lustre debug logs from one
> of the clients with +vfstrace+cache+rpctrace+inode debug mask, maybe
> when the hang happens?

small update on this.  we attempted to take a trace today.  we managed
to whittle the process down to two nodes.  here are the steps we took:

launch job (2 nodes allocated through slurm)
tail the job log, seems to be starting up
pdsh -w node[1-2] -l root 'lctl set_param debug_mb=512'
pdsh -w node[1-2] -l root 'lctl set_param debug +vfstrace+cache+rpctrace+inode'
pdsh -w node[1-2] -l root 'lctl debug clear'
pdsh -w node[1-2] -l root 'lctl debug mark'
job runs along for a few minutes, but eventually the log stops while
the wall clock moves along

at this point we pull the debug log:
pdsh -w node[1-2] -l root 'lctl debug_kernel /lustre/temp/lustre_debug.${HOSTNAME}.`date +%s`'
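
for anyone trying to reproduce this, the capture sequence above can be sketched as a small script. this is only a sketch under assumptions: it uses what i believe are the usual lctl spellings (`lctl set_param debug=...`, `lctl clear`, `lctl mark <text>`), which differ slightly from the shorthand in the steps above. the names NODES, MASK, OUTDIR, DRY_RUN, run_on_nodes, start_trace and dump_trace are made up for illustration. with DRY_RUN=1 (the default) it only prints the pdsh commands instead of running them.

```shell
#!/bin/sh
# Sketch of the debug-log capture sequence described in this mail.
# All variable and function names here are illustrative assumptions.
NODES="${NODES:-node[1-2]}"                         # pdsh host list
MASK="${MASK:-+vfstrace +cache +rpctrace +inode}"   # debug flags to add
OUTDIR="${OUTDIR:-/lustre/temp}"                    # where the dumps are written
DRY_RUN="${DRY_RUN:-1}"                             # 1 = print only, do not run

run_on_nodes() {
    # $1 is the command line to run as root on every node in NODES
    if [ "$DRY_RUN" = 1 ]; then
        printf 'pdsh -w %s -l root %s\n' "$NODES" "$1"
    else
        pdsh -w "$NODES" -l root "$1"
    fi
}

start_trace() {
    # enlarge the debug buffer, add the requested flags, then start clean
    run_on_nodes "lctl set_param debug_mb=512"
    run_on_nodes "lctl set_param debug=\"$MASK\""
    run_on_nodes "lctl clear"
    run_on_nodes "lctl mark trace-start"
}

dump_trace() {
    # \$(...) is escaped so hostname and date are evaluated on each node
    run_on_nodes "lctl debug_kernel $OUTDIR/lustre_debug.\$(hostname).\$(date +%s)"
}
```

the idea would be to source the file, call start_trace once the job is starting up, and call dump_trace once the job log stops moving. OUTDIR can be pointed somewhere else if the dump location needs to change.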

this is where things get a little weird
node2 seemed to dump out 2.5 million lines of log (~400 MB) and return
node1 does not; it dumps out 28k worth of the log and then just hangs

node1 is still up and responding normally as far as i can tell.  no
errors in dmesg and the filesystem still responds to normal commands.
even though the node seems okay, the job is definitely stalled

at this point we cancelled the job.  i had to leave for the day, but i
left the node in the broken state.  i'll see if maybe something gets
put in the logs or the kernel debug completes overnight, but seems
unlikely.  i know this is pretty far into left field and hard to debug
at this point, but any suggestions?
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
