On Thu, Jul 18, 2024 at 12:13 PM Oleg Drokin <gr...@whamcloud.com> wrote:
> did the pytorch job complete or was it still killed externally?

the pytorch job did not complete. the behavior we see is that the log entries stop, and then 30 minutes later a pytorch watchdog wakes up and exits all the processes. yesterday we killed the job through slurm after the lctl debug_kernel process hung.

> Is there an eviction in the log?

'grep -i evict' finds nothing in the lustre_debug logs or in the storage console logs.

> perhaps see continuity of the timestamps and what happened right before
> and right after the gap if there is one in the times?
>

i pulled a count of the function calls from the logs; maybe one of these looks off (these are just the ones over 100k, please excuse typos):

$ grep -vh "^$" lustre_debug*.log | cut -f10 -d: | cut -f1 -d\) | sort | uniq -c | sort -n
 105034 lov_io_init
 105034 vvp_io_init
 105035 lov_io_iter_init
 105035 lob_strip_intersects
 105035 osc_cache_writeback_range
 105043 vvp_io_fini
 105050 lov_conf_freeze
 105050 lov_conf_thaw
 294806 osc_attr-update
 294806 osc_page_touch_at
 294814 osc_consume_write_grant
 294815 lov_attr_get_composite
 294816 osc_enter_cache_try
 351044 ll_write_end
 589549 osc_queueu_async_io
 589617 lov_merge_lvm_kms

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
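
for the timestamp-continuity check asked about above, here is a rough sketch of how the gaps could be located. it assumes each lustre_debug line is colon-separated with the epoch timestamp (seconds.microseconds) in the 4th field, which is consistent with the function name landing in field 10 of the cut pipeline but has not been verified against these particular logs. the find_gaps name and the 60 second threshold are made up for illustration. each file should be scanned on its own, since entries from different files are not necessarily in time order.

```python
import sys
from typing import Iterable, Iterator, Tuple


def find_gaps(
    lines: Iterable[str], threshold: float = 60.0
) -> Iterator[Tuple[float, float, str, str]]:
    """Yield (gap_seconds, prev_timestamp, line_before, line_after)
    for every pair of consecutive parsable lines whose timestamps
    differ by more than `threshold` seconds."""
    prev_ts = None
    prev_line = ""
    for raw in lines:
        line = raw.rstrip("\n")
        # Assumed layout: subsys:mask:cpu:timestamp:rest...
        fields = line.split(":", 4)
        if len(fields) < 5:
            continue  # blank or malformed line
        try:
            ts = float(fields[3])
        except ValueError:
            continue  # 4th field is not a timestamp, skip it
        if prev_ts is not None and ts - prev_ts > threshold:
            yield ts - prev_ts, prev_ts, prev_line, line
        prev_ts, prev_line = ts, line


if __name__ == "__main__":
    # Usage: python find_gaps.py lustre_debug*.log
    for path in sys.argv[1:]:
        with open(path, errors="replace") as fh:
            for gap, prev_ts, before, after in find_gaps(fh):
                print(f"{path}: {gap:.1f}s gap after {prev_ts:.6f}")
                print(f"  before: {before}")
                print(f"  after:  {after}")
```

printing the last line before and the first line after each gap should show which function was running when the log entries stopped.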