On Thu, Jul 18, 2024 at 12:13 PM Oleg Drokin <gr...@whamcloud.com> wrote:
> did the pytorch job complete or was it still killed externally?

the pytorch job did not complete.  the behavior we see is that the log
entries stop and then 30 mins later a pytorch watchdog wakes up and
exits all the processes.  yesterday we killed the job through slurm
after the lctl debug_kernel process hung.

> Is there an eviction in the log?

'grep -i evict' finds nothing in the lustre_debug logs or in the
storage console logs

> perhaps see continuity of the timestamps and what happened right before
> and right after the gap if there is one in the times?
>
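
one way to look for such a gap, as a rough sketch only: it assumes the
text format written by lctl debug_kernel, where the 4th colon-separated
field is the timestamp in seconds (the same layout the function-name
field 10 used further down relies on).  find_gaps is a made-up helper
name, and the threshold is whatever gap size counts as suspicious.

```shell
# find_gaps THRESHOLD FILE...
# Sort all non-empty log lines by timestamp (field 4) and print each pair
# of adjacent lines whose timestamps differ by more than THRESHOLD seconds.
find_gaps() {
    threshold=$1; shift
    grep -vh '^$' "$@" | sort -t: -k4,4n | awk -F: -v t="$threshold" '
        NR > 1 && $4 - prev > t {
            printf "gap of %.1f s\nbefore: %s\nafter:  %s\n", $4 - prev, prevline, $0
        }
        { prev = $4; prevline = $0 }'
}
```

for example, "find_gaps 60 lustre_debug*.log" would list every place
where the merged logs go quiet for more than a minute, along with the
last line before and the first line after.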

i pulled a count of the function calls from the logs, maybe one of
these looks off (these are just the ones over 100k).  please excuse
typos

$ grep -vh "^$" lustre_debug*.log | cut -f10 -d: | cut -f1 -d\) | sort |
    uniq -c | sort -n
105034 lov_io_init
105034 vvp_io_init
105035 lov_io_iter_init
105035 lov_stripe_intersects
105035 osc_cache_writeback_range
105043 vvp_io_fini
105050 lov_conf_freeze
105050 lov_conf_thaw
294806 osc_attr_update
294806 osc_page_touch_at
294814 osc_consume_write_grant
294815 lov_attr_get_composite
294816 osc_enter_cache_try
351044 ll_write_end
589549 osc_queue_async_io
589617 lov_merge_lvb_kms
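
the command above prints every function; the cut to "over 100k" is not
part of it.  a sketch of how that filter could be folded in, assuming
the same field layout (function name in field 10).  count_calls is a
made-up helper name, and it cuts at the opening parenthesis so the
names come out bare:

```shell
# count_calls MIN FILE...
# Count log lines per function name and keep only the functions that
# appear more than MIN times, smallest count first.
count_calls() {
    min=$1; shift
    grep -vh '^$' "$@" | cut -f10 -d: | cut -f1 -d'(' |
        sort | uniq -c | sort -n | awk -v min="$min" '$1 > min'
}
```

e.g. "count_calls 100000 lustre_debug*.log".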
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org