Interestingly enough, I ran

  $ cut -f4 -d: lustre_debug* | cut -f1 -d. | sort | uniq -c

and while the timeline doesn't show a big gap, the number of records per time interval does. I can clearly see the slope of the job and exactly where it seems to have fallen off.
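In case it's useful, here is a rough variation on the same pipeline (just a sketch, assuming the epoch timestamp really is the fourth colon-separated field as Oleg describes below) that flags any pause of more than a second between consecutive entries:

  $ grep -vh "^$" lustre_debug*.log | cut -f4 -d: | cut -f1 -d. | sort -n | \
      awk 'NR > 1 && $1 - prev > 1 { print prev, "->", $1, "(" $1 - prev "s gap)" } { prev = $1 }'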
I might be able to get the small chunk of the log exported (see the sketch after the quoted thread below for one way to carve it out), but I'll paraphrase the log a little here, starting with many entries that repeat:

  using 4096 grant credits for brw
  osc grant
  stripe kms increasing
  set kms from
  mdt id initial value
  mdt id on ost
  comp id #-1 obj ready
  osc grant

Then it stops and this comes out:

  ll_merge_attr update i_size bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt
  obj ready
  update pending cmd
  advancing bytes
  range unlock
  vvp_io_fini() bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt
  writing x of x to x bytes pos len copied
  stripe kms increasing
  set kms from
  ll_do_tiny_write()
  to take share lov owner
  to release share lov owner
  vfs op:inode
  to take share lov
  to release share lov
  lock match for
  FLD lookup got mds 1
  fld lookup got mds
  CLOSE
  mdc file closed with intent 200
  @@@ matched open
  mdc
  mdc modify rpc slot 1
  @@@ reassign xid
  sending RPC
  set going to sleep for 11 seconds
  completed rpc
  fld lookup got mds 1
  @@@ free open request

After the last command, it then spirals out, repeating until the end of the log:

  vvp_io_init
  lov_io_init
  lov_conf_freeze to take share
  lov_conf_thaw to release share

On Thu, Jul 18, 2024 at 3:24 PM Oleg Drokin <gr...@whamcloud.com> wrote:
>
> On Thu, 2024-07-18 at 13:26 -0400, Michael DiDomenico wrote:
> > there's nothing 'grep -i evict' in the lustre_debug logs or from the
> > storage console logs
>
> hm, ok.
>
> > > perhaps see continuity of the timestamps and what happened right
> > > before and right after the gap if there is one in the times?
> >
> > i pulled a counter from the logs of the function calls, maybe one of
> > these looks off (this is just the ones over 100k), please excuse
> > typos
> >
> > $ grep -vh "^$" lustre_debug*.log | cut -f10 -d: | cut -f1 -d\) |
> >   sort | uniq -c | sort -n
> >  105034 lov_io_init
> >  105034 vvp_io_init
> >  105035 lov_io_iter_init
> >  105035 lov_stripe_intersects
> >  105035 osc_cache_writeback_range
> >  105043 vvp_io_fini
> >  105050 lov_conf_freeze
> >  105050 lov_conf_thaw
> >  294806 osc_attr_update
> >  294806 osc_page_touch_at
> >  294814 osc_consume_write_grant
> >  294815 lov_attr_get_composite
> >  294816 osc_enter_cache_try
> >  351044 ll_write_end
> >  589549 osc_queue_async_io
> >  589617 lov_merge_lvb_kms
>
> that's not really going to do anything useful, there's a timestamp in
> unix time as the fourth field (separated with colons), see if there are
> gaps there.
> I imagine there's going to be real dense (time-wise) activity, then an
> RPC is prepared and sent (Sending RPC ...), and then a lot sparser
> activity (perhaps with multi-second pauses), and then eventually it'll
> pick up after getting a server response, for example?
> Though none of that explains why lctl would hang I guess, but still
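If it helps in the meantime, while I see about getting that chunk exported, something along these lines should carve out the window around the stall. Just a sketch: the "going to sleep" string comes from my paraphrase above, so the exact text in the real log may differ, and stall_window.log is only an example output name.

  $ grep -hi -B 50 -A 200 "going to sleep" lustre_debug*.log > stall_window.log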