interestingly enough i ran

cut -f4 -d: lustre_debug* | cut -f1 -d. | sort | uniq -c

and while the timeline doesn't show a big gap, the number of records
per time interval does change noticeably.  i can clearly see the slope
of the job and exactly where it seems to have fallen off
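
a quick way to flag that drop-off automatically (rough sketch, assuming
the uniq -c output above is "count timestamp" pairs) is to compare each
one-second bucket against the previous one:

cut -f4 -d: lustre_debug* | cut -f1 -d. | sort | uniq -c | \
  awk 'prev && $1 < prev/10 { print "drop at", $2, prev, "->", $1 } { prev = $1 }'

anything it prints is a second where the record rate fell to under a
tenth of the previous second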

i might be able to get a small chunk of the log exported, but i'll
paraphrase the log a little below

starting with many entries that repeat

using 4096 grant credits for brw
osc grant
stripe kms increasing
set kms from
mdt id initial value
mdt id on ost
comp id #-1
obj ready
osc grant

then it stops and this comes out

ll_merge_attr update i_size
bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt
obj ready update pending cmd
advancing bytes
range unlock
vvp_io_fini()
bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt
writing x of x to x bytes
pos len copied
stripe kms increasing
set kms from
ll_do_tiny_write()
to take share lov owner
to release share lov owner
vfs op:inode
to take share lov
to release share lov
lock match for
FLD lookup got mds 1
fld lookup got mds
CLOSE
mdc file closed with intent 200
@@@ matched open mdc
mdc modify rpc slot 1
@@@ reassign xid
sending RPC
set going to sleep for 11 seconds
completed rpc
fld lookup got mds 1
@@@ free open request

after the last entry, it then spirals out, repeating until the end of the log

vvp_io_init
lov_io_init
lov_conf_freeze to take share
lov_conf_thaw to release share
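
fwiw, to quantify that loop (rough sketch, reusing the same timestamp
field as the cut above; the function names may need adjusting to match
the real log), something like

grep -hE 'vvp_io_init|lov_io_init|lov_conf_freeze|lov_conf_thaw' lustre_debug* | \
  cut -f4 -d: | cut -f1 -d. | sort | uniq -c | tail -20

should show the per-second rate of that cycle; if the counts stay flat
and high all the way to the end of the log it's spinning in a tight
retry loop rather than making progress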

On Thu, Jul 18, 2024 at 3:24 PM Oleg Drokin <gr...@whamcloud.com> wrote:
>
> On Thu, 2024-07-18 at 13:26 -0400, Michael DiDomenico wrote:
> > there's nothing from 'grep -i evict' in the lustre_debug logs or from
> > the storage console logs
>
> hm, ok.
>
> > > perhaps see continuity of the timestamps and what happened right
> > > before
> > > and right after the gap if there is one in the times?
> > >
> >
> > i pulled counts of the function calls from the logs, maybe one of
> > these looks off (this is just the ones over 100k), please excuse
> > typos
> >
> > $ grep -vh "^$" lustre_debug*.log | cut -f10 -d: | cut -f1 -d\) |
> > sort | uniq -c | sort -n
> > 105034 lov_io_init
> > 105034 vvp_io_init
> > 105035 lov_io_iter_init
> > 105035 lov_stripe_intersects
> > 105035 osc_cache_writeback_range
> > 105043 vvp_io_fini
> > 105050 lov_conf_freeze
> > 105050 lov_conf_thaw
> > 294806 osc_attr_update
> > 294806 osc_page_touch_at
> > 294814 osc_consume_write_grant
> > 294815 lov_attr_get_composite
> > 294816 osc_enter_cache_try
> > 351044 ll_write_end
> > 589549 osc_queue_async_io
> > 589617 lov_merge_lvb_kms
>
> that's not really going to do anything useful, there's a timestamp in
> unix time as the fourth field (separated with colons), see if there are
> gaps there.
> I imagine there's going to be really dense (time-wise) activity, then
> an RPC is prepared and sent (Sending RPC ....), and then a lot sparser
> activity (perhaps with multi-second pauses), and then eventually it'll
> pick up after getting a server response, for example?
> Though none of that explains why lctl would hang I guess, but still
>
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
