On Thu, 2024-07-18 at 09:55 -0400, Michael DiDomenico via lustre-discuss wrote:
> > In general debug logs are battle tested enough they should be robust
> > in face of anything and not get stuck even if other parts of the
> > system are unhappy, but if there's say a memory corruption that
> > affects one of its structures, that might make it get stuck.
> 
> turns out when i came in this morning, the stuck node has written out
> 200mb of data.  unfortunately i'm not entirely sure what i'm looking
> for and i can't export the data even if you wanted to see it :(

Did the PyTorch job complete, or was it still killed externally?
Is there an eviction in the log?
Perhaps check the continuity of the timestamps: if there is a gap in
the times, look at what happened right before and right after it.
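
If it helps, here is a rough sketch of how both checks could be
scripted on the node itself, so nothing has to be exported. It is only
an illustration, not a Lustre tool. It assumes the dump is the usual
text form where each line starts with colon-separated fields and the
fourth field is the time as seconds.microseconds; the function name
scan() and the 30 second threshold are made up for this example, so
adjust them to whatever your log actually looks like.

```python
import re
import sys

# Assumed line layout of a text debug dump (an assumption, check your file):
#   subsys:mask:cpu:seconds.microseconds:...:(file.c:line:func()) message
TS_FIELD = re.compile(r"^[0-9a-fA-F]+:[0-9a-fA-F]+:[^:]*:(\d+\.\d+):")


def scan(lines, gap_seconds=30.0):
    """Return (gaps, evictions) found in an iterable of log lines.

    gaps      -- list of (delta, line_before, line_after) for consecutive
                 timestamped lines more than gap_seconds apart
    evictions -- list of lines containing "evict" (case-insensitive)
    """
    gaps, evictions = [], []
    prev_ts, prev_line = None, None
    for raw in lines:
        line = raw.rstrip("\n")
        if "evict" in line.lower():
            evictions.append(line)
        m = TS_FIELD.match(line)
        if not m:
            continue  # line without a recognizable timestamp, skip it
        ts = float(m.group(1))
        if prev_ts is not None and ts - prev_ts > gap_seconds:
            gaps.append((ts - prev_ts, prev_line, line))
        prev_ts, prev_line = ts, line
    return gaps, evictions


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python3 scan_log.py /path/to/debug.log
    with open(sys.argv[1], errors="replace") as f:
        found_gaps, found_evictions = scan(f)
    for delta, before, after in found_gaps:
        print(f"gap of {delta:.1f}s")
        print(f"  before: {before}")
        print(f"  after:  {after}")
    print(f"{len(found_evictions)} line(s) mentioning eviction")
    for line in found_evictions:
        print(f"  {line}")
```

The lines printed around each gap are the ones worth reading first; an
eviction message close to one of them would be a strong hint.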

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
