On Wed, Jul 17, 2024 at 10:01 PM Oleg Drokin <gr...@whamcloud.com> wrote:
> Are the nodes synchronizing the job? Aka when one is stuck that impacts
> the other from progressing further?

yes, i believe the way pytorch works is that if one of the processes
fails to write out a checkpoint, they all wait.  but i'm not a pytorch
expert, so...

> In general debug logs are battle tested enough they should be robust in
> face of anything and not get stuck even if other parts of the system
> are unhappy, but if there's say a memory corruption that affects one of
> its structures, that might make it get stuck.

turns out when i came in this morning, the stuck node had written out
200 MB of data.  unfortunately i'm not entirely sure what i'm looking
for, and i can't export the data even if you wanted to see it :(
