On Wed, Jul 17, 2024 at 10:01 PM Oleg Drokin <gr...@whamcloud.com> wrote:
> Are the nodes synchronizing the job? Aka when one is stuck that impacts
> the other from progressing further?

Yes, I believe the way PyTorch works is that if one of the processes fails to write out a checkpoint, they all wait. But I'm not a PyTorch expert, so...

> In general debug logs are battle tested enough they should be robust in
> face of anything and not get stuck even if other parts of the system
> are unhappy, but if there's say a memory corruption that affects one of
> its structures, that might make it get stuck.

Turns out when I came in this morning, the stuck node had written out 200 MB of data. Unfortunately I'm not entirely sure what I'm looking for, and I can't export the data even if you wanted to see it :(

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org