On Thu, Jul 11, 2024 at 5:24 PM Oleg Drokin <gr...@whamcloud.com> wrote:
> does it ever resume or does it stop-stop? If you have a hard stop after
> which the thing is killed - how long is it?
> Are the writes synchronous? Can you collect lustre debug logs from one
> of the clients with the +vfstrace+cache+rpctrace+inode debug mask,
> maybe when the hang happens?

it stop-stops. there's a 30 min timeout inside pytorch which gets
triggered if we leave it be

i'll see if i can grab a debug log, i haven't had to do that in a long
time, not sure if i recall exactly how.
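
for my own notes, a rough sketch of how the capture could go, wrapping
lctl from python. the debug flags are the ones Oleg listed above; the
buffer size, the output path and the helper names are my own assumptions,
and it has to run as root on the affected client.

```python
import subprocess

# Debug flags requested in the thread; everything else here is an assumption.
DEBUG_FLAGS = ["vfstrace", "cache", "rpctrace", "inode"]


def setup_commands(flags=DEBUG_FLAGS, buffer_mb=1024):
    """Commands to run before reproducing the hang."""
    mask = " ".join("+" + flag for flag in flags)
    return [
        # Add the requested flags to the current debug mask.
        ["lctl", "set_param", f"debug={mask}"],
        # Enlarge the in-kernel debug buffer (size is an assumed value).
        ["lctl", "set_param", f"debug_mb={buffer_mb}"],
        # Drop whatever is already sitting in the buffer.
        ["lctl", "clear"],
    ]


def dump_command(path="/tmp/lustre-debug.log"):
    """Command to run once the hang is observed (output path is hypothetical)."""
    return ["lctl", "dk", path]


def run(commands):
    """Run each command in order, stopping on the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Intended use, as root on the client:
#   run(setup_commands())      # before starting the job
#   ... wait for the hang ...
#   run([dump_command()])      # while the writers are stuck
```

the dump should be taken while the processes are still hung, before the
30 min timeout kills them, otherwise the buffer may have wrapped.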

> How many files are there? I assume there's only a limited number of
> processes per node?

the smallest test i believe we've been able to reproduce it on is 1
node with 8 gpus, so in theory there would be eight files, one for
each process

> Were obvious things like "a bunch of nodes writing into the same file
> in O_APPEND mode" already eliminated? (or not in O_APPEND, but
> doing truncates in between)

i can't answer that directly as i don't understand the code.  but my
understanding was that each "process" in the run writes out its own
file and that it's not "appended".  the process halts and then dumps
out its memory to a file and resumes
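
to make that pattern concrete, here is a standalone sketch of what i
think is going on: one file per process, opened fresh (no O_APPEND) and
written in one go. this is NOT the pytorch code, just my approximation
of it; the file names, the sizes, the fsync and the helper names are
all assumptions. it might be useful as a reproducer outside pytorch.

```python
import os
from multiprocessing import Pool


def dump_rank(directory, rank, nbytes):
    """Write one rank's dump to its own file and return the path."""
    path = os.path.join(directory, f"rank{rank}.bin")  # hypothetical naming
    # "wb" creates or truncates the file; nothing is opened with O_APPEND.
    with open(path, "wb") as f:
        f.write(os.urandom(nbytes))   # stand-in for the process's memory
        f.flush()
        os.fsync(f.fileno())          # assumed; the real job may not sync
    return path


def dump_all(directory, nprocs=8, nbytes=1 << 20):
    """Run nprocs writers at once, one file each (8 matches our smallest test).

    Call this from under an `if __name__ == "__main__":` guard.
    """
    jobs = [(directory, rank, nbytes) for rank in range(nprocs)]
    with Pool(nprocs) as pool:
        return pool.starmap(dump_rank, jobs)
```

pointing `directory` at the lustre mount and bumping `nbytes` towards the
real dump size would show whether plain per-process writes hang on their
own.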

> Also what version are you running?

of lustre or pytorch?  lustre is 2.15 on the clients and 2.12 on the
servers.  pytorch i'm not sure, the devs use venvs to set up their work
environment, but i'm going to presume it's likely the latest code
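
something like the following, run inside the devs' venv on a client,
should pin both versions down rather than me guessing. it's a sketch:
the helper names are mine, and it assumes the package is installed under
the name "torch" and that lctl is on the PATH.

```python
import subprocess
from importlib import metadata


def package_version(name):
    """Installed version of a python package in this environment, or None."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def lustre_client_version():
    """Version string reported by the local lustre client, or None."""
    try:
        out = subprocess.run(
            ["lctl", "get_param", "-n", "version"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # lctl missing or lustre modules not loaded on this host
        return None
    return out.stdout.strip()


# Intended use:
#   print("pytorch:", package_version("torch"))
#   print("lustre client:", lustre_client_version())
```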
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org