Dear Arko,

Arko Roy via slurm-users <slurm-users@lists.schedmd.com> writes:

> I want to run 50 sequential jobs (essentially 50 copies of the same code with
> different input parameters) on a particular node. However, as soon as one of
> the jobs gets executed, the other 49 jobs are killed immediately with exit
> code 9. The jobs are not interacting and are strictly parallel. However, if
> the 50 jobs run on 50 different nodes, they run successfully.
> Can anyone please help with possible fixes?
> I see a discussion along similar lines in
> https://groups.google.com/g/slurm-users/c/I1T6GWcLjt4
> but could not find a final solution there.

If the jobs are independent, why do you want to run them all on the same
node?

If you do have problems when jobs run on the same node, there may be an
issue with all the jobs trying to access a single resource, such as a
file.  However, you will probably need to show your job script before
anyone can work out what is going on.

Regards

Loris

> -- 
> Arko Roy
> Assistant Professor
> School of Physical Sciences
> Indian Institute of Technology Mandi
> Kamand, Mandi
> Himachal Pradesh - 175 005, India
> Email: a...@iitmandi.ac.in
> Web: https://faculty.iitmandi.ac.in/~arko/
-- 
Dr. Loris Bennett (Herr/Mr)
FUB-IT, Freie Universität Berlin

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com