Hi Jean-Mathieu,

I'd also recommend that you update to 17.11.12. I had issues with job arrays in 17.11.7, such as tasks erroneously being held as "DependencyNeverSatisfied", that I'm pleased to report I have not seen in .12.
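In case it helps anyone searching the archives later: stuck tasks like that show up by their pending reason. A rough sketch from memory (squeue's %r format field prints the reason):

$ squeue -u $USER -t PD -o "%.18i %.10T %r" | grep DependencyNeverSatisfied
$ scancel <jobid>    # such tasks will never run, so cancel them (or fix the dependency)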
Best,
Lyn

On Fri, Jan 11, 2019 at 8:13 AM Jean-mathieu CHANTREIN <jean-mathieu.chantr...@univ-angers.fr> wrote:

> You don't put any limits on your master nodes?
>
> To answer my own question: I only had to set the PropagateResourceLimits parameter in slurm.conf to NONE. This is not a problem, since I enable cgroups directly on each of the compute nodes.
>
> Regards,
> Jean-Mathieu
>
> ------------------------------
>
> From: "Jean-Mathieu Chantrein" <jean-mathieu.chantr...@univ-angers.fr>
> To: "Slurm User Community List" <slurm-users@lists.schedmd.com>
> Sent: Friday, January 11, 2019 15:55:35
> Subject: Re: [slurm-users] Array job execution trouble: some jobs in the array fail
>
> Hello Jeffrey,
>
> That's exactly it. Thank you very much; I would not have thought of that. I had in fact set an nproc limit of 20 in /etc/security/limits.conf to prevent potential misuse by some users. I had not imagined for one second that it could propagate to the compute nodes!
>
> You don't put any limits on your master nodes?
>
> In any case, your help has been particularly useful to me. Thanks a lot again.
>
> Best regards,
> Jean-Mathieu
>
> ------------------------------
>
> From: "Jeffrey Frey" <f...@udel.edu>
> To: "Slurm User Community List" <slurm-users@lists.schedmd.com>
> Sent: Friday, January 11, 2019 15:27:13
> Subject: Re: [slurm-users] Array job execution trouble: some jobs in the array fail
>
> What does ulimit tell you on the compute node(s) where the jobs are running? The error message you cited arises when a user has reached the per-user process count limit (e.g. "ulimit -u"). If your Slurm config doesn't limit how many jobs a node can execute concurrently (e.g. via oversubscription), then:
>
> - no matter what, you have a race condition here (when/if the process limit is reached);
> - the behavior is skewed toward happening more quickly/easily when your job actually lasts a non-trivial amount of time (e.g. by adding the usleep()).
>
> It's likely you have stringent limits on your head/login node that are getting propagated to the compute environment (see PropagateResourceLimits in the slurm.conf documentation). By default, Slurm propagates all of the ulimits in effect in your submission shell.
>
> E.g.
>
> [frey@login00 ~]$ srun ... --propagate=NONE /bin/bash
> [frey@login00 ~]$ hostname
> r00n56.localdomain.hpc.udel.edu
> [frey@login00 ~]$ ulimit -u
> 4096
> [frey@login00 ~]$ exit
> :
> [frey@login00 ~]$ ulimit -u 24
> [frey@login00 ~]$ srun ... --propagate=ALL /bin/bash
> [frey@login00 ~]$ hostname
> r00n49.localdomain.hpc.udel.edu
> [frey@login00 ~]$ ulimit -u
> 24
> [frey@login00 ~]$ exit
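If the cluster-wide setting can't be changed, propagation can also be disabled per job with sbatch's --propagate option. A minimal, untested sketch of the submission script with that added:

#!/bin/bash
#SBATCH --propagate=NONE    # don't copy the submit shell's ulimits to the job
ulimit -u                   # sanity check: should print the compute node's default
./hello $SLURM_ARRAY_TASK_ID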
> On Jan 11, 2019, at 4:51 AM, Jean-mathieu CHANTREIN <jean-mathieu.chantr...@univ-angers.fr> wrote:
>
> Hello,
>
> I'm new to Slurm (I used SGE before) and new to this list. I'm having some difficulty with Slurm's job arrays; maybe you can help me?
>
> I am working with Slurm version 17.11.7 on Debian testing. I use slurmdbd and fairshare.
>
> For my current user, I have the following limits:
>
> Fairshare = 99
> MaxJobs = 50
> MaxSubmitJobs = 100
>
> I wrote a little C++ hello_world program to run some tests, and a 100-task hello_world array job works properly.
>
> If I take the same program but add a usleep of 10 seconds (to watch the behavior with squeue and to simulate a slightly longer program), some of the array tasks fail (FAILED) with ExitCode 126:0 (output of sacct -j) and WEXITSTATUS 254 (in the slurm log). The proportion of failing tasks varies between runs. Here is the error output of one of these tasks:
>
> $ cat ERR/11617-9
> /var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable
> /var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable
> /var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable
> /var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable
> /var/slurm/slurmd/job11626/slurm_script: fork: Resource temporarily unavailable
>
> Note that I have enough resources to run more than 50 jobs at the same time ...
>
> If I resubmit forcing Slurm to execute only 10 tasks at a time (--array=1-100%10), all tasks succeed. But if I force Slurm to execute 30 tasks at a time (--array=1-100%30), some fail again.
>
> Has anyone ever faced this type of problem? If so, please kindly enlighten me.
>
> Regards,
>
> Jean-Mathieu Chantrein
> In charge of the LERIA computing center
> University of Angers
>
> __________________
> hello_array.slurm
>
> #!/bin/bash
> # hello.slurm
> #SBATCH --job-name=hello
> #SBATCH --output=OUT/%A-%a
> #SBATCH --error=ERR/%A-%a
> #SBATCH --partition=std
> #SBATCH --array=1-100%10
> ./hello $SLURM_ARRAY_TASK_ID
>
> ________________
> main.cpp
>
> #include <iostream>
> #include <unistd.h>
>
> int main(int argc, char** argv) {
>     if (argc < 2) return 1;   // expect the array task ID as argv[1]
>     usleep(10000000);         // sleep 10 s so the task lives long enough to observe
>     std::cout << "Hello world! job array number " << argv[1] << std::endl;
>     return 0;
> }
>
> ::::::::::::::::::::::::::::::::::::::::::::::::::::::
> Jeffrey T. Frey, Ph.D.
> Systems Programmer V / HPC Management
> Network & Systems Services / College of Engineering
> University of Delaware, Newark DE 19716
> Office: (302) 831-6034 Mobile: (302) 419-4976
> ::::::::::::::::::::::::::::::::::::::::::::::::::::::
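For the archives: the cluster-side fix described up-thread boils down to one line in slurm.conf (roughly; the exact rollout depends on the site):

PropagateResourceLimits=NONE

followed by an "scontrol reconfigure". If memory serves, PropagateResourceLimitsExcept=NPROC would be the narrower alternative, keeping the rest of the submit-shell limits while dropping only the process-count one.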