Hello, 

I am operating a small cluster of 8 nodes, each with 20 cores (two 10-core 
CPUs) and 2 GPUs (NVIDIA K80). To date, I have been successfully running CUDA 
code, typically submitting single-CPU, single-GPU jobs to the nodes via Slurm 
using the cons_res plugin with the CR_CPU option. 
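
For reference, I believe the relevant lines of my slurm.conf are roughly the 
following (reproduced from memory, so treat them as approximate): 

   SelectType=select/cons_res 
   SelectTypeParameters=CR_CPU 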

More recently, I have been trying to have multiple MPI processes access a 
single GPU. The issue I am experiencing is that CUDA appears to reserve 
virtual memory covering all available memory (system + GPU) for each MPI 
process that touches the GPU: 
    PID USER   PR  NI  VIRT   RES   SHR  S  %CPU %MEM    TIME+  COMMAND 
  95292 jan    20   0 1115m   14m  9872  R 100.5  0.0  0:02.07  prjmh 
  95295 jan    20   0 26.3g  145m   95m  R 100.5  0.5  0:01.81  prjmh 
  95293 jan    20   0 26.3g  145m   95m  R  98.6  0.5  0:01.80  prjmh 
  95294 jan    20   0 26.3g  145m   95m  R  98.6  0.5  0:01.81  prjmh 

Note: PID 95292 is the master rank, which does not access the GPU; the other 
three processes do. 
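
For context, each worker rank initializes CUDA along the following lines (a 
simplified sketch, not the actual prjmh source; the device index and overall 
structure are assumptions on my part): 

   #include <mpi.h>
   #include <cuda_runtime.h>

   int main(int argc, char **argv)
   {
       MPI_Init(&argc, &argv);

       int rank;
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);

       if (rank != 0) {      /* rank 0 is the master and stays off the GPU */
           cudaSetDevice(0); /* all worker ranks share the same device */
           cudaFree(0);      /* forces CUDA context creation; this is the
                                point at which the ~26.3g of VIRT appears */
       }

       /* ... actual computation ... */

       MPI_Finalize();
       return 0;
   }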

This results in Slurm killing the job: 
slurmstepd: Exceeded job memory limit
slurmstepd: Step 5705.0 exceeded virtual memory limit (83806300 > 29491200), 
being killed
slurmstepd: Step 5705.0 exceeded virtual memory limit (83806300 > 29491200), 
being killed
slurmstepd: Exceeded job memory limit
slurmstepd: Exceeded job memory limit
slurmstepd: Exceeded job memory limit
srun: got SIGCONT
slurmstepd: *** JOB 5705 CANCELLED AT 2018-02-04T13:47:00 *** on compute-0-3
srun: forcing job termination
srun: error: compute-0-3: task 0: Killed
srun: error: compute-0-3: tasks 1-3: Killed
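
(For scale: the 29491200 KB figure is the ~28 GB limit on the job, while 
83806300 KB is roughly 80 GB, which is consistent with the top output above: 
three workers at ~26.3g VIRT each plus the ~1.1g master.) 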

Note: When I log into the node and manually run the program with mpirun -np=20 
a.out, it runs without issues. 

Is there a way to change the Slurm configuration so that it does not kill 
these jobs? I have read through the documentation to some extent, but with my 
limited Slurm knowledge I was unable to work out the right approach. 
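
From my reading, the options below look as though they might be relevant, but 
I cannot judge which (if any) is correct; these are untested guesses based on 
the slurm.conf and cgroup.conf man pages: 

   # slurm.conf: stop enforcing a limit on virtual memory size ... 
   VSizeFactor=0 
   MemLimitEnforce=no 
   JobAcctGatherParams=NoOverMemoryKill 

   # ... and/or enforce real (resident) memory through cgroups instead 
   TaskPlugin=task/cgroup 

   # cgroup.conf 
   ConstrainRAMSpace=yes 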

Thanks very much, Jan 
