Hi Oliver,

I'm not sure if you have checked out the Generic Resource (GRES) configuration. Slurm manages CUDA_VISIBLE_DEVICES well when GRES is configured. Take a look at: https://slurm.schedmd.com/gres.html
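Roughly, the pieces that page asks for look like the following (the node range, GPU count and device paths are only placeholders here, so adjust them to your cluster):

    # slurm.conf (excerpt; other node attributes omitted)
    GresTypes=gpu
    NodeName=node[001-026] Gres=gpu:4

    # gres.conf on each 4-GPU node
    Name=gpu File=/dev/nvidia0
    Name=gpu File=/dev/nvidia1
    Name=gpu File=/dev/nvidia2
    Name=gpu File=/dev/nvidia3

With that in place and the slurmctld/slurmd daemons restarted, a job submitted with --gres=gpu:1 should be bound to one of the four devices and have CUDA_VISIBLE_DEVICES set for it by Slurm.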
I have used the instructions there verbatim and it works (meaning I can
see CUDA_VISIBLE_DEVICES set to the available GPU resources, as per the
job's request).

HTH,
Pavan

On Fri, Apr 7, 2017 at 1:27 AM, Oliver Grant <olivercgr...@gmail.com> wrote:
> Hi Pavan,
>
> freegpus just sets CUDA_VISIBLE_DEVICES, depending on how many GPUs are
> requested. It was created because all jobs were running on GPU ID 0.
>
> Oliver
>
> On Thu, Apr 6, 2017 at 9:13 PM, pavan tc <pavan...@gmail.com> wrote:
>
>> Any reason why you don't want Slurm to manage CUDA_VISIBLE_DEVICES? I
>> guess your program "freegpus" does a little more?
>>
>> On Thu, Apr 6, 2017 at 6:32 AM, Oliver Grant <olivercgr...@gmail.com>
>> wrote:
>>
>>> Hi there,
>>>
>>> I use a bash script to simultaneously submit multiple single-GPU
>>> jobs to a cluster of 18 nodes with 4 GPUs per node:
>>>
>>> #!/bin/bash
>>> #SBATCH -J jobName
>>> #SBATCH --partition=GPU
>>> #SBATCH --get-user-env
>>> #SBATCH --nodes=1
>>> #SBATCH --tasks-per-node=1
>>> #SBATCH --gres=gpu:1
>>>
>>> source /etc/profile.d/modules.sh
>>> export pmemd="srun $AMBERHOME/bin/pmemd.cuda"
>>> export CUDA_VISIBLE_DEVICES=$(/programs/bin/freegpus 1 $SLURM_JOB_ID)
>>> # freegpus uses nvidia-smi to figure out which GPUs are occupied.
>>>
>>> ${pmemd} -O \
>>>     -i eq2.in \
>>>     -o eq2.o \
>>>     -p CPLX_Neut_Sol.prmtop \
>>>     -c eq1.rst7 \
>>>     -r eq2.rst7 \
>>>     -x eq2.nc \
>>>     -ref eq1.rst7
>>>
>>> We recently installed an extra 8 nodes, and when submitting to those
>>> nodes I get four jobs running on a single GPU while the other three
>>> GPUs sit idle. If I wait 30 seconds between submissions, the jobs go
>>> on separate GPUs (the behaviour I want). Submitting the same scripts
>>> to the older nodes works fine. I've reproduced this multiple times.
>>> See a video of the problem here (the quality may be better if you
>>> download it first):
>>>
>>> https://www.dropbox.com/s/ahc39mvsefnvnps/video1.ogv?dl=0
>>>
>>> In the video I show that the output of our program "freegpus" is
>>> fine, but when submitting two jobs to node015, they both go on the
>>> same GPU with ID 0. When submitting two jobs to node003, they go on
>>> separate GPUs. I've repeated this behaviour ~10 times. Once in a
>>> while the jobs go straight to running instead of sitting in "PD" for
>>> several seconds; when that happens they do go on separate GPUs on
>>> node015.
>>>
>>> It seems like a Slurm bug, so I thought I'd post here.
>>> Any ideas?
>>>
>>> Oliver
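P.S. Once GRES is configured, you shouldn't need the manual export at all. A rough sketch of your submit script without it (keeping your file names; untested on my side), letting Slurm pick the GPU and set CUDA_VISIBLE_DEVICES for the srun step:

    #!/bin/bash
    #SBATCH -J jobName
    #SBATCH --partition=GPU
    #SBATCH --get-user-env
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=1
    #SBATCH --gres=gpu:1

    source /etc/profile.d/modules.sh

    # No CUDA_VISIBLE_DEVICES export here: with gres.conf in place,
    # Slurm sets it for the step that srun launches.
    srun $AMBERHOME/bin/pmemd.cuda -O \
        -i eq2.in \
        -o eq2.o \
        -p CPLX_Neut_Sol.prmtop \
        -c eq1.rst7 \
        -r eq2.rst7 \
        -x eq2.nc \
        -ref eq1.rst7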