Hi Oliver,

I'm not sure whether you have looked into the Generic Resource (GRES)
configuration. Slurm manages CUDA_VISIBLE_DEVICES well once GRES is
configured.
Try taking a look at: https://slurm.schedmd.com/gres.html

I have followed the instructions there verbatim and it works; that is, I can
see CUDA_VISIBLE_DEVICES set to the GPUs allocated to the job, matching the
job's request.
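
For illustration, here is a minimal sketch of what that setup might look
like. The node names, node count, and device paths below are placeholders,
not your actual layout:

# slurm.conf (controller and compute nodes): declare the GRES type and
# append Gres=gpu:4 to the existing NodeName definitions
GresTypes=gpu
NodeName=node[001-026] Gres=gpu:4

# gres.conf (on each GPU node): map the gpu GRES to the device files
Name=gpu File=/dev/nvidia[0-3]

With that in place, --gres=gpu:1 makes Slurm pick the device and export
CUDA_VISIBLE_DEVICES itself, so the batch script no longer needs the
manual export:

#!/bin/bash
#SBATCH -J jobName
#SBATCH --partition=GPU
#SBATCH --get-user-env
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --gres=gpu:1

source /etc/profile.d/modules.sh
# CUDA_VISIBLE_DEVICES is set by Slurm from the GRES allocation;
# no call to freegpus is needed.
srun $AMBERHOME/bin/pmemd.cuda -O -i eq2.in -o eq2.o \
  -p CPLX_Neut_Sol.prmtop -c eq1.rst7 -r eq2.rst7 -x eq2.nc -ref eq1.rst7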

HTH,
Pavan

On Fri, Apr 7, 2017 at 1:27 AM, Oliver Grant <olivercgr...@gmail.com> wrote:

> Hi Pavan,
>
> freegpus just sets CUDA_VISIBLE_DEVICES, depending on how many GPUs are
> requested. It was created as all jobs were running on GPU ID 0.
>
> Oliver
>
> On Thu, Apr 6, 2017 at 9:13 PM, pavan tc <pavan...@gmail.com> wrote:
>
>> Any reason why you don't want Slurm to manage CUDA_VISIBLE_DEVICES? I
>> guess your program "freegpus" does a little more?
>>
>> On Thu, Apr 6, 2017 at 6:32 AM, Oliver Grant <olivercgr...@gmail.com>
>> wrote:
>>
>>> Hi there,
>>>
>>> I use a bash script to simultaneously submit multiple, single-GPU jobs
>>> to a cluster containing 18 nodes with 4 GPUs per node.
>>>
>>> #!/bin/bash
>>> #SBATCH -J jobName
>>> #SBATCH --partition=GPU
>>> #SBATCH --get-user-env
>>> #SBATCH --nodes=1
>>> #SBATCH --tasks-per-node=1
>>> #SBATCH --gres=gpu:1
>>>
>>> source /etc/profile.d/modules.sh
>>> export pmemd="srun $AMBERHOME/bin/pmemd.cuda "
>>> export CUDA_VISIBLE_DEVICES=$(/programs/bin/freegpus 1 $SLURM_JOB_ID)
>>> # freegpus uses nvidia-smi to figure out which GPUs are occupied.
>>>
>>> ${pmemd} -O \
>>> -i eq2.in \
>>> -o eq2.o \
>>> -p CPLX_Neut_Sol.prmtop \
>>> -c eq1.rst7 \
>>> -r eq2.rst7 \
>>> -x eq2.nc \
>>> -ref eq1.rst7
>>>
>>>
>>> We installed an extra 8 nodes recently and I find when submitting to
>>> those nodes I get four jobs running on a single GPU, while the other three
>>> GPUs are idle. If I wait 30 seconds between submissions, they go on separate
>>> GPUs (the behaviour I want). When I submit the same scripts to the
>>> older nodes, everything works fine. I've reproduced this multiple times. See a
>>> video of the problem here (note the quality may be better if you download
>>> first):
>>>
>>> https://www.dropbox.com/s/ahc39mvsefnvnps/video1.ogv?dl=0
>>>
>>> I'm showing that the output of our program "freegpus" is ok, but when
>>> submitting two jobs to node015, they both go on the same GPU with ID 0.
>>> When submitting two jobs to node003, they go on separate GPUs. I've
>>> repeated this behaviour ~10 times. Once in a while the jobs seem to go
>>> straight to running, instead of hanging around as "PD" for several seconds.
>>> When that happens they do actually go on separate GPUs on node015!
>>>
>>> It seems like a SLURM bug, so I thought I'd post here.
>>> Any ideas?
>>>
>>> Oliver
>>>
