Barry,

Thank you so much for the reply! I'm afraid I need more clarification on this comment:

"CUDA_VISIBLE_DEVICES should only ever contain 1 integer between 0 and 3 if you have 4 GPUs."

We only ever get CUDA_VISIBLE_DEVICES = *0* (12 times), and devices 1, 2, and 3 are *never* used.

Eventually we want to be able to use MPI such that each rank/task uses 1 GPU, but the job spreads its tasks/ranks among the 4 GPUs. Currently it appears we are limited to device 0 only.

*In an MPI context,* I'm not certain about the wrapper-based method provided at the link; I'll need to consult with the developer. (Rough sketches of that wrapper idea, and of an sbatch variant of Test #3, are appended after the quoted thread below.)

Thanks again!
-C

On Sat, Sep 2, 2017 at 10:49 AM, Barry Moore <[email protected]> wrote:

> Charlie,
>
>> % salloc -n 12 -c 2 --gres=gpu:1
>> % srun env | grep CUDA
>> CUDA_VISIBLE_DEVICES=0
>> (12 times)
>>
>> *Is this expected behavior if we have more than 1 GPU available (4 total)
>> for the 12 tasks?*
>
> This is absolutely expected. You only ask for 1 GPU. CUDA_VISIBLE_DEVICES
> should only ever contain 1 integer between 0 and 3 if you have 4 GPUs.
>
> This comment might help you:
> https://bugs.schedmd.com/show_bug.cgi?id=2626#c3
>
> Basically, loop over the tasks you want to run with an index, take the
> index % NUM_GPUS, and use a wrapper like the one in the comment.
>
> - Barry
>
> On Fri, Sep 1, 2017 at 1:29 PM, charlie hemlock <[email protected]> wrote:
>
>> Hello,
>> Can the slurm forum help with these questions, or should we seek help
>> elsewhere?
>>
>> We need help with salloc GPU allocation. Hopefully this clarifies things.
>> Given:
>>
>> % salloc -n 12 -c 2 --gres=gpu:1
>> % srun env | grep CUDA
>> CUDA_VISIBLE_DEVICES=0
>> (12 times)
>>
>> *Is this expected behavior if we have more than 1 GPU available (4 total)
>> for the 12 tasks?*
>>
>> We desire different behavior. *Is there a way to specify an salloc + srun
>> combination to get:*
>>
>> CUDA_VISIBLE_DEVICES=0
>> CUDA_VISIBLE_DEVICES=1
>> CUDA_VISIBLE_DEVICES=2
>> CUDA_VISIBLE_DEVICES=3
>> And so on... (12 total print statements)?
>>
>> such that each task gets 1 GPU, but overall GPU usage is spread out among
>> the 4 available devices (not all on device 0).
>>
>> That way each task is not waiting on device 0 to free up from other
>> tasks, as is currently the case.
>>
>> What are we missing or misunderstanding?
>>
>> - salloc / srun parameter?
>> - slurm.conf or gres.conf setting?
>>
>> Thank you!
>>
>> On Tue, Aug 29, 2017 at 12:27 PM, charlie hemlock <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> We're looking for any advice on an salloc/srun setup that uses 1 GPU per
>>> task but where the job makes use of all available GPUs.
>>>
>>> *Test #1:*
>>>
>>> We desire an salloc and srun such that each task gets 1 GPU, but the GPU
>>> usage for the job is spread out among the 4 available devices. See
>>> gres.conf below.
>>>
>>> % salloc -n 12 -c 2 --gres=gpu:1
>>> % srun env | grep CUDA
>>> CUDA_VISIBLE_DEVICES=0
>>> (12 times)
>>>
>>> Where we desire:
>>>
>>> CUDA_VISIBLE_DEVICES=0
>>> CUDA_VISIBLE_DEVICES=1
>>> CUDA_VISIBLE_DEVICES=2
>>> CUDA_VISIBLE_DEVICES=3
>>> And so on (12 times), such that each task still gets 1 GPU, but usage is
>>> spread out among the 4 available devices (see gres.conf below), not all
>>> on one device (device=0).
>>>
>>> That way each task is not waiting on device 0 to free up, as is
>>> currently the case.
>>>
>>> What are we missing or misunderstanding?
>>>
>>> - salloc / srun parameter?
>>> - slurm.conf or gres.conf setting?
>>>
>>> Also see the additional tests below that illustrate the current behavior:
>>>
>>> *Test #2*
>>>
>>> Here we expect each srun task to get all 4 GPUs.
>>>
>>> % salloc -n 12 -c 2 --gres=gpu:4
>>> % srun env | grep CUDA
>>> CUDA_VISIBLE_DEVICES=0,1,2,3
>>> (12 times)
>>>
>>> This matches expectation.
>>>
>>> *Test #3*
>>>
>>> Another test, where I submit multiple sruns in succession. Here we use a
>>> simple sleepCUDA.py script, which sleeps a few seconds and then prints
>>> $CUDA_VISIBLE_DEVICES.
>>>
>>> % salloc -n 12 -c 2 --gres=gpu:4
>>> % srun --gres=gpu:1 sleepCUDA.py &
>>> % srun --gres=gpu:1 sleepCUDA.py &
>>> % srun --gres=gpu:1 sleepCUDA.py &
>>> % srun --gres=gpu:1 sleepCUDA.py &
>>>
>>> Result:
>>>
>>> CUDA_VISIBLE_DEVICES=0 (jobid 1)
>>> CUDA_VISIBLE_DEVICES=1 (jobid 2)
>>> CUDA_VISIBLE_DEVICES=2 (jobid 3)
>>> CUDA_VISIBLE_DEVICES=3 (jobid 4)
>>> And so on (but not necessarily in 0,1,2,3 order).
>>>
>>> A single srun submission would still only use 1 GPU (device=0), as
>>> before and as expected. This seems like a step in the right direction,
>>> since multiple devices were used, but it is not quite what we want.
>>>
>>> And according to
>>> https://slurm.schedmd.com/archive/slurm-16.05.7/gres.html:
>>>
>>> *“By default, a job step will be allocated all of the generic resources
>>> allocated to the job. [Test #2]*
>>> *If desired, the job step may explicitly specify a different generic
>>> resource count than the job. [Test #3]”*
>>>
>>> To run Test #3 non-interactively, should we look into creating an sbatch
>>> script (with multiple sruns) instead of salloc?
>>>
>>> *OS:* CentOS 7
>>> *Slurm version:* 16.05.6
>>>
>>> *gres.conf*
>>> Name=gpu File=/dev/nvidia0
>>> Name=gpu File=/dev/nvidia1
>>> Name=gpu File=/dev/nvidia2
>>> Name=gpu File=/dev/nvidia3
>>>
>>> *slurm.conf (truncated/partial/simplified)*
>>> NodeName=node1 Gres=gpu:4
>>> NodeName=node2 Gres=gpu:4
>>> NodeName=node3 Gres=gpu:4
>>> NodeName=node4 Gres=gpu:4
>>> GresTypes=gpu
>>>
>>> No cgroup.conf.
>>>
>>> Posting the actual .conf files is not practical due to firewalls.
>>>
>>> Any advice will be greatly appreciated!
>>> Thank you!
>
> --
> Barry E Moore II, PhD
> E-mail: [email protected]
>
> Assistant Research Professor
> Center for Simulation and Modeling
> University of Pittsburgh
> Pittsburgh, PA 15260
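For reference, here is a minimal sketch of the general wrapper idea Barry points to (task index modulo NUM_GPUS); it is not the exact script from the linked bug report. It assumes 4 GPUs per node, that Slurm exports SLURM_LOCALID for every task launched by srun, and that gpu_wrapper.sh and my_app are placeholder names:

#!/bin/bash
# gpu_wrapper.sh (placeholder name): restrict this task to one GPU chosen
# by its node-local task index modulo the number of GPUs on the node.
NUM_GPUS=4   # assumption: 4 GPUs per node, matching gres.conf above
export CUDA_VISIBLE_DEVICES=$(( SLURM_LOCALID % NUM_GPUS ))
exec "$@"    # run the real command with the reduced device list

Usage would look something like the following, allocating all 4 GPUs to the job so every device is available to the wrapper, and assuming the wrapper is executable and visible on the compute nodes:

% salloc -n 12 -c 2 --gres=gpu:4
% srun ./gpu_wrapper.sh ./my_app

Each of the 12 tasks should then see exactly one device, spread round-robin across devices 0-3.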

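Regarding the sbatch question under Test #3, here is a rough sketch of a non-interactive variant. It assumes the same 12-task / 2-CPU / 4-GPU request as the tests above, that sleepCUDA.py is executable in the submission directory, and that test3.sbatch is a placeholder file name:

#!/bin/bash
#SBATCH --ntasks=12
#SBATCH --cpus-per-task=2
#SBATCH --gres=gpu:4

# Launch several 1-GPU job steps in the background, mirroring Test #3,
# then wait for all of them to finish before the job ends.
srun --gres=gpu:1 ./sleepCUDA.py &
srun --gres=gpu:1 ./sleepCUDA.py &
srun --gres=gpu:1 ./sleepCUDA.py &
srun --gres=gpu:1 ./sleepCUDA.py &
wait

This would be submitted with "% sbatch test3.sbatch". Note that each srun step inherits the job's full task count by default; adding -n 1 to each srun would limit each step to a single task.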