Barry,
Thank you so much for the reply!
I'm afraid I need more clarification on this comment:

"CUDA_VISIBLE_DEVICES should only ever contain 1 integer between 0 and 3 if
you have 4 GPUs."

We only ever get
CUDA_VISIBLE_DEVICES = *0* (12 times),
and devices 1, 2, and 3 are *never* used.

Eventually we want to use MPI such that each rank/task uses 1 GPU, but the
job spreads its tasks/ranks among the 4 GPUs.  Currently it appears we are
limited to device 0 only.

*In an MPI context,* I'm not certain about the wrapper-based method
described at the link; I'll need to consult with the developer.
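For reference, my rough understanding of how such a wrapper might look in
our MPI case (a sketch only; rank_gpu_wrap.sh and mpi_app are placeholder
names, and it assumes one srun task per MPI rank inside an allocation
holding all 4 GPUs):

  #!/bin/bash
  # rank_gpu_wrap.sh: map each task to one GPU via its node-local rank.
  # srun exports SLURM_LOCALID for every task it launches.
  export CUDA_VISIBLE_DEVICES=$(( SLURM_LOCALID % 4 ))
  exec "$@"

  % salloc -n 12 -c 2 --gres=gpu:4
  % srun ./rank_gpu_wrap.sh ./mpi_app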

Thanks again!
-C

On Sat, Sep 2, 2017 at 10:49 AM, Barry Moore <[email protected]> wrote:

> Charlie,
>
> % salloc -n 12 -c 2 --gres=gpu:1
>> % srun  env | grep CUDA
>> CUDA_VISIBLE_DEVICES=0
>> (12 times)
>> *Is this expected behavior if we have more than 1 gpu available (4 total)
>> for the 12 tasks?*
>
>
> This is absolutely expected. You only ask for 1 GPU. CUDA_VISIBLE_DEVICES
> should only ever contain 1 integer between 0 and 3 if you have 4 GPUs.
>
> This comment might help you:
> https://bugs.schedmd.com/show_bug.cgi?id=2626#c3
>
> Basically, loop over the tasks you want to run with an index, take the
> index % NUM_GPUS, and use a wrapper like the one in the comment.
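> A rough bash sketch of that idea (not the exact wrapper from the bug
> report; gpu_wrap.sh and my_task.sh are placeholder names), run from inside
> an allocation that holds all 4 GPUs (e.g. salloc -n 12 -c 2 --gres=gpu:4):
>
>   #!/bin/bash
>   # gpu_wrap.sh: pick one GPU from the task index passed as $1, then
>   # run the real program with the remaining arguments.
>   export CUDA_VISIBLE_DEVICES=$(( $1 % 4 ))
>   shift
>   exec "$@"
>
>   # launch 12 single-task steps, spreading them over the 4 GPUs
>   for i in $(seq 0 11); do
>       srun -n 1 -c 2 ./gpu_wrap.sh $i ./my_task.sh &
>   done
>   wait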
>
> - Barry
>
>
> On Fri, Sep 1, 2017 at 1:29 PM, charlie hemlock <[email protected]>
> wrote:
>
>> Hello,
>> Can the slurm forum help with these questions, or should we seek help
>> elsewhere?
>>
>> We need help with salloc GPU allocation.  Hopefully this clarifies
>> things.  Given:
>>
>> % salloc -n 12 -c 2 --gres=gpu:1
>> % srun  env | grep CUDA
>> CUDA_VISIBLE_DEVICES=0
>> (12 times)
>>
>> *Is this expected behavior if we have more than 1 gpu available (4 total)
>> for the 12 tasks?*
>>
>> We desire different behavior.  *Is there a way to specify an salloc+srun
>> to get:*
>>
>> CUDA_VISIBLE_DEVICES=0
>> CUDA_VISIBLE_DEVICES=1
>> CUDA_VISIBLE_DEVICES=2
>> CUDA_VISIBLE_DEVICES=3
>> And so on... (12 print statements in total)?
>>
>> such that each task gets 1 GPU, but overall GPU usage is spread among
>> the 4 available devices (not all on device 0).
>>
>> That way each task is not waiting on device 0 to free up from other
>> tasks, as is currently the case.
>>
>> What are we missing or misunderstanding?
>>
>>    - salloc / srun parameter?
>>    - slurm.conf or gres.conf setting?
>>
>> Thank you!
>>
>>
>> On Tue, Aug 29, 2017 at 12:27 PM, charlie hemlock <
>> [email protected]> wrote:
>>
>>> Hello,
>>>
>>> We're looking for any advice on an salloc/srun setup that uses 1 GPU per
>>> task but where the job makes use of all available GPUs.
>>>
>>>
>>> *Test #1:*
>>>
>>> We desire an salloc and srun such that each task gets 1 GPU, but the GPU
>>> usage for the job is spread out among 4 available devices.  See gres.conf
>>> below.
>>>
>>>
>>>
>>> % salloc -n 12 -c 2 --gres=gpu:1
>>>
>>>
>>>
>>> % srun  env | grep CUDA
>>>
>>> CUDA_VISIBLE_DEVICES=0
>>>
>>> (12 times)
>>>
>>>
>>>
>>> Where we desire:
>>>
>>> CUDA_VISIBLE_DEVICES=0
>>>
>>> CUDA_VISIBLE_DEVICES=1
>>>
>>> CUDA_VISIBLE_DEVICES=2
>>>
>>> CUDA_VISIBLE_DEVICES=3
>>>
>>> And so on (12 times), such that each task still gets 1 GPU, but usage is
>>> spread among the 4 available devices (see gres.conf below), not all on
>>> one device (device 0).
>>>
>>> That way each task is not waiting on device 0 to free up, as is
>>> currently the case.
>>>
>>>
>>> What are we missing or misunderstanding?
>>>
>>>    - salloc / srun parameter?
>>>    - slurm.conf or gres.conf setting?
>>>
>>>
>>>
>>> Also see other additional tests below that illustrate current behavior:
>>>
>>>
>>>
>>> *Test #2*
>>>
>>> Here we believe each srun task will get 4 GPUs.
>>>
>>> % salloc -n 12 -c 2 --gres=gpu:4
>>>
>>> %  srun env | grep CUDA
>>>
>>> CUDA_VISIBLE_DEVICES=0,1,2,3
>>>
>>> (12 times)
>>>
>>>
>>>
>>> This matches expectation.
>>>
>>>
>>>
>>>
>>>
>>> *Test #3*
>>>
>>> Another test, where I submit multiple sruns in succession:
>>>
>>> Here we use a simple sleepCUDA.py script, which sleeps a few seconds
>>> and then prints $CUDA_VISIBLE_DEVICES.
>>>
>>>
>>>
>>> % salloc -n 12 -c 2 --gres=gpu:4
>>>
>>> % srun --gres=gpu:1 sleepCUDA.py &
>>>
>>> % srun --gres=gpu:1 sleepCUDA.py &
>>>
>>> % srun --gres=gpu:1 sleepCUDA.py &
>>>
>>> % srun --gres=gpu:1 sleepCUDA.py &
>>>
>>>
>>>
>>> Result:
>>>
>>> CUDA_VISIBLE_DEVICES=0  (jobid 1)
>>>
>>> CUDA_VISIBLE_DEVICES=1  (jobid 2)
>>>
>>> CUDA_VISIBLE_DEVICES=2  (jobid 3)
>>>
>>> CUDA_VISIBLE_DEVICES=3  (jobid 4)
>>>
>>> And so on (but not necessarily in 0,1,2,3 order)
>>>
>>> Though a single srun submission would still only use 1 GPU (device 0),
>>> as before and as expected.
>>>
>>> This seems like a step in the right direction, since multiple devices
>>> were used, but it is not quite what we wanted.
>>>
>>>
>>> And according to
>>> https://slurm.schedmd.com/archive/slurm-16.05.7/gres.html:
>>>
>>> *“By default, a job step will be allocated all of the generic resources
>>> allocated to the job. [Test #2]*
>>>
>>> *If desired, the job step may explicitly specify a different generic
>>> resource count than the job. [Test #3]”*
>>>
>>>
>>>
>>> To run Test #3 non-interactively, should we look into creating an sbatch
>>> script (with multiple sruns) instead of salloc, along the lines of the
>>> sketch below?
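>>> A rough sketch of what we have in mind (run_task.sh is a placeholder for
>>> our real program):
>>>
>>>   #!/bin/bash
>>>   #SBATCH -n 12
>>>   #SBATCH -c 2
>>>   #SBATCH --gres=gpu:4
>>>   # launch 12 single-task steps; each asks for 1 of the job's 4 GPUs
>>>   for i in $(seq 1 12); do
>>>       srun -n 1 -c 2 --gres=gpu:1 ./run_task.sh &
>>>   done
>>>   wait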
>>>
>>>
>>>
>>>
>>> *OS:* CentOS 7
>>>
>>> *Slurm version:* 16.05.6
>>>
>>>
>>> *gres.conf*
>>>
>>> Name=gpu File=/dev/nvidia0
>>>
>>> Name=gpu File=/dev/nvidia1
>>>
>>> Name=gpu File=/dev/nvidia2
>>>
>>> Name=gpu File=/dev/nvidia3
>>>
>>>
>>>
>>> *slurm.conf (truncated/partial/simplified)*
>>>
>>> NodeName=node1 Gres=gpu:4
>>>
>>> NodeName=node2 Gres=gpu:4
>>>
>>> NodeName=node3 Gres=gpu:4
>>>
>>> NodeName=node4 Gres=gpu:4
>>>
>>> GresTypes=gpu
>>>
>>>
>>>
>>> No cgroup.conf
>>>
>>>
>>>
>>> Posting the actual .conf files is not practical due to firewalls.
>>>
>>>
>>> Any advice will be greatly appreciated!
>>>
>>> Thank you!
>>>
>>
>>
>
>
> --
> Barry E Moore II, PhD
> E-mail: [email protected]
>
> Assistant Research Professor
> Center for Simulation and Modeling
> University of Pittsburgh
> Pittsburgh, PA 15260
>
