Re: [slurm-users] Specify a gpu ID

2021-06-04 Thread Valerio Bellizzomi
On Wed, 2021-06-02 at 22:11 -0700, Ahmad Khalifa wrote: > How to send a job to a particular gpu card using its ID > (0,1,2...etc)? If your GPUs are CUDA I can't help, but if you have OpenCL GPUs then your program can call getDeviceIDs() and select the GPU by number. Starting

Re: [slurm-users] Specify a gpu ID

2021-06-04 Thread Kilian Cavalotti
On Wed, Jun 2, 2021 at 10:13 PM Ahmad Khalifa wrote: > How to send a job to a particular gpu card using its ID (0,1,2...etc)? Well, you can't, because: 1. GPU ids are something of a relative concept: https://bugs.schedmd.com/show_bug.cgi?id=10933 2. requesting specific GPUs is not supported: ht
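As a rough illustration of why GPU ids are relative: inside a job, Slurm's GPU GRES support typically exposes the allocated devices through the CUDA_VISIBLE_DEVICES environment variable, and the indices listed there need not match the physical numbering on the node (particularly when device cgroups are enforced). The Python sketch below reads whatever the job was given instead of assuming a fixed id. The helper name `visible_gpu_ids` is hypothetical and not part of Slurm.

```python
import os


def visible_gpu_ids(environ=None):
    """Return the entries of CUDA_VISIBLE_DEVICES as a list of strings.

    Hypothetical helper for illustration only. An unset or empty
    variable yields an empty list.
    """
    env = os.environ if environ is None else environ
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    return [item.strip() for item in raw.split(",") if item.strip()]


if __name__ == "__main__":
    # Inside a job step this prints the devices the job can actually see.
    print("GPUs visible to this job:", visible_gpu_ids())
```

Run inside an allocation, this shows the ids as the job sees them, which may differ from what nvidia-smi reports on the host.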

Re: [slurm-users] Specify a gpu ID

2021-06-04 Thread Fuzzy Rogers
My only thought here that is a little off-kilter would be to get a stupid do-nothing job assigned to the failing GPU for 100,000 hours… It might take a bit of work - and some to and fro - but “fake occupy” the failing GPU and every other job will maneuver around it. Again - it’s not a great sol

Re: [slurm-users] Specify a gpu ID

2021-06-04 Thread Ahmad Khalifa
Thank you for your input Jason, I wasn't trying to "chide" you in any way. I appreciate your contribution to the discussion. On Fri, Jun 4, 2021 at 11:37 AM Jason Simms wrote: > You don't need to chide me for making what is, to me, a reasonable > solution. *You* may not be able to make hardware

Re: [slurm-users] Specify a gpu ID

2021-06-04 Thread Jason Simms
You don't need to chide me for making what is, to me, a reasonable solution. *You* may not be able to make hardware changes, but why the people who can would want failing GPUs remaining in a system is anathema to my approach to cluster management. In other words, I do not recommend you try to find

Re: [slurm-users] Specify a gpu ID

2021-06-04 Thread Christopher Samuel
On 6/4/21 11:04 am, Ahmad Khalifa wrote: Because there are failing GPUs that I'm trying to avoid. Could you remove them from your gres.conf and adjust slurm.conf to match? If you're using cgroups enforcement for devices (ConstrainDevices=yes in cgroup.conf) then that should render them inacc
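A minimal sketch of the gres.conf approach suggested above, assuming NVIDIA device files named /dev/nvidiaN and a node called node01 (both are placeholders, not taken from the thread). It generates one gres.conf entry per healthy device and reports the count that the node's Gres=gpu:N setting in slurm.conf would need to match. The function name is hypothetical.

```python
def gres_conf_lines(node_name, device_indices, failing):
    """Build gres.conf entries for every device not listed as failing.

    Returns (lines, healthy_count). Device paths follow the assumed
    /dev/nvidiaN naming; adjust for the actual hardware.
    """
    failing_set = set(failing)
    healthy = [i for i in device_indices if i not in failing_set]
    lines = [
        f"NodeName={node_name} Name=gpu File=/dev/nvidia{i}" for i in healthy
    ]
    return lines, len(healthy)


if __name__ == "__main__":
    # Example: a 4-GPU node where device 2 is the failing card.
    lines, count = gres_conf_lines("node01", range(4), failing=[2])
    print("\n".join(lines))
    # The Gres count for this node in slurm.conf must match.
    print(f"# slurm.conf for node01: Gres=gpu:{count}")
```

With ConstrainDevices=yes in cgroup.conf, devices left out of gres.conf should then be inaccessible to jobs, as the message describes.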

Re: [slurm-users] Specify a gpu ID

2021-06-04 Thread Ahmad Khalifa
I can't make hardware changes, but I still want to make use of the cluster. Let's keep the discussion on how to get slurm to do it, if that's possible. On Fri, Jun 4, 2021 at 11:13 AM Jason Simms wrote: > Unpopular opinion: remove the failing GPU. > > JLS > > On Fri, Jun 4, 2021 at 2:07 PM Ahmad

Re: [slurm-users] Specify a gpu ID

2021-06-04 Thread Jason Simms
Unpopular opinion: remove the failing GPU. JLS On Fri, Jun 4, 2021 at 2:07 PM Ahmad Khalifa wrote: > Because there are failing GPUs that I'm trying to avoid. > > On Fri, Jun 4, 2021 at 5:04 AM Stephan Roth > wrote: > >> On 03.06.21 07:11, Ahmad Khalifa wrote: >> > How to send a job to a partic

Re: [slurm-users] Specify a gpu ID

2021-06-04 Thread Ahmad Khalifa
Because there are failing GPUs that I'm trying to avoid. On Fri, Jun 4, 2021 at 5:04 AM Stephan Roth wrote: > On 03.06.21 07:11, Ahmad Khalifa wrote: > > How to send a job to a particular gpu card using its ID (0,1,2...etc)? > > Why do you need to access a GPU based on its ID? > > If it's to sele

Re: [slurm-users] nodes going to down* and getting stuck in, that state

2021-06-04 Thread Brian Andrus
Oh, also ensure DNS is working properly on the node. It could be that it isn't able to map the name of the master to its IP. Brian Andrus On 6/4/2021 9:31 AM, Herc Silverstein wrote: Hi, The slurmctld.log shows (for this node): ... [2021-05-25T00:12:27.481] sched: Allocate JobId=3402729 N
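One way to run the DNS check described above from the affected node is a short Python script using the standard library resolver. This is only a sketch: the helper name is hypothetical, and the controller's hostname must be substituted for the placeholder used here.

```python
import socket


def resolve_host(hostname):
    """Return the IPv4 address for hostname, or None if lookup fails."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


if __name__ == "__main__":
    # "localhost" is a stand-in; use the slurmctld host's name instead.
    name = "localhost"
    address = resolve_host(name)
    if address is None:
        print(f"{name}: could not be resolved")
    else:
        print(f"{name} resolves to {address}")
```

If the controller's name does not resolve on the node, slurmd cannot contact slurmctld by name, which would fit the symptom in the thread.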

Re: [slurm-users] nodes going to down* and getting stuck in, that state

2021-06-04 Thread Brian Andrus
Sounds like a firewall issue. When you log on to the 'down' node, can you run 'sinfo' or 'squeue' there? Also, verify munge is configured/running properly on the node. Brian Andrus On 6/4/2021 9:31 AM, Herc Silverstein wrote: Hi, The slurmctld.log shows (for this node): ... [2021-05-25T00:
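To test the firewall theory without involving Slurm itself, a plain TCP connection attempt can be made from the node to the controller. The sketch below assumes the default SlurmctldPort of 6817 (check slurm.conf if it was changed) and uses a placeholder hostname; the function name is hypothetical.

```python
import socket


def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Placeholder host; 6817 is the default SlurmctldPort.
    controller, port = "localhost", 6817
    state = "reachable" if port_reachable(controller, port) else "blocked or closed"
    print(f"{controller}:{port} is {state}")
```

The same check run in the other direction, from the controller to the node's slurmd port (6818 by default), covers the return path.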

Re: [slurm-users] nodes going to down* and getting stuck in, that state

2021-06-04 Thread Herc Silverstein
Hi, The slurmctld.log shows (for this node): ... [2021-05-25T00:12:27.481] sched: Allocate JobId=3402729 NodeList=gpu-t4-4x-ondemand-44 #CPUs=1 Partition=gpu-t4-4x-ondemand [2021-05-25T00:12:27.482] sched: Allocate JobId=3402730 NodeList=gpu-t4-4x-ondemand-44 #CPUs=1 Partition=gpu-t4-4x-ondem

Re: [slurm-users] Specify a gpu ID

2021-06-04 Thread Stephan Roth
On 03.06.21 07:11, Ahmad Khalifa wrote: How to send a job to a particular gpu card using its ID (0,1,2...etc)? Why do you need to access a GPU based on its ID? If it's to select a certain GPU type, there are other methods you can use. You could create partitions for the same GPU types or add f