On Wed, 2021-06-02 at 22:11 -0700, Ahmad Khalifa wrote:
> How to send a job to a particular gpu card using its ID
> (0,1,2...etc)?
If your GPUs are CUDA I can't help, but if you have OpenCL GPUs then
your program can select a GPU with a call to getDeviceIDs() and pick
the GPU by number.
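For what it's worth, a minimal sketch of that in C (untested; it assumes a
single OpenCL platform, at most 8 GPUs, and that the index you want is 1 -
the C API spelling of the call is clGetDeviceIDs()):

  /* pick an OpenCL GPU by index; error handling omitted for brevity */
  #include <stdio.h>
  #include <CL/cl.h>

  int main(void)
  {
      cl_platform_id platform;
      cl_device_id devices[8];
      cl_uint num_devices;
      int wanted = 1;   /* hypothetical index of the GPU you want */

      clGetPlatformIDs(1, &platform, NULL);
      clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 8, devices, &num_devices);

      if ((cl_uint)wanted < num_devices) {
          char name[256];
          clGetDeviceInfo(devices[wanted], CL_DEVICE_NAME,
                          sizeof(name), name, NULL);
          printf("Using GPU %d: %s\n", wanted, name);
          /* create the context and command queue on devices[wanted] */
      }
      return 0;
  }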
On Wed, Jun 2, 2021 at 10:13 PM Ahmad Khalifa wrote:
> How to send a job to a particular gpu card using its ID (0,1,2...etc)?
Well, you can't, because:
1. GPU IDs are something of a relative concept:
https://bugs.schedmd.com/show_bug.cgi?id=10933
2. requesting specific GPUs is not supported:
ht
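For reference, what Slurm does let you request is a count and, if your
gres.conf defines types, a type - not a particular device ID. A hedged
sketch (the type name here is made up):

  # request any one GPU
  sbatch --gres=gpu:1 job.sh
  # request two GPUs of a particular type, if typed GRES is configured
  sbatch --gres=gpu:rtx2080ti:2 job.sh

Inside the job, Slurm normally exposes only the allocated devices, e.g. via
CUDA_VISIBLE_DEVICES.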
My only thought here, and it is a little off-kilter, would be to get a stupid
do-nothing job assigned to the failing GPU for 100,000 hours… It might take a
bit of work - and some to and fro - but "fake occupy" the failing GPU and every
other job will maneuver around it.
Again - it's not a great solution.
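If someone did want to try it, a rough, untested sketch (node name, time
limit and GPU count are placeholders; since Slurm won't let you target a
specific device, you'd have to check with nvidia-smi which card the blocker
actually landed on, or occupy all GPUs on that node):

  # hypothetical do-nothing "blocker" job
  sbatch --nodelist=gpu-node07 --gres=gpu:1 --time=30-00:00:00 \
         --job-name=hold-bad-gpu --wrap="sleep infinity"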
Thank you for your input, Jason; I wasn't trying to "chide" you in any way.
I appreciate your contribution to the discussion.
On Fri, Jun 4, 2021 at 11:37 AM Jason Simms wrote:
> You don't need to chide me for making what is, to me, a reasonable
> solution. *You* may not be able to make hardware
You don't need to chide me for making what is, to me, a reasonable
solution. *You* may not be able to make hardware changes, but leaving
failing GPUs in a system is anathema to my approach to cluster management,
and I don't see why the people who can make changes would want that.
In other words, I do not recommend you try to find
On 6/4/21 11:04 am, Ahmad Khalifa wrote:
Because there are failing GPUs that I'm trying to avoid.
Could you remove them from your gres.conf and adjust slurm.conf to match?
If you're using cgroups enforcement for devices (ConstrainDevices=yes in
cgroup.conf) then that should render them inaccessible.
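Something like this, perhaps (a sketch only - device paths, node name and
counts are invented for illustration; on a 4-GPU node where /dev/nvidia2 is
the bad card):

  # gres.conf on the affected node: list only the healthy devices
  Name=gpu File=/dev/nvidia0
  Name=gpu File=/dev/nvidia1
  Name=gpu File=/dev/nvidia3

  # slurm.conf: reduce the node's Gres count to match
  # (the node's other parameters stay as they are)
  NodeName=gpu-node07 Gres=gpu:3

  # cgroup.conf: jobs only see the devices they were allocated
  ConstrainDevices=yes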
I can't make hardware changes, but I still want to make use of the cluster.
Let's keep the discussion on how to get Slurm to do it, if that's possible.
On Fri, Jun 4, 2021 at 11:13 AM Jason Simms wrote:
> Unpopular opinion: remove the failing GPU.
>
> JLS
>
> On Fri, Jun 4, 2021 at 2:07 PM Ahmad Khalifa wrote:
Unpopular opinion: remove the failing GPU.
JLS
On Fri, Jun 4, 2021 at 2:07 PM Ahmad Khalifa wrote:
> Because there are failing GPUs that I'm trying to avoid.
>
> On Fri, Jun 4, 2021 at 5:04 AM Stephan Roth wrote:
>
>> On 03.06.21 07:11, Ahmad Khalifa wrote:
>> > How to send a job to a particular gpu card using its ID (0,1,2...etc)?
Because there are failing GPUs that I'm trying to avoid.
On Fri, Jun 4, 2021 at 5:04 AM Stephan Roth wrote:
> On 03.06.21 07:11, Ahmad Khalifa wrote:
> > How to send a job to a particular gpu card using its ID (0,1,2...etc)?
>
> Why do you need to access a GPU based on its ID?
>
> If it's to select a certain GPU type, there are other methods you can use.
On 03.06.21 07:11, Ahmad Khalifa wrote:
How to send a job to a particular gpu card using its ID (0,1,2...etc)?
Why do you need to access a GPU based on its ID?
If it's to select a certain GPU type, there are other methods you can use.
You could create partitions for the same GPU types or add features to the nodes.
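For instance, a hedged sketch (node names and the GPU type "rtx2080ti" are
made up; typed GRES also needs matching Type= lines in gres.conf):

  # slurm.conf
  NodeName=gpu[01-02] Gres=gpu:rtx2080ti:4 Feature=rtx2080ti
  PartitionName=rtx2080ti Nodes=gpu[01-02] Default=NO MaxTime=INFINITE State=UP

  # jobs then pick a type, not a device ID:
  sbatch -p rtx2080ti --gres=gpu:1 job.sh
  # or, using node features instead of a partition:
  sbatch --constraint=rtx2080ti --gres=gpu:1 job.sh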
Hi:
I've not tried to do that. But the below discussion might help:
https://bugs.schedmd.com/show_bug.cgi?id=2626
From: slurm-users On Behalf Of Ahmad Khalifa
Sent: Thursday, June 3, 2021 01:12
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Specify a gpu ID
How to send a job to a particular gpu card using its ID (0,1,2...etc)?