On Wed, 2021-06-02 at 22:11 -0700, Ahmad Khalifa wrote:
> How do I send a job to a particular GPU card using its ID
> (0, 1, 2, etc.)?
If your GPUs are CUDA I can't help, but if you have OpenCL GPUs then
your program can enumerate them with a call to getDeviceIDs() and select
the GPU it wants by number.
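A quick way to see the numbering that getDeviceIDs() will hand back is the
clinfo utility (assuming it is installed on the node); this is only a sketch
of the check, and the example output below is made up:

    # list every OpenCL platform and device with its index
    clinfo -l
    # e.g.  Platform #0: NVIDIA CUDA
    #        +-- Device #0: Tesla T4
    #        +-- Device #1: Tesla T4

The device number shown there is the position of that device in the array
returned by getDeviceIDs(), so "select by number" just means indexing into
that array.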
On Wed, Jun 2, 2021 at 10:13 PM Ahmad Khalifa wrote:
> How do I send a job to a particular GPU card using its ID (0, 1, 2, etc.)?
Well, you can't, because:
1. GPU IDs are something of a relative concept:
https://bugs.schedmd.com/show_bug.cgi?id=10933
2. Requesting specific GPUs is not supported:
ht
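To illustrate point 1 (a sketch; whether you see exactly this depends on your
cgroup setup): with ConstrainDevices=yes each job gets its visible GPUs
renumbered from 0, so the index a job sees is not the physical index on the
node.

    # two single-GPU jobs on the same 4-GPU node
    srun --gres=gpu:1 bash -c 'echo $CUDA_VISIBLE_DEVICES; nvidia-smi -L'
    # job A might print:  0   GPU 0: Tesla T4 (UUID: GPU-aaaa...)
    # job B might print:  0   GPU 0: Tesla T4 (UUID: GPU-bbbb...)

Both jobs see "GPU 0" even though they were given different physical cards,
which is part of why asking for a specific ID from the submission side doesn't
really mean anything.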
My only thought here, and it is a little off-kilter, would be to get a stupid
do-nothing job assigned to the failing GPU for 100,000 hours… It might take a
bit of work - and some to and fro - but if you “fake occupy” the failing GPU,
every other job will maneuver around it.
Again - it’s not a great solution.
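A rough sketch of that placeholder idea (the node name and time limit are made
up, the partition's MaxTime has to allow it, and whether the do-nothing job
really lands on the failing GPU depends on how Slurm hands out the GRES on
that node, so check afterwards):

    sbatch --nodelist=gpu-node-01 --gres=gpu:1 \
           --time=365-00:00:00 --job-name=block-bad-gpu \
           --wrap="sleep infinity"

You would still want to verify from inside the job (e.g. with nvidia-smi -L)
that the card it was given is actually the failing one.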
Thank you for your input, Jason; I wasn't trying to "chide" you in any way.
I appreciate your contribution to the discussion.
On Fri, Jun 4, 2021 at 11:37 AM Jason Simms wrote:
> You don't need to chide me for offering what is, to me, a reasonable
> solution. *You* may not be able to make hardware
You don't need to chide me for offering what is, to me, a reasonable
solution. *You* may not be able to make hardware changes, but why the
people who can would want failing GPUs left in a system is beyond me; it's
anathema to my approach to cluster management. In other words, I do not
recommend you try to find
On 6/4/21 11:04 am, Ahmad Khalifa wrote:
> Because there are failing GPUs that I'm trying to avoid.
Could you remove them from your gres.conf and adjust slurm.conf to match?
If you're using cgroups enforcement for devices (ConstrainDevices=yes in
cgroup.conf) then that should render them inaccessible.
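A sketch of what that could look like if, say, the third of four cards is the
bad one (the node name, GPU type and device paths are invented for
illustration, not taken from the thread):

    # gres.conf on the node - list only the healthy devices
    Name=gpu Type=t4 File=/dev/nvidia0
    Name=gpu Type=t4 File=/dev/nvidia1
    Name=gpu Type=t4 File=/dev/nvidia3

    # slurm.conf - advertise 3 GPUs instead of 4 (other node parameters unchanged)
    NodeName=gpu-node-01 Gres=gpu:t4:3

    # cgroup.conf
    ConstrainDevices=yes

followed by restarting slurmd on that node so the reduced GRES count takes
effect.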
I can't make hardware changes, but I still want to make use of the cluster.
Let's keep the discussion on how to get Slurm to do it, if that's possible.
On Fri, Jun 4, 2021 at 11:13 AM Jason Simms wrote:
> Unpopular opinion: remove the failing GPU.
>
> JLS
>
> On Fri, Jun 4, 2021 at 2:07 PM Ahmad Khalifa wrote:
Unpopular opinion: remove the failing GPU.
JLS
On Fri, Jun 4, 2021 at 2:07 PM Ahmad Khalifa wrote:
> Because there are failing GPUs that I'm trying to avoid.
>
> On Fri, Jun 4, 2021 at 5:04 AM Stephan Roth
> wrote:
>
>> On 03.06.21 07:11, Ahmad Khalifa wrote:
>> > How do I send a job to a particular GPU card using its ID (0, 1, 2, etc.)?
Because there are failing GPUs that I'm trying to avoid.
On Fri, Jun 4, 2021 at 5:04 AM Stephan Roth wrote:
> On 03.06.21 07:11, Ahmad Khalifa wrote:
> > How do I send a job to a particular GPU card using its ID (0, 1, 2, etc.)?
>
> Why do you need to access a GPU based on its ID?
>
> If it's to select a certain GPU type, there are other methods you can use.
Oh, also ensure that DNS is working properly on the node. It could be
that it isn't able to map the master's name to its IP.
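A couple of quick checks from the node itself (the controller hostname below
is a stand-in for whatever your master is called):

    getent hosts slurm-master        # should resolve to the controller's IP
    scontrol ping                    # should report the primary controller as UP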
Brian Andrus
On 6/4/2021 9:31 AM, Herc Silverstein wrote:
Hi,
The slurmctld.log shows (for this node):
...
[2021-05-25T00:12:27.481] sched: Allocate JobId=3402729
NodeList=gpu-t4-4x-ondemand-44 #CPUs=1 Partition=gpu-t4-4x-ondemand
Sounds like a firewall issue.
When you log on to the 'down' node, can you run 'sinfo' or 'squeue' there?
Also, verify munge is configured/running properly on the node.
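For example, the usual checks from the node in question (just a sketch):

    systemctl status munge           # the munge daemon should be active
    munge -n | unmunge               # the credential should decode with STATUS: Success
    sinfo                            # should reach slurmctld and list the partitions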
Brian Andrus
On 6/4/2021 9:31 AM, Herc Silverstein wrote:
Hi,
The slurmctld.log shows (for this node):
...
[2021-05-25T00:12:27.481] sched: Allocate JobId=3402729
NodeList=gpu-t4-4x-ondemand-44 #CPUs=1 Partition=gpu-t4-4x-ondemand
Hi,
The slurmctld.log shows (for this node):
...
[2021-05-25T00:12:27.481] sched: Allocate JobId=3402729
NodeList=gpu-t4-4x-ondemand-44 #CPUs=1 Partition=gpu-t4-4x-ondemand
[2021-05-25T00:12:27.482] sched: Allocate JobId=3402730
NodeList=gpu-t4-4x-ondemand-44 #CPUs=1 Partition=gpu-t4-4x-ondemand
On 03.06.21 07:11, Ahmad Khalifa wrote:
> How do I send a job to a particular GPU card using its ID (0, 1, 2, etc.)?
Why do you need to access a GPU based on its ID?
If it's to select a certain GPU type, there are other methods you can use.
You could create partitions for the same GPU types or add features to the nodes.
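For instance, a sketch of the feature/constraint approach (node names, GPU
counts and feature names are invented for illustration):

    # slurm.conf - tag nodes with a feature describing their GPU type
    NodeName=gpu-node-[01-04] Gres=gpu:4 Feature=t4
    NodeName=gpu-node-[05-08] Gres=gpu:4 Feature=v100

    # submission - ask for a GPU on a node with that feature
    sbatch --gres=gpu:1 --constraint=t4 job.sh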