You don't need to chide me for proposing what is, to me, a reasonable solution. *You* may not be able to make hardware changes, but the idea that the people who can would want failing GPUs to remain in a system is anathema to my approach to cluster management. In other words, I do not recommend that you try to find a workaround for a problem that, in my opinion, is best solved by eliminating the faulty hardware. I understand the impulse, and if there is a simple way to specify a particular GPU, then fine, do that. But again, it goes against treating such resources as generic - nodes and hardware should be thought of as cattle, not pets, and should be managed accordingly. Again, I believe you are trying to solve a problem that should not be yours to solve. Sorry if this irritates you.
JLS

On Fri, Jun 4, 2021 at 2:17 PM Ahmad Khalifa <underoath...@gmail.com> wrote:

> I can't make hardware changes, but I still want to make use of the
> cluster. Let's keep the discussion on how to get slurm to do it, if that's
> possible.
>
> On Fri, Jun 4, 2021 at 11:13 AM Jason Simms <sim...@lafayette.edu> wrote:
>
>> Unpopular opinion: remove the failing GPU.
>>
>> JLS
>>
>> On Fri, Jun 4, 2021 at 2:07 PM Ahmad Khalifa <underoath...@gmail.com>
>> wrote:
>>
>>> Because there are failing GPUs that I'm trying to avoid.
>>>
>>> On Fri, Jun 4, 2021 at 5:04 AM Stephan Roth <stephan.r...@ee.ethz.ch>
>>> wrote:
>>>
>>>> On 03.06.21 07:11, Ahmad Khalifa wrote:
>>>> > How to send a job to a particular gpu card using its ID (0,1,2...etc)?
>>>>
>>>> Why do you need to access a GPU based on its ID?
>>>>
>>>> If its to select a certain GPU type, there are other methods you can
>>>> use.
>>>>
>>>> You could create partitions for the same GPU types or add features.
>>>> Due to our heterogenous nodes with mixed GPU types we do the latter, we
>>>> added a feature for the GPU architectures and one for the GPU types to
>>>> each node.
>>>>
>>>> Cheers,
>>>> Stephan
>>>>
>>>>
>>
>> --
>> *Jason L. Simms, Ph.D., M.P.H.*
>> Manager of Research and High-Performance Computing
>> XSEDE Campus Champion
>> Lafayette College
>> Information Technology Services
>> 710 Sullivan Rd | Easton, PA 18042
>> Office: 112 Skillman Library
>> p: (610) 330-5632
>>
>

--
*Jason L. Simms, Ph.D., M.P.H.*
Manager of Research and High-Performance Computing
XSEDE Campus Champion
Lafayette College
Information Technology Services
710 Sullivan Rd | Easton, PA 18042
Office: 112 Skillman Library
p: (610) 330-5632