Would appreciate any leads on the above query. Thanks in advance.

On Fri, 20 Sept 2024 at 14:31, Minulakshmi S <minulakshm...@gmail.com> wrote:
> Hello,
>
> *Issue 1:*
> I am using Slurm version 24.05.1. My slurmd has a single node to which I
> attach multiple GRES by enabling the oversubscribe feature.
> I am able to use advance reservation of a GRES only via the GRES name
> (tres=gres/gpu:SYSTEM12).
>
> That is, during the reservation period, if another user submits a job
> that names SYSTEM12, Slurm places the job in the queue:
>
> user1@host$ srun --gres=gpu:SYSTEM12:1 hostname
> srun: job 333 queued and waiting for resources
>
> But when another user submits a job without any system name, the job
> runs on that GRES immediately even though it is reserved:
>
> user1@host$ srun --gres=gpu:1 hostname
> mylinux.wbi.com
>
> I can also see GresUsed reported as busy in "scontrol show node -d",
> which means the job is running on the GRES/GPU and not only on CPUs.
>
> In the same way, a job submitted against a Feature ("rev1" in my case)
> also goes through even though that Feature is reserved for other users,
> in a multi-partition Slurm setup.
>
> Snippet of slurm.conf:
> NodeName=cluster01 NodeAddr=cluster Port=6002CPUs=8 Boards=1
> SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 Feature="rev1"
> Gres=gpu:SYSTEM12:1 RealMemory=64171 State=IDLE
>
> *Issue 2:*
> During execution, Slurm prints some extra error messages in the srun
> output:
>
> user1@host$ srun --gres=gpu:1 hostname
> srun: error: extract_net_cred: net_cred not provided
> srun: error: Malformed RPC of type RESPONSE_NODE_ALIAS_ADDRS(3017) received
> srun: error: slurm_unpack_received_msg: [mylinux.wbi.com]:41242] Header lengths are longer than data received
> mylinux.wbi.com
>
> Regards,
> MS
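
A note for anyone checking the quoted slurm.conf snippet: the NodeName line as posted contains `Port=6002CPUs=8` with no space between the two parameters. That may simply be a paste artifact in the mail, but it is cheap to rule out. Below is a minimal Python sketch for spotting fused tokens like that, assuming whitespace-separated key=value pairs. `parse_node_line` is a hypothetical helper, not part of Slurm, and it does not replicate Slurm's own config parser.

```python
def parse_node_line(line):
    """Split a slurm.conf-style node line into (params, suspicious_tokens).

    Hypothetical helper for illustration only: it assumes parameters are
    separated by whitespace and written as key=value.
    """
    params = {}
    suspicious = []
    for token in line.split():
        key, sep, value = token.partition("=")
        if not sep:
            # Token has no "=" at all, so it is not a key=value pair.
            suspicious.append(token)
            continue
        if "=" in value:
            # A second "=" suggests two parameters were fused together.
            suspicious.append(token)
        # Values such as Feature="rev1" may be quoted; drop the quotes.
        params[key] = value.strip('"')
    return params, suspicious


# Node line copied from the quoted message, unchanged.
node_line = (
    'NodeName=cluster01 NodeAddr=cluster Port=6002CPUs=8 Boards=1 '
    'SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 Feature="rev1" '
    'Gres=gpu:SYSTEM12:1 RealMemory=64171 State=IDLE'
)

params, suspicious = parse_node_line(node_line)
print("suspicious tokens:", suspicious)
print("Gres:", params["Gres"])
```

If the real slurm.conf has the space, the suspicious list comes back empty and this can be ignored.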
-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com