[slurm-users] Re: Job submitted to multiple partitions not running when any partition is full

2024-07-09 Thread Paul Raines via slurm-users
Thanks. I traced it to a MaxMemPerCPU=16384 setting on the pubgpu partition. -- Paul Raines (http://help.nmr.mgh.harvard.edu)
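
For context, a minimal sketch of how such a limit interacts with the pending job's request (the partition definition below is illustrative, not the poster's actual slurm.conf; node names are made up):

   # Hypothetical partition definition with a 16 GB-per-CPU cap
   PartitionName=pubgpu Nodes=gpu[01-08] MaxMemPerCPU=16384 State=UP

   # The pending job asks for cpu=5,mem=400G, i.e. 80 GB per CPU, which trips
   # the 16384 MB per-CPU cap and kept the job from starting there.
   # Satisfying 400G at 16 GB per CPU needs at least 25 CPUs, e.g.:
   sbatch --partition=pubgpu --cpus-per-task=25 --mem=400G --gres=gpu:1 job.sh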

[slurm-users] Re: Temporarily bypassing pam_slurm_adopt.so

2024-07-09 Thread Timony, Mick via slurm-users
At HMS we do the same as Paul's cluster and specify the groups we want to have access to all our compute nodes: we allow two groups, representing our DevOps team and our Research Computing consultants, to have access, and then add corresponding sudo rules for each group to allow different command se

[slurm-users] Re: Job submitted to multiple partitions not running when any partition is full

2024-07-09 Thread Timony, Mick via slurm-users
Hi Paul, There could be multiple reasons why the job isn't running, from the user's QOS to your cluster hitting MaxJobCount. This page might help: https://slurm.schedmd.com/high_throughput.html The output of the following command might help: scontrol show job 465072 Regards -- Mick Timony Se
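
For anyone following along, a hedged sketch of the checks implied above (the commands are standard Slurm tools; the exact fields to inspect are suggestions):

   # Why is the job pending? JobState/Reason are the key fields:
   scontrol show job 465072 | grep -E 'JobState|Reason|Partition|QOS'

   # Is the controller near its job-count ceiling?
   scontrol show config | grep MaxJobCount

   # Any QOS limits that could hold the job back?
   sacctmgr show qos format=Name,MaxJobsPU,MaxTRESPU,GrpTRES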

[slurm-users] Re: Temporarily bypassing pam_slurm_adopt.so

2024-07-09 Thread Paul Edmon via slurm-users
We do this by adding groups/users to /etc/security/access.conf That should grant normal ssh access assuming you still have pam_access.so still in your sshd config.  Note that if the user has a job on the node, slurm will still shunt them into that job even with the access.conf setting.  So when
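
A minimal sketch of the setup being described, assuming a stock sshd PAM stack (the group names and exact control flags are illustrative and vary by site):

   # /etc/security/access.conf -- allow the listed groups, deny everyone else
   + : (hpc-admins) : ALL
   + : (rc-consultants) : ALL
   - : ALL : ALL

   # /etc/pam.d/sshd (account section) -- with both modules 'required',
   # pam_slurm_adopt still runs for allowed users, which matches the note
   # above that users with a job on the node are still adopted into it.
   account    required      pam_access.so
   account    required      pam_slurm_adopt.so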

[slurm-users] Job submitted to multiple partitions not running when any partition is full

2024-07-09 Thread Paul Raines via slurm-users
I have a job 465072 submitted to multiple partitions (rtx6000,rtx8000,pubgpu)

   JOBID    PARTITION  PENDING  PRIORITY    TRES_ALLOC|REASON
   4650727  rtx6000    47970    0.00367972  cpu=5,mem=400G,node=1,gpu=1|Priority
   4650727  rtx8000    47970    0.00367972  cpu=5,mem=400G,node=1,gpu=1|Priority
   4650727  pub
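
For reference, a hedged sketch of how such a multi-partition submission and its status check might look (the script name and exact squeue fields are illustrative):

   # Submit one job to several partitions; Slurm runs it in whichever
   # partition can start it first:
   sbatch --partition=rtx6000,rtx8000,pubgpu --cpus-per-task=5 --mem=400G \
          --gres=gpu:1 job.sh

   # Show the job's partitions, priority and pending reason:
   squeue -j 465072 -O JobID,Partition,Priority,Reason,tres-alloc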

[slurm-users] Re: vers 23: slurmctld pb (memory leak and response time)

2024-07-09 Thread LEROY Christine 208562 via slurm-users
Hi all, I'm answering my own question: in fact the memory leak happened when the slurm.conf file was different on the nodes. Sorry for the noise, Have a good day, Christine From: LEROY Christine 208562 via slurm-users Sent: Wednesday 26 June 2024 16:56 To: slurm-users@lists.schedmd.com Cc: BLANCA
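
For anyone hitting the same symptom, a quick way to confirm that slurm.conf has drifted between hosts (the pdsh host list and config path are illustrative; any parallel shell works):

   md5sum /etc/slurm/slurm.conf
   pdsh -w node[01-99] md5sum /etc/slurm/slurm.conf | dshbak -c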

[slurm-users] Custom Plugin Integration

2024-07-09 Thread Bhaskar Chakraborty via slurm-users
Hello, We wish to have a scheduling integration with Slurm. Our own application has a backend system which will decide the placement of jobs across hosts & CPU cores. The backend takes its own time to come back with a placement (which may take a few seconds) & we expect slurm to update it regularl
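
One possible integration pattern that fits this description, sketched with plain Slurm commands rather than a scheduler plugin (this hold/update/release flow is an assumption about what might work, not an established external-scheduler API; node and job IDs are placeholders):

   # Submit the job held so Slurm does not place it on its own
   sbatch --hold --cpus-per-task=4 job.sh

   # ...external backend computes a placement (may take a few seconds)...

   # Pin the job to the chosen host(s) and release it; core-level placement
   # would need additional binding options or a real scheduling plugin.
   scontrol update JobId=<jobid> ReqNodeList=node07
   scontrol release <jobid>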