Thanks. I traced it to a MaxMemPerCPU=16384 setting on the pubgpu
partition.
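For context: the job below asks for mem=400G with cpu=5, i.e. roughly 80 GB per CPU, well above a 16384 MB (16 GB) per-CPU cap, so Slurm will either bump the CPU count or leave the job pending in that partition. A partition line of roughly this shape in slurm.conf would impose the cap (node names here are made up):

    PartitionName=pubgpu Nodes=gpu[01-08] MaxMemPerCPU=16384 State=UP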
-- Paul Raines (http://help.nmr.mgh.harvard.edu)
On Tue, 9 Jul 2024 2:39pm, Timony, Mick wrote:
At HMS we do the same as Paul's cluster and specify the groups we want to have
access to all our compute nodes. We allow two groups, one for our DevOps team
and one for our Research Computing consultants, to have access, and then
corresponding sudo rules for each group allow different command sets.
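As a rough sketch of that split (group names and command lists are made up for illustration):

    # /etc/sudoers.d/cluster-admins (hypothetical groups)
    %devops          ALL=(ALL) ALL
    %rc-consultants  ALL=(ALL) /usr/bin/scontrol, /usr/bin/systemctl restart slurmd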
Hi Paul,
There could be multiple reasons why the job isn't running, from the user's QOS
to your cluster hitting MaxJobCount. This page might help:
https://slurm.schedmd.com/high_throughput.html
The output of the following command might help:
scontrol show job 465072
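The current state and pending reason can also be pulled out quickly with squeue, e.g.:

    squeue -j 465072 -o '%i %P %T %r'    # job id, partition(s), state, reason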
Regards
--
Mick Timony
We do this by adding groups/users to /etc/security/access.conf. That
should grant normal ssh access, assuming you still have pam_access.so
in your sshd PAM config. Note that if the user has a job on the node,
Slurm will still shunt them into that job even with the access.conf
setting. So when
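For reference, a minimal version of that setup might look like the following (group names are made up; the behaviour of shunting the user into their job is what pam_slurm_adopt provides when it sits in the stack):

    # /etc/security/access.conf
    + : (devops) (rc-consultants) : ALL
    - : ALL : ALL

    # /etc/pam.d/sshd (account section; order matters)
    account  sufficient  pam_access.so
    account  required    pam_slurm_adopt.so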
I have a job 465072 submitted to multiple partitions (rtx6000,rtx8000,pubgpu)
JOBID PARTITION PENDING PRIORITY TRES_ALLOC|REASON
4650727 rtx6000 47970 0.00367972 cpu=5,mem=400G,node=1,gpu=1|Priority
4650727 rtx8000 47970 0.00367972 cpu=5,mem=400G,node=1,gpu=1|Priority
4650727 pubgpu
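For reference, a request of that shape would have been submitted with something along these lines (script name is a placeholder):

    sbatch -p rtx6000,rtx8000,pubgpu -N1 -c5 --mem=400G --gres=gpu:1 job.sh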
Hi all,
I'm replying to myself: it turns out the memory leak happened when the slurm.conf
file was different across the nodes.
Sorry for the noise,
Have a good day,
Christine
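A quick way to catch that kind of drift is to compare checksums across the nodes; clush -b folds identical output together so a stray copy stands out (node list and tool choice are just examples):

    clush -b -w node[01-16] md5sum /etc/slurm/slurm.conf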
From: LEROY Christine 208562 via slurm-users
Sent: Wednesday, 26 June 2024 16:56
To: slurm-users@lists.schedmd.com
Cc: BLANCA
Hello,
We wish to have a scheduling integration with Slurm. Our own application has a
backend system which will decide the placement of jobs across hosts & CPU cores.
The backend takes its own time to come back with a placement (which may take a
few seconds) & we expect Slurm to update it regularly
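One pattern that fits an external placement engine like this (a sketch, not the only option; the job id and node name below are placeholders) is to submit jobs held, let the backend decide, then pin and release them:

    sbatch --hold -N1 -c4 job.sh                     # job waits in held state
    # ... backend returns a placement for job 12345 ...
    scontrol update JobId=12345 ReqNodeList=nodeX
    scontrol release 12345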