Re: [slurm-users] srun and --cpus-per-task

2022-03-25 Thread Durai Arasan
Hello all, Thanks for the useful observations. Here are some further env vars: # non-problematic case $ srun -c 3 --partition=gpu-2080ti env SRUN_DEBUG=3 SLURM_JOB_CPUS_PER_NODE=4 SLURM_NTASKS=1 SLURM_NPROCS=1 SLURM_CPUS_PER_TASK=3 SLURM_STEP_ID=0 SLURM_STEPID=0 SLURM_NNODES=1 SLURM_JOB_NUM_NODES

[slurm-users] srun and --cpus-per-task

2022-03-24 Thread Durai Arasan
Hello Slurm users, We are experiencing strange behavior with srun, which executes commands twice, but only when setting --cpus-per-task=1: $ srun --cpus-per-task=1 --partition=gpu-2080ti echo foo srun: job 1298286 queued and waiting for resources srun: job 1298286 has been allocated resources foo foo This i

Re: [slurm-users] [External] Re: srun : Communication connection failure

2022-01-25 Thread Durai Arasan
Hello Mike, Doug: The issue was resolved somehow. My colleagues say the addresses in slurm.conf on the login nodes were incorrect. It could also have been a temporary network issue. Best, Durai Arasan MPI Tübingen On Fri, Jan 21, 2022 at 2:15 PM Doug Meyer wrote: > Hi, > Did you recent

Re: [slurm-users] [External] Re: srun : Communication connection failure

2022-01-21 Thread Durai Arasan
Hello Mike, I am able to ping the nodes from the slurm master without any problem. Actually, there is nothing interesting in slurmctld.log or slurmd.log; you can trust me on this. That is why I posted here. Best, Durai Arasan MPI Tuebingen On Thu, Jan 20, 2022 at 5:08 PM Michael Robbert wrote

Re: [slurm-users] srun : Communication connection failure

2022-01-20 Thread Durai Arasan
Hello slurm users, I forgot to mention that an identical interactive job works successfully on the gpu partitions (in the same cluster). So this is really puzzling. Best, Durai Arasan MPI Tuebingen On Thu, Jan 20, 2022 at 3:40 PM Durai Arasan wrote: > Hello Slurm users, > > We are

[slurm-users] srun : Communication connection failure

2022-01-20 Thread Durai Arasan
ting up to 32 seconds for job step to finish. srun: error: Timed out waiting for job step to complete Best regards, Durai Arasan MPI Tuebingen

[slurm-users] jobs stuck in "CG" state

2021-08-20 Thread Durai Arasan
Hello! We have a huge number of jobs stuck in the CG (completing) state from a user who probably wrote code with bad I/O. "scancel" does not make them go away. Is there a way for admins to get rid of these jobs without draining and rebooting the nodes? I read somewhere that killing the respective slurmstepd proces
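
A common admin workaround, assuming the node itself is still reachable (node name below is hypothetical), is to kill the stuck slurmstepd on the node, or to bounce the node state from the master, which clears completing jobs without a reboot:

    # on the affected compute node: find and kill the step daemon
    pgrep -af slurmstepd
    kill -9 <pid>            # may still hang on uninterruptible I/O
    # or, from the slurm master: mark the node down, then resume it
    scontrol update NodeName=node-1 State=DOWN Reason="stuck CG jobs"
    scontrol update NodeName=node-1 State=RESUME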

Re: [slurm-users] problem with "configless" slurm.conf

2021-07-20 Thread Durai Arasan
I figured it out. slurmd doesn't run on login nodes, so you need an updated copy of slurm.conf on the login nodes. Best, Durai On Tue, Jul 20, 2021, 16:32 Ward Poelmans wrote: > Hi, > > On 20/07/2021 16:01, Durai Arasan wrote: > > > > This is limited to this one nod
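
For reference, the configless documentation also covers machines that run no slurmd: client commands can discover the controller through a DNS SRV record instead of a local slurm.conf. A sketch, with a hypothetical controller host:

    ; in the cluster's DNS zone
    _slurmctld._tcp 3600 IN SRV 0 0 6817 slurm-master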

[slurm-users] problem with "configless" slurm.conf

2021-07-20 Thread Durai Arasan
Hello, We have set up "configless" Slurm by passing the "--conf-server" argument to slurmd on all nodes. More details here: https://slurm.schedmd.com/configless_slurm.html One of the nodes is not able to pick up the configuration: >srun -w slurm-bm-70 --pty bash srun: error: fwd_tree_thread:
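
For context, a minimal configless setup passes the controller address to every slurmd at startup; the file location varies by distro, and the host name below is hypothetical:

    # /etc/default/slurmd (or /etc/sysconfig/slurmd)
    SLURMD_OPTIONS="--conf-server slurm-master:6817"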

Re: [slurm-users] restart user login ONLY

2021-07-20 Thread Durai Arasan
Just submit your job and do your work on the node. Even if it is an interactive job. Keeps your dev/test environment the same as the runtime environment. > Brian Andrus > On 7/19/2021 7:09 AM, Durai Arasan wrote: > > Hello, > > One of our slurm user'

[slurm-users] restart user login ONLY

2021-07-19 Thread Durai Arasan
Hello, One of our Slurm users' accounts is hung with uninterruptible processes. These processes cannot be killed, hence a restart is required. Is it possible to restart the user's login environment alone? I would like to avoid restarting the entire login node. Thanks! Durai Max Planck Institute Tübinge
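
On a systemd-based login node, one candidate is logind, which tears down only that user's sessions (username below is hypothetical); note that processes stuck in uninterruptible D state typically survive even this, since they ignore signals:

    loginctl terminate-user jdoe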

[slurm-users] schedule mixed nodes first

2021-05-14 Thread Durai Arasan
Hi, Frequently all of our GPU nodes (8x GPU each) are in the MIXED state and there is no IDLE node. Some jobs require a complete node (all 8 GPUs), and such jobs therefore have to wait really long before they can run. Is there a way of improving this situation, e.g. by not blocking IDLE nodes with jobs
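
One common approach is static node weights: Slurm allocates the lowest-weight eligible nodes first, so low-weight nodes fill up while high-weight nodes stay idle longer for full-node jobs. A sketch with hypothetical node names:

    # slurm.conf: jobs land on gpu[01-06] first; gpu[07-08] stay free longest
    NodeName=gpu[01-06] Gres=gpu:8 Weight=1
    NodeName=gpu[07-08] Gres=gpu:8 Weight=10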

Re: [slurm-users] associations, limits,qos

2021-01-25 Thread Durai Arasan
Hi, Jobs submitted with sbatch cannot span multiple partitions; when several partitions are specified, the job is submitted to the one where it can start first (from the sbatch reference). Best, Durai On Sat, Jan 23, 2021 at 6:50 AM Nizar Abed wrote: > Hi list, > > I’m trying to enforce limits based on associations, but be
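
For reference, the submission-side syntax is a comma-separated partition list; the job then runs entirely within whichever listed partition can start it first (partition names below are hypothetical):

    sbatch --partition=gpu-2080ti,gpu-v100 job.sh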

[slurm-users] Parent account in AllowAccounts

2021-01-15 Thread Durai Arasan
Hi, As you know, for each partition you can specify AllowAccounts=account1,account2... I have a parent account, say "parent1", with two child accounts, "child1" and "child2". I expected that setting AllowAccounts=parent1 would allow parent1, child1, and child2 to submit jobs to that partition. But unfor
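
Given that behavior, the straightforward workaround is to list the child accounts explicitly on the partition line; a sketch using the account names from the question (partition name hypothetical):

    PartitionName=part-x AllowAccounts=parent1,child1,child2 Nodes=...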

Re: [slurm-users] Partition QOS limit not being enforced

2020-10-26 Thread Durai Arasan
os,safe" > > Best, > > Matt > > > > On Wed, Oct 21, 2020 at 11:22 AM Durai Arasan > wrote: > >> Hello, >> >> We recently created a new partition with the following slurm.conf and QOS >> settings: >> >> *cat /etc/slu

[slurm-users] Partition QOS limit not being enforced

2020-10-21 Thread Durai Arasan
Hello, We recently created a new partition with the following slurm.conf and QOS settings: cat /etc/slurm/slurm.conf | grep part-long PartitionName=part-long Nodes=node-1,node-2,node-3 Default=YES, AllowAccounts=group1,group2 TRESBillingWeights="gres/gpu=22" MaxNodes=1 MaxTime=10-0 QOS=long-10

Re: [slurm-users] CR_Core_Memory behavior

2020-08-26 Thread Durai Arasan
gt; . > > You can also check the oversubscribe on a partition using sinfo -o "%h" > option. > sinfo -o '%P %.5a %.10h %N ' | head > > PARTITION AVAIL OVERSUBSCR NODELIST > > > Look at the sinfo options for further details. > > > Jackie > > O

[slurm-users] CR_Core_Memory behavior

2020-08-25 Thread Durai Arasan
Hello, On our cluster we have SelectTypeParameters set to "CR_Core_Memory". Under these conditions multiple jobs should be able to run on the same node, but they refuse to be allocated on the same node: only one job runs on the node and the rest of the jobs are in pending state. When we changed S
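
A frequent cause of this symptom: with CR_Core_Memory, memory is a consumable resource, and a job that requests no memory can be allocated the node's entire memory by default, blocking co-scheduling. A sketch of the relevant settings (values hypothetical):

    SelectType=select/cons_res
    SelectTypeParameters=CR_Core_Memory
    DefMemPerCPU=4000    # MB; without a default, one job may claim all node memory

Jobs can then size their own requests, e.g. srun --mem=8G or --mem-per-cpu=4G.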

Re: [slurm-users] Automatically stop low priority jobs when submitting high priority jobs

2020-07-09 Thread Durai Arasan
Hi, Please see job preemption: https://slurm.schedmd.com/preempt.html Best, Durai Arasan Zentrum für Datenverarbeitung Tübingen On Tue, Jul 7, 2020 at 6:45 PM zaxs84 wrote: > Hi all. > > Is there a scheduler option that allows low priority jobs to be > immediately paused (or
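
A minimal sketch along the lines of that page, using partition priority with suspend/resume (partition names and values hypothetical):

    # slurm.conf
    PreemptType=preempt/partition_prio
    PreemptMode=SUSPEND,GANG
    PartitionName=low  Nodes=... PriorityTier=1
    PartitionName=high Nodes=... PriorityTier=2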

Re: [slurm-users] fail job

2020-06-30 Thread Durai Arasan
Hi, Can you post the output of the following commands on your master node? sacctmgr show cluster scontrol show nodes Best, Durai Arasan Zentrum für Datenverarbeitung Tübingen On Tue, Jun 30, 2020 at 10:33 AM Alberto Morillas, Angelines < angelines.albe...@ciemat.es> wrote: > Hi, &

[slurm-users] Difference between fairshare and fair-share?

2020-06-25 Thread Durai Arasan
sources used and the resources promised (correct me if I am wrong). Unfortunately, both of these concepts are used to determine priority, which confuses me. Are these two concepts different? Thanks, Durai Arasan Zentrum für Datenverarbeitung Tübingen

[slurm-users] Enforcing GPU-CPU ratios

2020-06-23 Thread Durai Arasan
be automatically managed by configuration instead of being specified by the job submitter, say using the "--cpus-per-gpu" option. Which part of SLURM configuration can be used to enforce this ratio? Thank you, Durai Arasan Zentrum für Datenverarbeitung Tübingen
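
One configuration-side option, assuming a reasonably recent Slurm, is a per-partition default CPU count per GPU; note this supplies a default when the submitter specifies nothing, rather than a hard cap. A sketch:

    PartitionName=gpu Nodes=... DefCpuPerGPU=4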

[slurm-users] How to exclude nodes in sbatch/srun?

2020-06-22 Thread Durai Arasan
another option that can do it? Thanks, Durai Arasan Zentrum für Datenverarbeitung Tübingen
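
For reference, both sbatch and srun accept an exclude list (node names hypothetical):

    sbatch --exclude=node[01-02] job.sh
    srun -x node01 --pty bash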

Re: [slurm-users] [External] Re: ssh-keys on compute nodes?

2020-06-16 Thread Durai Arasan
Thank you. We are planning to put ssh keys on login nodes only and use the PAM module to control access to compute nodes. Will such a setup work? Or, for PAM to work, is it necessary to have the ssh keys on the compute nodes as well? I'm sorry, but this is not clearly mentioned in any documentation..
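
For context, pam_slurm_adopt decides ssh access on a compute node by whether the user has a running job there; it does not depend on where authorized_keys lives, but key-based ssh still needs the user's public key resolvable on the compute node (a shared home filesystem is the usual answer). A sketch of the PAM side, per the pam_slurm_adopt documentation:

    # /etc/pam.d/sshd on each compute node
    account    required    pam_slurm_adopt.so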

Re: [slurm-users] ssh-keys on compute nodes?

2020-06-09 Thread Durai Arasan
e > > [1] https://slurm.schedmd.com/pam_slurm_adopt.html > > > >> On Jun 8, 2020, at 12:01 , Durai Arasan wrote: > >> > >> Hi Jeffrey, > >> > >> Thanks for the clarification. > >> > >> But this is concerning, as the use

Re: [slurm-users] ssh-keys on compute nodes?

2020-06-08 Thread Durai Arasan
every node. Thus, they have the same id_rsa key and authorized_keys file present on all nodes. > On Jun 8, 2020, at 11:42, Durai Arasan wrote: > > Ok, that was useful information. > > So when you provision user accounts, you add th

Re: [slurm-users] ssh-keys on compute nodes?

2020-06-08 Thread Durai Arasan
Ok, that was useful information. So when you provision user accounts, you add the public key to .ssh/authorized_keys of *all* nodes on the cluster? Not just the login nodes? > When we provision user accounts on our Slurm cluster we still add .ssh, > .ssh/id_rsa (needed for older X11 tunnelin

[slurm-users] ssh-keys on compute nodes?

2020-06-08 Thread Durai Arasan
Hi, we are setting up a Slurm cluster and are at the stage of adding the users' ssh keys to the nodes. I thought it would be sufficient to add the ssh keys of the users to only the designated login nodes, but I heard that it is necessary to add them to the compute nodes as well for slurm t

[slurm-users] slurm only looking in "default" partition during scheduling

2020-05-12 Thread Durai Arasan
Hi, We have a cluster with 2 slave nodes. These are the slurm.conf lines describing nodes and partitions: NodeName=slurm-gpu-1 NodeAddr=192.168.0.200 Procs=16 Gres=gpu:2 State=UNKNOWN NodeName=slurm-gpu-2 NodeAddr=192.168.0.124 Procs=1 Gres=gpu:0 State=UNKNOWN PartitionName=gpu Nodes=slurm-gpu
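
As the subject suggests, jobs are considered only for the partition marked Default=YES unless one is named at submission; a sketch using the partition name above:

    sbatch --partition=gpu job.sh    # or: srun -p gpu ...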

[slurm-users] Should there be a different gres.conf for each node?

2020-03-05 Thread Durai Arasan
When configuring a Slurm cluster you need to have a copy of the configuration file slurm.conf on all nodes; these copies are identical. If you need to use GPUs in your cluster, there is an additional configuration file that you need to have on all nodes: gres.conf. My
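
For what it's worth, gres.conf lines can be scoped with a NodeName prefix, so a single gres.conf shared by all nodes can still describe heterogeneous hardware; a sketch with hypothetical node names and device files:

    # gres.conf, identical copy on every node
    NodeName=node[1-2] Name=gpu File=/dev/nvidia[0-1]
    NodeName=node3     Name=gpu File=/dev/nvidia0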