Hello all,
Thanks for the useful observations. Here are some further env vars:
# non problematic case
$ srun -c 3 --partition=gpu-2080ti env
SRUN_DEBUG=3
SLURM_JOB_CPUS_PER_NODE=4
SLURM_NTASKS=1
SLURM_NPROCS=1
SLURM_CPUS_PER_TASK=3
SLURM_STEP_ID=0
SLURM_STEPID=0
SLURM_NNODES=1
SLURM_JOB_NUM_NODES
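For comparison, the problematic case can be dumped the same way; a sketch
(same partition as in the original report, the grep only trims the output,
and note that in the problematic case the command itself runs twice):
# problematic case
$ srun --cpus-per-task=1 --partition=gpu-2080ti env | grep '^SLURM_'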
Hello Slurm users,
We are experiencing strange behavior with srun executing commands twice
only when setting --cpus-per-task=1:
$ srun --cpus-per-task=1 --partition=gpu-2080ti echo foo
srun: job 1298286 queued and waiting for resources
srun: job 1298286 has been allocated resources
foo
foo
This i
Hello Mike, Doug:
The issue was resolved somehow. My colleagues say the addresses in
slurm.conf on the login nodes were incorrect. It could also have been a
temporary network issue.
Best,
Durai Arasan
MPI Tübingen
On Fri, Jan 21, 2022 at 2:15 PM Doug Meyer wrote:
> Hi,
> Did you recent
Hello Mike,
I am able to ping the nodes from the slurm master without any problem.
Actually there is nothing interesting in slurmctld.log or slurmd.log. You
can trust me on this. That is why I posted here.
Best,
Durai Arasan
MPI Tuebingen
On Thu, Jan 20, 2022 at 5:08 PM Michael Robbert wrote
Hello slurm users,
I forgot to mention that an identical interactive job works successfully on
the gpu partitions (in the same cluster). So this is really puzzling.
Best,
Durai Arasan
MPI Tuebingen
On Thu, Jan 20, 2022 at 3:40 PM Durai Arasan wrote:
> Hello Slurm users,
>
> We are
ting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
Best regards,
Durai Arasan
MPI Tuebingen
Hello!
We have a huge number of jobs stuck in CG state from a user who probably
wrote code with bad I/O. "scancel" does not make them go away. Is there a
way for admins to get rid of these jobs without draining and rebooting the
nodes? I read somewhere that killing the respective slurmstepd proces
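For reference, here is a sketch of the approach sometimes suggested for a
single stuck job, run on the affected node (job id, pid and node name are
placeholders, and if the processes are truly uninterruptible a reboot may
still be unavoidable):
# on the compute node: find the slurmstepd that belongs to the stuck job
$ ps -ef | grep 'slurmstepd: \[1234567'
# kill that slurmstepd
$ kill -9 <pid-of-slurmstepd>
# then clear the node state on the controller if it ended up drained
$ scontrol update nodename=node-1 state=resume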
I figured it out. slurmd doesn't run on login nodes. So you need an updated
copy of slurm.conf on the login nodes.
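I.e. something along these lines (login node name is a placeholder):
$ scp /etc/slurm/slurm.conf login-node-1:/etc/slurm/slurm.conf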
Best,
Durai
On Tue, Jul 20, 2021, 16:32 Ward Poelmans wrote:
> Hi,
>
> On 20/07/2021 16:01, Durai Arasan wrote:
> >
> > This is limited to this one nod
Hello,
We have set up "configless slurm" by passing a "conf-server" argument to
slurmd on all nodes. More details here:
https://slurm.schedmd.com/configless_slurm.html
One of the nodes is not able to pick up the configuration:
>srun -w slurm-bm-70 --pty bash
srun: error: fwd_tree_thread:
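For context, this is roughly how the conf-server argument is passed on each
node; a sketch (controller hostname is a placeholder and the sysconfig path
varies by distribution):
# /etc/sysconfig/slurmd (or /etc/default/slurmd), read by the slurmd unit
SLURMD_OPTIONS="--conf-server slurm-controller.example.com"
# equivalently, when starting slurmd by hand:
$ slurmd --conf-server slurm-controller.example.com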
Just submit your job
> and do your work on the node. Even if it is an interactive job. Keeps
> your dev/test environment the same as the runtime environment.
>
> Brian Andrus
>
> On 7/19/2021 7:09 AM, Durai Arasan wrote:
> > Hello,
> >
> > One of our slurm user'
Hello,
One of our Slurm users' accounts is hung with uninterruptible processes.
These processes cannot be killed. Hence a restart is required. Is it
possible to restart the user's login environment alone? I would rather not
restart the entire login node.
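For what it's worth, on a systemd-based login node the closest thing I know
of to restarting a single user's environment is terminating that user's
sessions (username is a placeholder; truly uninterruptible D-state
processes will still not go away):
$ loginctl terminate-user someuser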
Thanks!
Durai
Max Planck Institute Tübinge
Hi,
Frequently all of our GPU nodes (8xGPU each) are in MIXED state and there
is no IDLE node. Some jobs require a complete node (all 8 GPUs), and such
jobs therefore have to wait a long time before they can run.
Is there a way of improving this situation? E.g. by not blocking IDLE nodes
with jobs
Hi,
A job submitted with sbatch cannot run across multiple partitions. If
several partitions are listed, the job is submitted to the one where it can
start first (from the sbatch reference).
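So you can still list several partitions and let Slurm pick; a sketch (the
second partition name and the script are placeholders):
$ sbatch --partition=gpu-2080ti,gpu-v100 job.sh
# the job runs entirely within whichever listed partition can start it first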
Best,
Durai
On Sat, Jan 23, 2021 at 6:50 AM Nizar Abed wrote:
> Hi list,
>
> I’m trying to enforce limits based on associations, but be
Hi,
As you know, for each partition you can specify
AllowAccounts=account1,account2...
I have a parent account, say "parent1", with two child accounts, "child1"
and "child2".
I expected that setting AllowAccounts=parent1 would allow parent1, child1,
and child2 to submit jobs to that partition. But unfor
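If that is the case, the workaround seems to be listing the child accounts
explicitly; a sketch (partition name is a placeholder):
PartitionName=part-a AllowAccounts=parent1,child1,child2 ...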
os,safe"
>
> Best,
>
> Matt
>
>
>
> On Wed, Oct 21, 2020 at 11:22 AM Durai Arasan
> wrote:
>
>> Hello,
>>
>> We recently created a new partition with the following slurm.conf and QOS
>> settings:
>>
>> *cat /etc/slu
Hello,
We recently created a new partition with the following slurm.conf and QOS
settings:
cat /etc/slurm/slurm.conf | grep part-long
PartitionName=part-long Nodes=node-1,node-2,node-3 Default=YES
AllowAccounts=group1,group2 TRESBillingWeights="gres/gpu=22" MaxNodes=1
MaxTime=10-0 QOS=long-10
> .
>
> You can also check the OverSubscribe setting on a partition using the
> sinfo -o "%h" option.
> sinfo -o '%P %.5a %.10h %N ' | head
>
> PARTITION AVAIL OVERSUBSCR NODELIST
>
>
> Look at the sinfo options for further details.
>
>
> Jackie
>
> O
Hello,
On our cluster we have SelectTypeParameters set to "CR_Core_Memory".
Under these conditions multiple jobs should be able to run on the same
node. But they refuse to be allocated on the same node: only one job runs
on the node and the rest of the jobs stay in pending state.
When we changed S
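One common cause of this under CR_Core_Memory is memory: it becomes a
consumable resource, and a job that requests no memory gets the configured
default, which is the whole node when DefMemPerCPU/DefMemPerNode is unset,
so no second job fits. A sketch of the relevant slurm.conf lines (values
are placeholders):
SelectType=select/cons_tres        # or select/cons_res on older setups
SelectTypeParameters=CR_Core_Memory
DefMemPerCPU=4000                  # default memory request instead of the whole node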
Hi,
Please see job preemption:
https://slurm.schedmd.com/preempt.html
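A minimal sketch of the kind of slurm.conf setup that page describes for
suspending low-priority work when high-priority jobs arrive (partition
names, node list and tiers are placeholders):
PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG
PartitionName=low  Nodes=ALL PriorityTier=1 Default=YES
PartitionName=high Nodes=ALL PriorityTier=2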
Best,
Durai Arasan
Zentrum für Datenverarbeitung
Tübingen
On Tue, Jul 7, 2020 at 6:45 PM zaxs84 wrote:
> Hi all.
>
> Is there a scheduler option that allows low priority jobs to be
> immediately paused (or
Hi,
Can you post the output of the following commands on your master node?
sacctmgr show cluster
scontrol show nodes
Best,
Durai Arasan
Zentrum für Datenverarbeitung
Tübingen
On Tue, Jun 30, 2020 at 10:33 AM Alberto Morillas, Angelines <
angelines.albe...@ciemat.es> wrote:
> Hi,
&
sources used and the
resources promised (correct me if I am wrong).
Unfortunately both of these concepts are used to determine priority, which
confuses me.
Are these two concepts different?
Thanks,
Durai Arasan
Zentrum für Datenverarbeitung
Tübingen
be automatically managed by configuration instead of being specified by the
job submitter, say using the "--cpus-per-gpu" option.
Which part of SLURM configuration can be used to enforce this ratio?
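One knob that comes close is the partition-level DefCpuPerGPU, which
supplies a default CPU count per allocated GPU when the submitter does not
pass --cpus-per-gpu; a sketch (it sets a default rather than a hard limit,
and the partition name, node list and value are placeholders):
PartitionName=gpu-part Nodes=node-[1-4] DefCpuPerGPU=4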
Thank you,
Durai Arasan
Zentrum für Datenverarbeitung
Tübingen
another option
that can do it?
Thanks,
Durai Arasan
Zentrum für Datenverarbeitung
Tübingen
Thank you. We are planning to put ssh keys on login nodes only and use the
PAM module to control access to compute nodes. Will such a setup work? Or
do the ssh keys also need to be present on the compute nodes for the PAM
module to work? I'm sorry, but this is not clearly mentioned in any
documentation.
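For reference, the pam_slurm_adopt documentation has the access control
done with a line like the one below in /etc/pam.d/sshd on the compute
nodes; it only decides whether sshd lets the user in (based on whether they
have a job running there), so the normal authentication still has to
succeed first (a sketch; the exact position in the PAM stack is
site-specific):
# /etc/pam.d/sshd on the compute nodes
account    required      pam_slurm_adopt.so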
e
>
> [1] https://slurm.schedmd.com/pam_slurm_adopt.html
>
>
> >> On Jun 8, 2020, at 12:01 , Durai Arasan wrote:
> >>
> >> Hi Jeffrey,
> >>
> >> Thanks for the clarification.
> >>
> >> But this is concerning, as the use
every node. Thus, they have the same id_rsa key and authorized_keys file
> present on all nodes.
>
>
>
>
> > On Jun 8, 2020, at 11:42 , Durai Arasan wrote:
> >
> > Ok, that was useful information.
> >
> > So when you provision user accounts, you add th
Ok, that was useful information.
So when you provision user accounts, you add the public key to
.ssh/authorized_keys of *all* nodes on the cluster? Not just the login
nodes?
> When we provision user accounts on our Slurm cluster we still add .ssh,
> .ssh/id_rsa (needed for older X11 tunnelin
Hi,
we are setting up a slurm cluster and are at the stage of adding ssh keys
of the users to the nodes.
I thought it would be sufficient to add the ssh keys of the users to only
the designated login nodes. But I heard that it is also necessary to add
them to the compute nodes as well for slurm t
Hi,
We have a cluster with 2 slave nodes. These are the slurm.conf lines
describing nodes and partitions:
NodeName=slurm-gpu-1 NodeAddr=192.168.0.200 Procs=16 Gres=gpu:2 State=UNKNOWN
NodeName=slurm-gpu-2 NodeAddr=192.168.0.124 Procs=1 Gres=gpu:0 State=UNKNOWN
PartitionName=gpu Nodes=slurm-gpu
When configuring a Slurm cluster you need to have a copy of the
configuration file slurm.conf on all nodes, and these copies are identical.
If you need to use GPUs in your cluster, there is an additional
configuration file that you also need to have on all nodes: this is
gres.conf. My
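A minimal sketch of what such a gres.conf might look like (node name reused
from the earlier snippet purely for illustration; device paths are
placeholders and depend on the actual hardware):
# gres.conf on the GPU node
NodeName=slurm-gpu-1 Name=gpu File=/dev/nvidia[0-1]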