Jeremy,
What is the value of TreeWidth in your slurm.conf? If there is no entry, I recommend setting it to a value a bit larger than the number of nodes in your cluster and then restarting slurmctld.
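For example (a sketch; TreeWidth=120 is my assumption for a cluster of roughly 100 nodes):

  # /etc/slurm/slurm.conf
  TreeWidth=120

  # then, on the controller node:
  systemctl restart slurmctld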
Best,
Steve
On Wed, Feb 2, 2022 at 12:59 AM Jeremy Fix wrote:
Hi,
A follow-up. I thought some of the nodes were OK, but that's not the case. This morning, another pool of consecutive compute nodes is idle* (why consecutive, by the way? they always fail consecutively). And now some of the nodes which were drained came back to life in idle and are again switching to idle*.
Thanks for your help,
Jeremy.
That looks like a DNS issue.
Verify all your nodes are able to resolve the names of each other.
Check /etc/resolv.conf, /etc/hosts and /etc/slurm/slurm.conf on the
nodes (including head/login nodes) to ensure they all match.
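For example (a sketch; 'node[01-10]' is a placeholder for your real node list):

  # run on every node, including the head/login nodes
  for h in $(scontrol show hostnames 'node[01-10]'); do
      getent hosts "$h" > /dev/null || echo "cannot resolve: $h"
  done

getent resolves through NSS, so it checks /etc/hosts and DNS in the order nsswitch.conf specifies.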
Brian Andrus
On 2/1/2022 1:37 AM, Jeremy Fix wrote:
Hello everyone,
We are facing a weird issue. On a regular basis, some compute nodes go from *idle* -> *idle** -> *down* and loop back to *idle* on their own. Slurm manages several nodes, and this state cycle appears only for some pools of nodes.
We get a trace on the compute node such as:
[2022
> 4. In slurm.conf, set "X11Parameters=home_xauthority"
> 5. Update your slurm cluster and restart.
>
> Steps 3&4 seemed to be the key ones I originally missed – especially 4
> (https://slurm.schedmd.com/slurm.conf.html#OPT_X11Parameters)
>
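For reference, the slurm.conf lines those steps refer to look like this (a sketch based on the linked documentation; verify against your Slurm version):

  PrologFlags=X11
  X11Parameters=home_xauthority

With home_xauthority, Slurm writes the xauth data to ~/.Xauthority in the user's home rather than to a temporary file.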
Hi,
I'm unable to run an X11 application with a remote SlurmctldHost. Let us call myfrontalnode the node from which the user runs the Slurm commands; it is different from the SlurmctldHost.
What fails is the following:
ssh -X myfrontalnode
srun --x11 xclock
which
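A quick check (my own sketch, not from the original message): see whether Slurm sets up a forwarded display inside the step at all:

  srun --x11 sh -c 'echo "DISPLAY=$DISPLAY"; xauth list'

If DISPLAY comes back empty, the controller-side X11 configuration (PrologFlags/X11Parameters) is the first thing to inspect.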
Actually, I solved the issue by observing that the user had created a file "~/.vnc/xstartup.sh" while it should have been "~/.vnc/xstartup". Simply removing the extension lets vncserver start successfully, even within an srun!
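Concretely, the fix amounts to (the chmod is my assumption; vncserver only runs xstartup if it is executable):

  mv ~/.vnc/xstartup.sh ~/.vnc/xstartup
  chmod +x ~/.vnc/xstartup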
Best,
Jeremy.
On 15/05/2021 14:
Hello!
I'm facing a weird issue. With one user, call it gpupro_user: if I log in with ssh to a compute node, I can run a vncserver successfully (see command [1] below; in my case, a tigervnc server). However, if I allocate the exact same node through an srun (see command [2] below), running the vncserver fails.
> # echo "$(((NEXTRES - NOW) / 3600)) hours left until reservation begins"
> 178 hours left until reservation begins
>
> Cheers,
> Florian
>
> *From:* slurm-users on behalf of Jeremy Fix
> *Sent:* Monday, 29 March 2021 10:4
Hi,
I'm wondering if there is any built-in option to automatically set a job's TimeLimit to fit within a defined reservation.
For now, it seems to me that the time limit must be explicitly provided by the user when invoking the srun or sbatch command, in agreement with the deadline of the reservation, while
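For illustration, a workaround sketch in bash ("myresa" and "job.sh" are placeholder names; assumes GNU date/grep):

  # derive a --time limit (in minutes) from the reservation's end time
  END=$(scontrol show reservation myresa | grep -oP 'EndTime=\S+' | cut -d= -f2)
  MINUTES=$(( ( $(date -d "$END" +%s) - $(date +%s) ) / 60 ))
  sbatch --reservation=myresa --time="$MINUTES" job.sh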