Re: [slurm-users] Compute nodes cycling from idle to down on a regular basis ?

2022-02-02 Thread Jeremy Fix
Jeremy, What is the value of TreeWidth in your slurm.conf? If there is no entry then I recommend setting it to a value a bit larger than the number of nodes you have in your cluster and then restarting slurmctld. Best, Steve On Wed, Feb 2, 2022 at 12:59 AM Jeremy Fix wrote: Hi, A fo
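For reference, a minimal sketch of what that setting could look like, assuming a cluster of roughly 60 nodes (the node count and value below are only illustrative, not from the thread):

    # slurm.conf -- illustrative fragment only
    # TreeWidth controls the fanout of slurmd message forwarding;
    # a value a bit larger than the node count effectively disables tree forwarding.
    TreeWidth=65

As suggested above, slurmctld has to be restarted after the change (e.g. systemctl restart slurmctld on the controller).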

Re: [slurm-users] Compute nodes cycling from idle to down on a regular basis ?

2022-02-01 Thread Jeremy Fix
Hi, A follow-up. I thought some of the nodes were ok but that's not the case; this morning, another pool of consecutive (why consecutive, by the way? they are always consecutively failing) compute nodes are idle*. And now some of the nodes which were drained came back to life in idle and now again swit

Re: [slurm-users] Compute nodes cycling from idle to down on a regular basis ?

2022-02-01 Thread Jeremy Fix
help, Jeremy. That looks like a DNS issue. Verify all your nodes are able to resolve the names of each other. Check /etc/resolv.conf, /etc/hosts and /etc/slurm/slurm.conf on the nodes (including head/login nodes) to ensure they all match. Brian Andrus On 2/1/2022 1:37 AM, Jeremy Fix wrote
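A hedged sketch of the kind of consistency checks suggested above, run on each node (node names and addresses are placeholders):

    # Compare the resolver, hosts and slurm configuration across nodes
    md5sum /etc/resolv.conf /etc/hosts /etc/slurm/slurm.conf
    # Verify forward resolution of a peer compute node
    getent hosts node01
    # Optionally check reverse resolution of its address
    getent hosts 10.0.0.11

Identical checksums and consistent name/address answers on every node (including head/login nodes) are what to look for.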

[slurm-users] Compute nodes cycling from idle to down on a regular basis ?

2022-02-01 Thread Jeremy Fix
Hello everyone, we are facing a weird issue. On a regular basis, some compute nodes go from *idle* -> *idle** -> *down* and loop back to idle on their own. Slurm manages several nodes and this state cycle appears only for some pools of nodes. We get a trace on the compute node as: [2022
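When chasing this kind of cycling, a usual first step is to ask Slurm why it marked the nodes down; a minimal sketch (the node name is a placeholder):

    # List down/drained nodes together with the recorded reason
    sinfo -R
    # Show the full state, including the Reason field, for one affected node
    scontrol show node node01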

Re: [slurm-users] Failed to forward X11 with a remote scheduler

2021-12-17 Thread Jeremy Fix
_xauthority” > 5. Update your slurm cluster and restart. > > Steps 3&4 seemed to be the key ones I originally missed – especially 4 > (https://slurm.schedmd.com/slurm.conf.html#OPT_X11Parameters) >
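For reference, the option referred to in step 4 is a slurm.conf setting; a minimal sketch of the relevant lines, assuming Slurm's built-in X11 forwarding is in use (the PrologFlags line is an assumption, not quoted from the thread):

    # slurm.conf -- illustrative fragment only
    # Enable Slurm's built-in X11 forwarding
    PrologFlags=x11
    # Write the xauthority file in the user's home rather than in /tmp
    X11Parameters=home_xauthority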

[slurm-users] Failed to forward X11 with a remote scheduler

2021-12-07 Thread Jeremy Fix
Hi, I'm unsuccessful in running an X11 application with a remote SlurmctldHost. Let us call myfrontalnode the node from which the user is running the slurm commands, which is a different machine from the SlurmctldHost. What fails is the following: ssh -X myfrontalnode srun --x11 xclock which

Re: [slurm-users] Running vnc after srun fails but works after a direct ssh

2021-05-17 Thread Jeremy Fix
Actually, I solved the issue by observing that the user had created a file "~/.vnc/xstartup.sh" while it should have been "~/.vnc/xstartup". Simply removing the extension is enough: vncserver starts successfully, even in a srun! Best; Jeremy. On 15/05/2021 14:
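In shell terms, the fix described above amounts to something like this:

    # Rename the startup script to the name the vnc server actually looks for
    mv ~/.vnc/xstartup.sh ~/.vnc/xstartup
    chmod +x ~/.vnc/xstartup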

[slurm-users] Running vnc after srun fails but works after a direct ssh

2021-05-15 Thread Jeremy Fix
Hello! I'm facing a weird issue. With one user, call it gpupro_user, if I log in with ssh on a compute node, I can run a vncserver (see command [1] below) successfully (in my case, a tigervnc server). However, if I allocate the exact same node through a srun (see command [2] below), running vnc ser

Re: [slurm-users] [External] Autoset job TimeLimit to fit in a reservation

2021-03-30 Thread Jeremy Fix
> # echo "$(((NEXTRES - NOW) / 3600)) hours left until reservation begins" > 178 hours left until reservation begins > > Cheers, > Florian > > *From:* slurm-users on behalf > of Jeremy Fix > *Sent:* Monday, 29 March 2021 10:4
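For context, a hedged sketch of the surrounding calculation as I read it from the quoted lines; the reservation name and the parsing of scontrol output are assumptions:

    # Epoch seconds for "now" and for the start of the next reservation
    NOW=$(date +%s)
    # The reservation name "my_resv" is a placeholder
    START=$(scontrol show reservation my_resv | grep -oP 'StartTime=\S+' | cut -d= -f2)
    NEXTRES=$(date -d "$START" +%s)
    echo "$(((NEXTRES - NOW) / 3600)) hours left until reservation begins"

The computed number of hours can then be passed to srun/sbatch as a --time value that fits before the reservation begins.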

[slurm-users] Autoset job TimeLimit to fit in a reservation

2021-03-29 Thread Jeremy Fix
Hi, I'm wondering if there is any built-in option to autoset a job TimeLimit to fit within a defined reservation. For now, it seems to me that the timelimit must be explicitly provided, in agreement with the deadline of the reservation, by the user when invoking the srun or sbatch command while
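As an illustration of the current situation described above, the time limit has to be passed explicitly alongside the reservation; the reservation name, duration and script name below are placeholders:

    # The job must be told both the reservation and a time limit that fits inside it
    sbatch --reservation=my_resv --time=04:00:00 job.sh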