Did you build Slurm yourself from source? If so, the node you build on needs
the munge development package installed (munge-devel on EL systems,
libmunge-dev on Debian).
You then need to set up munge with a shared munge key between the nodes, and
have the munge daemon running on all of them.
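A minimal sketch of that setup, assuming an EL node and the default key
location (package names, paths and the host name "other-node" are placeholders
to adjust for your site):

  yum install munge munge-libs munge-devel         # libmunge-dev on Debian/Ubuntu
  /usr/sbin/create-munge-key                       # create the key once, on one node only
  scp /etc/munge/munge.key other-node:/etc/munge/munge.key
  chown munge:munge /etc/munge/munge.key && chmod 400 /etc/munge/munge.key   # repeat on every node
  systemctl enable --now munge
  munge -n | ssh other-node unmunge                # quick end-to-end test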
One thing to be aware of when setting partition states to down:
* Setting partition state=down will be reset if slurmctld is restarted.
Read the slurmctld man-page under the -R parameter. So it's better not to
restart slurmctld during the downtime.
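A quick way to verify (and, if needed, re-apply) the state after a slurmctld
restart is sketched below; the partition name is a placeholder:

  sinfo -o "%R %a"                                  # list partitions and their up/down state
  scontrol update PartitionName=<name> State=DOWN   # re-apply if the restart reset it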
/Ole
On 2/1/22 08:11, Ole Holm Nielsen wrote:
Login nodes being down doesn't affect Slurm jobs at all (except if you run
slurmctld/slurmdbd on the login node ;-)
To stop new jobs from being scheduled to run, mark all partitions
down. This is useful when recovering the cluster from a power or cooling
downtime, for example.
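A sketch of doing that for every partition at once (nothing site-specific
beyond your own partition names):

  for p in $(sinfo --noheader -o "%R"); do
      scontrol update PartitionName="$p" State=DOWN
  done
  # ... after the downtime:
  for p in $(sinfo --noheader -o "%R"); do
      scontrol update PartitionName="$p" State=UP
  done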
I wrote
Brian / Christopher, that looks like a good process, thanks guys, I will do
some testing and let you know.
If I mark a partition down and it has running jobs, what happens to those
jobs? Do they keep running?
Sid Young
W: https://off-grid-engineering.com
W: (personal) https://sidyoung.com/
W: (personal) https://z900collector.wordpress.com/
On 1/31/22 9:25 pm, Brian Andrus wrote:
touch /etc/nologin
That will prevent new logins.
It's also useful that if you put a message in /etc/nologin, users trying to
log in will see that message before being denied.
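For example, a minimal sketch (the wording of the message is up to you):

  echo "Login node down for hardware maintenance, back in ~30 minutes" > /etc/nologin
  # ... and after the work is done:
  rm /etc/nologin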
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org
One possibility:
Sounds like your concern is folks with interactive jobs from the login
node that are running under screen/tmux.
That being the case, you need running jobs to end and not allow new
users to start tmux sessions.
Definitely doing 'scontrol update state=down partition=<name>' for each partition.
Sid Young
W: https://off-grid-engineering.com
W: (personal) https://sidyoung.com/
W: (personal) https://z900collector.wordpress.com/
On Tue, Feb 1, 2022 at 3:02 PM Christopher Samuel wrote:
> On 1/31/22 4:41 pm, Sid Young wrote:
>
> > I need to replace a faulty DIMM chip in our login node so I need to stop
> > new jobs being kicked off while letting the old ones end.
On 1/31/22 9:00 pm, Christopher Samuel wrote:
That would basically be the way
Thinking further on this, a better way would be to mark your partitions
down, as it's likely you've got fewer partitions than compute nodes.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
On 1/31/22 4:41 pm, Sid Young wrote:
I need to replace a faulty DIMM chip in our login node so I need to stop
new jobs being kicked off while letting the old ones end.
I thought I would just set all nodes to drain to stop new jobs from
being kicked off...
That would basically be the way.
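A sketch of the drain approach (the node list is a placeholder); draining lets
running jobs finish while refusing new ones on those nodes:

  scontrol update NodeName=node[001-032] State=DRAIN Reason="login node DIMM swap"
  # ... and afterwards:
  scontrol update NodeName=node[001-032] State=RESUME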
Dear Ole and Hermann,
I have reinstalled slurm from scratch now following this link:
The error remains the same. Kindly guide me where I will find this
cred/munge plugin. Please help me resolve this issue.
[root@exxact slurm]# slurmd -C
NodeName=exxact CPUs=12 Boards=1 SocketsPerBoard=1 CoresPer
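As a rough pointer (the plugin directory depends on the --prefix used when
building, so the paths below are assumptions): the cred/munge plugin is a
shared object installed alongside the other Slurm plugins, and you can check
whether your build includes it with something like:

  find /usr/lib64/slurm /usr/local/lib/slurm -name '*munge*.so' 2>/dev/null
  # a munge-enabled build should show auth_munge.so and cred_munge.so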
Dear Ole,
Thank you for your response.
I am doing it again using your suggested link.
Best Regards,
Nousheen Parvaiz
On Mon, Jan 31, 2022 at 2:07 PM Ole Holm Nielsen wrote:
> Hi Nousheen,
>
> I recommend you again to follow the steps for installing Slurm on a CentOS
> 7 cluster:
> https://wiki.fysik.dtu.dk/niflheim/Slurm_installation
Best Regards,
Nousheen Parvaiz
Ph.D. Scholar
National Center For Bioinformatics
Quaid-i-Azam University, Islamabad
Dear Hermann,
Thank you for your reply. I have given below my slurm.conf and log file.
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
G'Day all,
I need to replace a faulty DIMM chip in our login node so I need to stop
new jobs being kicked off while letting the old ones end.
I thought I would just set all nodes to drain to stop new jobs from being
kicked off... does this sound like a good idea? The downtime window would be
20-30 minutes.
Not a solution, but some ideas & experiences concerning the same topic:
A few of our older GPUs used to show the error message "has fallen off
the bus" which was only resolved by a full power cycle as well.
Something changed; nowadays the error message is "GPU lost" and a
normal reboot resolves it.
Make sure you properly configured nsswitch.conf.
Most commonly this kind of issue indicates that you forgot to define
initgroups correctly.
It should look something like this:
...
group: files [SUCCESS=merge] systemd [SUCCESS=merge] ldap
...
initgroups: files [SUCCESS=continue] ldap
...
I solved this issue by adding a group to IPA that matched the same name and
GID of the local groups, then using [SUCCESS=merge] in nsswitch.conf for
groups, and on our CentOS 8 nodes adding "enable_files_domain = False" in
the sssd.conf file.
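For reference, that option sits in the [sssd] section of sssd.conf; a minimal
sketch, assuming an otherwise standard configuration:

  [sssd]
  enable_files_domain = False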
On Fri, Jan 28, 2022 at 5:02 PM Ratnasamy, Fritz wrote:
This is not an answer to the MIG issue but to the question that Esben
has. We at SURF have developed sharing of all the GPUs in a node. We
"misuse" the SLURM MPS feature. At SURF this is mostly used for GPU
courses, e.g. JupyterHub.
We have tested it with Slurm version 20.11.8. The code is publicly
available.
I have a large compute node with 10 RTX8000 cards at a remote colo.
One of the cards on it is acting up, "falling off the bus" once a day and
requiring a full power cycle to reset.
I want jobs to avoid that card as well as the card it is NVLINK'ed to.
So I modified gres.conf on that node as follows:
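A sketch of one way to do it (device indices and the Type name are
assumptions; the faulty card and its NVLink peer are assumed here to be
/dev/nvidia4 and /dev/nvidia5), listing only the healthy devices explicitly:

  # gres.conf on that node: expose 8 of the 10 cards
  Name=gpu Type=rtx8000 File=/dev/nvidia[0-3]
  Name=gpu Type=rtx8000 File=/dev/nvidia[6-9]
  # the node's slurm.conf entry would then advertise Gres=gpu:rtx8000:8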
I looked at option
> 2.2.3 using partial "AutoDetect=nvml"
again and saw that the reason for failure was indeed the sanity check,
but it was my fault because I set an invalid "Links" value for the
"hardcoded" GPUs. So this variant of gres.conf setup works and gives me
everything I want, sorry for the noise.
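For anyone following along, a rough sketch of that partial-AutoDetect layout
(node name, device paths and Links values are made up; the Links lists have to
match what nvidia-smi topo reports, or the sanity check fails as described):

  AutoDetect=nvml
  NodeName=gpu01 AutoDetect=off Name=gpu Type=rtx8000 File=/dev/nvidia0 Links=-1,2
  NodeName=gpu01 AutoDetect=off Name=gpu Type=rtx8000 File=/dev/nvidia1 Links=2,-1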
Hi Nousheen,
I recommend you again to follow the steps for installing Slurm on a CentOS
7 cluster:
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation
Maybe you will need to start installation from scratch, but the steps are
guaranteed to work if followed correctly.
IHTH,
Ole
Dear Nousheen,
I guess there is something missing in your installation - probably your
slurm.conf?
Do you have logging enabled for slurmctld? If yes what do you see in
that log?
Or what do you get if you run slurmctld manually like this:
/usr/local/sbin/slurmctld -D
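If that does not show enough detail, the -v flag can be repeated to raise the
verbosity, e.g.:

  /usr/local/sbin/slurmctld -D -vvvv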
Regards,
Hermann