[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-19 Thread Ole Holm Nielsen via slurm-users
It turns out that the Slurm job limits are *not* controlled by the normal 
/etc/security/limits.conf configuration.  Any service running under 
systemd (such as slurmd) has its limits defined by systemd; see [1] and [2].


The limits of processes started by slurmd are defined by LimitXXX in 
/usr/lib/systemd/system/slurmd.service, and current Slurm versions have 
LimitNOFILE=131072.
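
Should one want to raise it, a systemd drop-in is presumably preferable to 
editing the packaged unit file.  A minimal sketch, with an illustrative value:

$ mkdir -p /etc/systemd/system/slurmd.service.d
$ cat > /etc/systemd/system/slurmd.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=262144
EOF
$ systemctl daemon-reload && systemctl restart slurmd
$ systemctl show slurmd | grep LimitNOFILE    # verify the effective limit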


I guess that LimitNOFILE is the limit applied to every Slurm job, and that 
jobs presumably ought to fail if they open more than LimitNOFILE files?


If this is correct, I think the kernel's fs.file-max ought to be set to 
131072 times the maximum possible number of Slurm jobs per node, plus a 
safety margin for the OS.  Depending on the Slurm configuration, that works 
out to 131072 times the number of CPUs plus some extra margin.  For 
example, a 96-core node might have fs.file-max set to 100*131072 = 13107200.
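
A chosen value could be made persistent with a sysctl drop-in file, e.g. 
(the file name is arbitrary and the value is the example above):

$ echo "fs.file-max = 13107200" > /etc/sysctl.d/90-file-max.conf
$ sysctl -p /etc/sysctl.d/90-file-max.conf
fs.file-max = 13107200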


Does this make sense?

Best regards,
Ole

[1] "How to set limits for services in RHEL and systemd" 
https://access.redhat.com/solutions/1257953
[2] 
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#slurmd-systemd-limits


On 4/18/24 11:23, Ole Holm Nielsen wrote:
I looked at some of our busy 96-core nodes where users are currently 
running the STAR-CCM+ CFD software.


One job runs on 4 of our 96-core nodes.  I'm amazed that each STAR-CCM+ 
process has almost 1000 open files, for example:


$ lsof -p 440938 | wc -l
950

and that on this node the user has almost 95000 open files:

$ lsof -u  | wc -l
94606
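
A rough count of just the file descriptors (without lsof's extra entries for 
memory-mapped libraries, current directories and so on) can be had from 
/proc; "someuser" is a placeholder here:

$ for p in $(pgrep -u someuser); do ls /proc/$p/fd 2>/dev/null; done | wc -l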

So it's no wonder that a limit of 65536 open files would have been 
exhausted, and that my current limit is only just sufficient:


$ sysctl fs.file-max
fs.file-max = 131072
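
The kernel's current usage relative to this limit can be read from 
/proc/sys/fs/file-nr, which shows the number of allocated file handles, the 
number allocated but unused, and the fs.file-max value (output illustrative):

$ cat /proc/sys/fs/file-nr
94720	0	131072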

As an experiment I lowered the max number of files on a node:

$ sysctl fs.file-max=32768

and immediately the syslog displayed error messages:

Apr 18 10:54:11 e033 kernel: VFS: file-max limit 32768 reached

Munged (version 0.5.16) logged a lot of errors:

2024-04-18 10:54:33 +0200 Info:  Failed to accept connection: Too many 
open files in system
2024-04-18 10:55:34 +0200 Info:  Failed to accept connection: Too many 
open files in system
2024-04-18 10:56:35 +0200 Info:  Failed to accept connection: Too many 
open files in system

2024-04-18 10:57:22 +0200 Info:  Encode retry #1 for client UID=0 GID=0
2024-04-18 10:57:22 +0200 Info:  Failed to send message: Broken pipe
(many lines deleted)

Slurmd also logged some errors:

[2024-04-18T10:57:22.070] error: slurm_send_node_msg: [(null)] 
slurm_bufs_sendto(msg_type=RESPONSE_ACCT_GATHER_UPDATE) failed: Unexpected 
missing socket error
[2024-04-18T10:57:22.080] error: slurm_send_node_msg: [(null)] 
slurm_bufs_sendto(msg_type=RESPONSE_PING_SLURMD) failed: Unexpected 
missing socket error
[2024-04-18T10:57:22.080] error: slurm_send_node_msg: [(null)] 
slurm_bufs_sendto(msg_type=RESPONSE_PING_SLURMD) failed: Unexpected 
missing socket error



The node became completely non-responsive until I restored 
fs.file-max=131072.


Conclusions:

1. Munge should be upgraded to 0.5.15 or later to avoid munged.log filling 
up the disk.  I have summarized this on the Wiki page 
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#munge-authentication-service
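
The installed Munge version can be checked with the packaging tools, for 
example:

$ rpm -q munge      # on EL systems
$ dpkg -l munge     # on Debian/Ubuntu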


2. We still need some heuristic for determining a sufficient value for the 
kernel's fs.file-max limit.  I don't know whether the kernel itself picks 
good default values, as we have noticed on some servers and login nodes.


As Jeffrey points out, there are both soft and hard limits on the number of 
open files, and this is what I see for a normal user:


$ ulimit -Sn   # Soft limit
1024
$ ulimit -Hn   # Hard limit
262144

Maybe the heuristic could be to multiply "ulimit -Hn" by the CPU core 
count (if we believe that users will run only 1 process per core), with an 
extra safety margin added on top.  Or maybe we need something a lot higher?
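
As a concrete sketch of that heuristic on the 96-core node above:

$ echo $(( $(ulimit -Hn) * $(nproc) ))
25165824

i.e. 262144 * 96, before adding any safety margin.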


Question: Would there be any negative side effect of setting fs.file-max 
to a very large number (10s of millions)?


Interestingly, the (possibly outdated) Large Cluster Administration Guide 
at https://slurm.schedmd.com/big_sys.html recommends a ridiculously low 
number:


/proc/sys/fs/file-max: The maximum number of concurrently open files. We 
recommend a limit of at least 32,832.


Thanks for sharing your insights,
Ole


On 4/16/24 14:40, Jeffrey T Frey via slurm-users wrote:

AFAIK, the fs.file-max limit is a node-wide limit, whereas "ulimit -n"
is per user.


The ulimit command is a frontend to the kernel's rlimits, which are 
per-process restrictions (not per-user).


fs.file-max is the kernel's system-wide limit on how many file handles can 
be open in aggregate.  You'd have to edit that with sysctl:



    $ sysctl fs.file-max
    fs.file-max = 26161449



Check e.g. /etc/sysctl.conf or /etc/sysctl.d to see whether an alternative 
limit has been set versus the default.
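
For example, something along the lines of:

$ grep -rs fs.file-max /etc/sysctl.conf /etc/sysctl.d/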






But if you have ulimit -n == 1024, then no user should be able to hit
the fs.file-max limit, even if it i

[slurm-users] Integrating Slurm with WekaIO

2024-04-19 Thread Jeffrey Layton via slurm-users
Good afternoon,

I'm working on a cluster of NVIDIA DGX A100's that is using BCM 10 (Base
Command Manager which is based on Bright Cluster Manager). I ran into an
error and only just learned that Slurm and Weka don't get along (presumably
because Weka pins their client threads to cores). I read through their
documentation here:
https://docs.weka.io/best-practice-guides/weka-and-slurm-integration#heading-h.4d34og8

I thought I had set everything correctly, but when I try to restart the slurm
server I get the following:

Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error:
resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error:
fetch_config: DNS SRV lookup failed
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error:
_establish_configuration: failed to load configs
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error: slurmd
initialization failed
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error:
resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: fetch_config: DNS
SRV lookup failed
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error:
_establish_configuration: failed to load configs
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: slurmd
initialization failed
Apr 19 05:29:39 bcm10-headnode systemd[1]: slurmd.service: Main process
exited, code=exited, status=1/FAILURE
Apr 19 05:29:39 bcm10-headnode systemd[1]: slurmd.service: Failed with
result 'exit-code'.

Has anyone encountered this?

I read this is usually associated with configless Slurm, but I don't know
how Slurm is built in BCM. slurm.conf is located in
/cm/shared/apps/slurm/var/etc/slurm and this is what I edited. The
environment variables for Slurm are set correctly so it points to this
slurm.conf file.

One thing that I did not do was tell Slurm which cores Weka was using. I
can't seem to figure out the syntax for this. Can someone share the changes
they made to slurm.conf?
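
From the Weka guide it looks as if this is normally done with CpuSpecList or 
CoreSpecCount on the NodeName line in slurm.conf. A rough sketch only, with a 
hypothetical node name and illustrative core IDs; the Weka guide above would 
be the authoritative reference:

NodeName=dgx001 ... CpuSpecList=0,1,2,3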

Thanks!

Jeff

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Integrating Slurm with WekaIO

2024-04-19 Thread Brian Andrus via slurm-users
This is because you have no slurm.conf in /etc/slurm, so it is trying 
'configless' mode, which queries DNS to find out where to get the config. 
It is failing because you do not have DNS configured to tell nodes where to 
ask about the config.
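
For reference, configless mode expects a DNS SRV record of roughly this 
form, per https://slurm.schedmd.com/configless_slurm.html (host name and TTL 
are illustrative):

_slurmctld._tcp 3600 IN SRV 10 0 6817 headnode.example.com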


Simple solution: put a copy of slurm.conf in /etc/slurm/ on the node(s).

Brian Andrus

On 4/19/2024 9:56 AM, Jeffrey Layton via slurm-users wrote:

(quoted message deleted)


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Integrating Slurm with WekaIO

2024-04-19 Thread Robert Kudyba via slurm-users
>
> Simple solution: put a copy of slurm.conf in /etc/slurm/ on the node(s).
>
For Bright, slurm.conf is in /cm/shared/apps/slurm/var/etc/slurm, including
on all nodes. Make sure that on the compute nodes $SLURM_CONF resolves to the
correct path.



> On 4/19/2024 9:56 AM, Jeffrey Layton via slurm-users wrote:
> (quoted message deleted)

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Integrating Slurm with WekaIO

2024-04-19 Thread Jeffrey Layton via slurm-users
I like it; however, it was working before without a slurm.conf in
/etc/slurm.

Plus the environment variable SLURM_CONF is pointing to the correct
slurm.conf file (the one in /cm/...). Wouldn't Slurm pick up that one?

Thanks!

Jeff


On Fri, Apr 19, 2024 at 1:11 PM Brian Andrus via slurm-users <
slurm-users@lists.schedmd.com> wrote:

> (quoted message deleted)

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Integrating Slurm with WekaIO

2024-04-19 Thread Brian Andrus via slurm-users
I would double-check where you are setting SLURM_CONF then. It is acting 
as if it is not set (typo maybe?)


It should be in /etc/default/slurmd (but could be /etc/sysconfig/slurmd).

Also check what the final, actual command being run to start it is. If 
anyone has changed the .service file or added an override file, that 
will affect things.
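
For example, something like this shows the unit file plus any drop-ins, and 
the environment systemd hands to slurmd:

$ systemctl cat slurmd
$ systemctl show slurmd -p Environment -p ExecStart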


Brian Andrus


On 4/19/2024 10:15 AM, Jeffrey Layton wrote:
(quoted message deleted)

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Integrating Slurm with WekaIO

2024-04-19 Thread Robert Kudyba via slurm-users
On Bright it's set in a few places:
grep -r -i SLURM_CONF /etc
/etc/systemd/system/slurmctld.service.d/99-cmd.conf:Environment=SLURM_CONF=/cm/shared/apps/slurm/var/etc/slurm/slurm.conf
/etc/systemd/system/slurmdbd.service.d/99-cmd.conf:Environment=SLURM_CONF=/cm/shared/apps/slurm/var/etc/slurm/slurm.conf
/etc/systemd/system/slurmd.service.d/99-cmd.conf:Environment=SLURM_CONF=/cm/shared/apps/slurm/var/etc/slurm/slurm.conf
/etc/logrotate.d/slurmdbd.rpmsave:
 SLURM_CONF=/cm/shared/apps/slurm/var/etc/slurm/slurm.conf
/cm/shared/apps/slurm/current/bin/scontrol reconfig > /dev/null
/etc/logrotate.d/slurm.rpmsave:
 SLURM_CONF=/cm/shared/apps/slurm/var/etc/slurm/slurm.conf
/cm/shared/apps/slurm/current/bin/scontrol reconfig > /dev/null
/etc/pull.pl:$ENV{'SLURM_CONF'} =
'/cm/shared/apps/slurm/var/etc/slurm/slurm.conf';

It'd still be good to check on a compute node what echo $SLURM_CONF returns
for you.
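
For example, roughly (the node name is a placeholder):

$ ssh node001 'grep -rs SLURM_CONF /etc/systemd/system/slurmd.service.d/ /etc/sysconfig/slurmd /etc/default/slurmd'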

On Fri, Apr 19, 2024 at 1:50 PM Brian Andrus via slurm-users <
slurm-users@lists.schedmd.com> wrote:

> (quoted message deleted)

[slurm-users] any way to allow interactive jobs or ssh in Slurm 23.02 when node is draining?

2024-04-19 Thread Robert Kudyba via slurm-users
We use Bright Cluster Manager with Slurm 23.02 on RHEL9. I know about
pam_slurm_adopt https://slurm.schedmd.com/pam_slurm_adopt.html which does
not appear to come by default with the Bright 'cm' package of Slurm.

Currently ssh to a node gets:
Login not allowed: no running jobs and no WLM allocations

We have 8 GPUs per node, so when we drain a node, which can have a job
running for up to 5 days, no new jobs can run on it. And since the nodes have
20+ TB (yes, TB) local drives, researchers have work and files on them that
they need to retrieve.

Is there a way to use /etc/security/access.conf to work around this, at
least temporarily until the reboot, after which we can revert?
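
If the node's sshd PAM stack includes pam_access, entries along these lines 
in /etc/security/access.conf might serve as a temporary override (a sketch 
only; "researchers" is a hypothetical group, and whatever module produces 
the "Login not allowed" denial would still need to be relaxed separately):

+ : root : ALL
+ : (researchers) : ALL
- : ALL : ALL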

Thanks!

Rob

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com