I like it. However, it was working before without a slurm.conf in
/etc/slurm.

Plus the environment variable SLURM_CONF is pointing to the correct
slurm.conf file (the one in /cm/...). Wouldn't Slurm pick up that one?
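
To make the question concrete, here is a minimal sketch of the lookup order as I understand it from the Slurm "configless" documentation: the --conf-server option, then the -f option, then SLURM_CONF, then the default slurm.conf, and only then DNS SRV records. The function name pick_config_source and its return labels are made up for illustration, and the default path is an assumption based on this thread. None of this is Slurm code.

```python
import os
from typing import Callable, Mapping, Optional


def pick_config_source(
    env: Mapping[str, str],
    default_conf: str = "/etc/slurm/slurm.conf",  # assumed default location
    conf_server: Optional[str] = None,            # stands in for slurmd --conf-server
    config_file: Optional[str] = None,            # stands in for slurmd -f
    exists: Callable[[str], bool] = os.path.exists,
) -> str:
    """Return a label for the config source slurmd would be expected to use.

    Hypothetical helper: it mirrors the documented precedence as I
    understand it, and does not call into Slurm at all.
    """
    if conf_server:
        return f"conf-server:{conf_server}"
    if config_file:
        return f"file:{config_file}"
    if env.get("SLURM_CONF"):
        return f"file:{env['SLURM_CONF']}"
    if exists(default_conf):
        return f"file:{default_conf}"
    # Nothing local was found, so the only remaining option is DNS SRV.
    return "dns-srv"


if __name__ == "__main__":
    # Environment of the current process (e.g. a login shell with modules loaded).
    print(pick_config_source(os.environ))
    # An empty environment, roughly what a service sees if the unit sets nothing.
    print(pick_config_source({}))
```

If that order is right, SLURM_CONF only helps when it is set in the environment slurmd itself runs in. A variable set in my login shell is not automatically visible to a service started by systemd, which might explain why slurmd fell through to the DNS SRV lookup.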

Thanks!

Jeff


On Fri, Apr 19, 2024 at 1:11 PM Brian Andrus via slurm-users <
slurm-users@lists.schedmd.com> wrote:

> This is because you have no slurm.conf in /etc/slurm, so it is trying
> 'configless' mode, which queries DNS to find out where to get the config. It is
> failing because you do not have DNS configured to tell nodes where to ask
> about the config.
>
> Simple solution: put a copy of slurm.conf in /etc/slurm/ on the node(s).
>
> Brian Andrus
> On 4/19/2024 9:56 AM, Jeffrey Layton via slurm-users wrote:
>
> Good afternoon,
>
> I'm working on a cluster of NVIDIA DGX A100s that is using BCM 10 (Base
> Command Manager, which is based on Bright Cluster Manager). I ran into an
> error and only just learned that Slurm and Weka don't get along (presumably
> because Weka pins its client threads to cores). I read through their
> documentation here:
> https://docs.weka.io/best-practice-guides/weka-and-slurm-integration#heading-h.4d34og8
>
> I thought I set everything correctly, but when I try to restart the slurm
> server I get the following:
>
> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error:
> resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error:
> fetch_config: DNS SRV lookup failed
> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error:
> _establish_configuration: failed to load configs
> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error: slurmd
> initialization failed
> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error:
> resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: fetch_config: DNS
> SRV lookup failed
> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error:
> _establish_configuration: failed to load configs
> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: slurmd
> initialization failed
> Apr 19 05:29:39 bcm10-headnode systemd[1]: slurmd.service: Main process
> exited, code=exited, status=1/FAILURE
> Apr 19 05:29:39 bcm10-headnode systemd[1]: slurmd.service: Failed with
> result 'exit-code'.
>
> Has anyone encountered this?
>
> I read this is usually associated with configless Slurm, but I don't know
> how Slurm is built in BCM. slurm.conf is located in
> /cm/shared/apps/slurm/var/etc/slurm and this is what I edited. The
> environment variables for Slurm are set correctly so it points to this
> slurm.conf file.
>
> One thing that I did not do was tell Slurm which cores Weka was using. I
> can't seem to figure out the syntax for this. Can someone share the changes
> they made to slurm.conf?
>
> Thanks!
>
> Jeff
>
>
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
>