On Bright it's set in a few places:
grep -r -i SLURM_CONF /etc
/etc/systemd/system/slurmctld.service.d/99-cmd.conf:Environment=SLURM_CONF=/cm/shared/apps/slurm/var/etc/slurm/slurm.conf
/etc/systemd/system/slurmdbd.service.d/99-cmd.conf:Environment=SLURM_CONF=/cm/shared/apps/slurm/var/etc/slurm/slurm.conf
/etc/systemd/system/slurmd.service.d/99-cmd.conf:Environment=SLURM_CONF=/cm/shared/apps/slurm/var/etc/slurm/slurm.conf
/etc/logrotate.d/slurmdbd.rpmsave: SLURM_CONF=/cm/shared/apps/slurm/var/etc/slurm/slurm.conf /cm/shared/apps/slurm/current/bin/scontrol reconfig > /dev/null
/etc/logrotate.d/slurm.rpmsave: SLURM_CONF=/cm/shared/apps/slurm/var/etc/slurm/slurm.conf /cm/shared/apps/slurm/current/bin/scontrol reconfig > /dev/null
/etc/pull.pl:$ENV{'SLURM_CONF'} = '/cm/shared/apps/slurm/var/etc/slurm/slurm.conf';

It'd still be good to check on a compute node what echo $SLURM_CONF returns
for you.
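
Keep in mind that an interactive shell and a systemd service do not share an environment, so echo $SLURM_CONF in a login shell may not match what slurmd was started with. A rough sketch of how to compare the two, assuming systemd, pgrep and /proc are available and the unit is named slurmd.service (the helper name slurm_conf_from_unit is made up for this example and is not part of Slurm or BCM):

```shell
#!/usr/bin/env bash
# Hypothetical helper: read systemd unit text on stdin and print the
# SLURM_CONF value found on any "Environment=" line.
slurm_conf_from_unit() {
    sed -n 's/^Environment=.*SLURM_CONF=\([^ "]*\).*/\1/p'
}

# What systemd is configured to pass to slurmd (unit file plus drop-ins).
if command -v systemctl >/dev/null 2>&1; then
    systemctl cat slurmd.service 2>/dev/null | slurm_conf_from_unit
fi

# What a running slurmd actually has in its environment, if there is one.
pid=$(pgrep -o -x slurmd 2>/dev/null || true)
if [ -n "$pid" ] && [ -r "/proc/$pid/environ" ]; then
    tr '\0' '\n' < "/proc/$pid/environ" | grep '^SLURM_CONF=' \
        || echo "SLURM_CONF is not set for slurmd PID $pid"
fi
```

If the first part prints nothing, no Environment=SLURM_CONF line is reaching the unit on that node, which would fit the configless fallback in the log. Reading /proc/<pid>/environ for another user's process normally needs root.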

On Fri, Apr 19, 2024 at 1:50 PM Brian Andrus via slurm-users <
slurm-users@lists.schedmd.com> wrote:

> I would double-check where you are setting SLURM_CONF then. It is acting
> as if it is not set (typo maybe?)
>
> It should be in /etc/default/slurmd (but could be /etc/sysconfig/slurmd).
>
> Also check what the final, actual command being run to start it is. If
> anyone has changed the .service file or added an override file, that will
> affect things.
>
> Brian Andrus
>
>
> On 4/19/2024 10:15 AM, Jeffrey Layton wrote:
>
> I like it, however, it was working before without a slurm.conf in
> /etc/slurm.
>
> Plus the environment variable SLURM_CONF is pointing to the correct
> slurm.conf file (the one in /cm/...). Wouldn't Slurm pick up that one?
>
> Thanks!
>
> Jeff
>
>
> On Fri, Apr 19, 2024 at 1:11 PM Brian Andrus via slurm-users <
> slurm-users@lists.schedmd.com> wrote:
>
>> This is because you have no slurm.conf in /etc/slurm, so it is trying
>> 'configless' which queries DNS to find out where to get the config. It is
>> failing because you do not have DNS configured to tell nodes where to ask
>> about the config.
>>
>> Simple solution: put a copy of slurm.conf in /etc/slurm/ on the node(s).
>>
>> Brian Andrus
>> On 4/19/2024 9:56 AM, Jeffrey Layton via slurm-users wrote:
>>
>> Good afternoon,
>>
>> I'm working on a cluster of NVIDIA DGX A100s that is using BCM 10 (Base
>> Command Manager, which is based on Bright Cluster Manager). I ran into an
>> error and only just learned that Slurm and Weka don't get along (presumably
>> because Weka pins their client threads to cores). I read through their
>> documentation here:
>> https://docs.weka.io/best-practice-guides/weka-and-slurm-integration#heading-h.4d34og8
>>
>> I thought I set everything correctly but when I try to restart the slurm
>> server I get the following:
>>
>> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
>> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error: fetch_config: DNS SRV lookup failed
>> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error: _establish_configuration: failed to load configs
>> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error: slurmd initialization failed
>> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
>> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: fetch_config: DNS SRV lookup failed
>> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: _establish_configuration: failed to load configs
>> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: slurmd initialization failed
>> Apr 19 05:29:39 bcm10-headnode systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
>> Apr 19 05:29:39 bcm10-headnode systemd[1]: slurmd.service: Failed with result 'exit-code'.
>>
>> Has anyone encountered this?
>>
>> I read this is usually associated with configless Slurm, but I don't know
>> how Slurm is built in BCM. slurm.conf is located in
>> /cm/shared/apps/slurm/var/etc/slurm and this is what I edited. The
>> environment variables for Slurm are set correctly so it points to this
>> slurm.conf file.
>>
>> One thing that I did not do was tell Slurm which cores Weka was using. I
>> can't seem to figure out the syntax for this. Can someone share the changes
>> they made to slurm.conf?
>>
>> Thanks!
>>
>> Jeff
>>
>>
>>
>> --
>> slurm-users mailing list -- slurm-users@lists.schedmd.com
>> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
>>
>
