On Bright it's set in a few places:

grep -r -i SLURM_CONF /etc
/etc/systemd/system/slurmctld.service.d/99-cmd.conf:Environment=SLURM_CONF=/cm/shared/apps/slurm/var/etc/slurm/slurm.conf
/etc/systemd/system/slurmdbd.service.d/99-cmd.conf:Environment=SLURM_CONF=/cm/shared/apps/slurm/var/etc/slurm/slurm.conf
/etc/systemd/system/slurmd.service.d/99-cmd.conf:Environment=SLURM_CONF=/cm/shared/apps/slurm/var/etc/slurm/slurm.conf
/etc/logrotate.d/slurmdbd.rpmsave: SLURM_CONF=/cm/shared/apps/slurm/var/etc/slurm/slurm.conf /cm/shared/apps/slurm/current/bin/scontrol reconfig > /dev/null
/etc/logrotate.d/slurm.rpmsave: SLURM_CONF=/cm/shared/apps/slurm/var/etc/slurm/slurm.conf /cm/shared/apps/slurm/current/bin/scontrol reconfig > /dev/null
/etc/pull.pl:$ENV{'SLURM_CONF'} = '/cm/shared/apps/slurm/var/etc/slurm/slurm.conf';
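
Since slurmd is started by systemd, what matters is the Environment= line in its drop-in, not what a login shell exports. As a rough sketch (check_slurm_conf is a made-up helper name, not part of Slurm or BCM, and it assumes the unquoted Environment=SLURM_CONF=... form shown in the grep output above), something like this lists each SLURM_CONF value set under a systemd directory and says whether that file is readable on the node:

```shell
#!/bin/sh
# Hypothetical helper, not shipped with Slurm or BCM.
# Scans systemd unit and drop-in files under a directory (default:
# /etc/systemd/system) for unquoted "Environment=SLURM_CONF=<path>" lines
# and reports whether each <path> is readable on this node.
check_slurm_conf() {
    dir="${1:-/etc/systemd/system}"
    # -r recurse, -h omit file names, -s stay quiet about unreadable files
    grep -rhs '^Environment=SLURM_CONF=' "$dir" |
        sed 's/^Environment=SLURM_CONF=//' |
        sort -u |
        while IFS= read -r conf; do
            if [ -r "$conf" ]; then
                echo "OK $conf"
            else
                echo "MISSING $conf"
            fi
        done
}
```

On a compute node you would call it as check_slurm_conf /etc/systemd/system. A MISSING line would suggest the shared path is not mounted or the file is not readable there; no output at all would suggest the drop-in is absent on that node. To see what systemd actually hands to the daemon, systemctl show slurmd -p Environment is another thing to compare against.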

It'd still be good to check on a compute node what echo $SLURM_CONF returns for you.

On Fri, Apr 19, 2024 at 1:50 PM Brian Andrus via slurm-users <slurm-users@lists.schedmd.com> wrote:

> I would double-check where you are setting SLURM_CONF then. It is acting
> as if it is not set (typo maybe?)
>
> It should be in /etc/default/slurmd (but could be /etc/sysconfig/slurmd).
>
> Also check what the final, actual command being run to start it is. If
> anyone has changed the .service file or added an override file, that will
> affect things.
>
> Brian Andrus
>
> On 4/19/2024 10:15 AM, Jeffrey Layton wrote:
>
> I like it; however, it was working before without a slurm.conf in
> /etc/slurm.
>
> Plus the environment variable SLURM_CONF is pointing to the correct
> slurm.conf file (the one in /cm/...). Wouldn't Slurm pick up that one?
>
> Thanks!
>
> Jeff
>
> On Fri, Apr 19, 2024 at 1:11 PM Brian Andrus via slurm-users <
> slurm-users@lists.schedmd.com> wrote:
>
>> This is because you have no slurm.conf in /etc/slurm, so it is trying
>> 'configless', which queries DNS to find out where to get the config. It is
>> failing because you do not have DNS configured to tell nodes where to ask
>> about the config.
>>
>> Simple solution: put a copy of slurm.conf in /etc/slurm/ on the node(s).
>>
>> Brian Andrus
>>
>> On 4/19/2024 9:56 AM, Jeffrey Layton via slurm-users wrote:
>>
>> Good afternoon,
>>
>> I'm working on a cluster of NVIDIA DGX A100s that is using BCM 10 (Base
>> Command Manager, which is based on Bright Cluster Manager). I ran into an
>> error and only just learned that Slurm and Weka don't get along (presumably
>> because Weka pins its client threads to cores). I read through their
>> documentation here:
>> https://docs.weka.io/best-practice-guides/weka-and-slurm-integration#heading-h.4d34og8
>>
>> I thought I set everything correctly, but when I try to restart the slurm
>> server I get the following:
>>
>> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
>> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error: fetch_config: DNS SRV lookup failed
>> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error: _establish_configuration: failed to load configs
>> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error: slurmd initialization failed
>> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
>> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: fetch_config: DNS SRV lookup failed
>> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: _establish_configuration: failed to load configs
>> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: slurmd initialization failed
>> Apr 19 05:29:39 bcm10-headnode systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
>> Apr 19 05:29:39 bcm10-headnode systemd[1]: slurmd.service: Failed with result 'exit-code'.
>>
>> Has anyone encountered this?
>>
>> I read this is usually associated with configless Slurm, but I don't know
>> how Slurm is built in BCM. slurm.conf is located in
>> /cm/shared/apps/slurm/var/etc/slurm and this is what I edited. The
>> environment variables for Slurm are set correctly so it points to this
>> slurm.conf file.
>>
>> One thing that I did not do was tell Slurm which cores Weka was using. I
>> can't seem to figure out the syntax for this. Can someone share the changes
>> they made to slurm.conf?
>>
>> Thanks!
>>
>> Jeff
>>
>> --
>> slurm-users mailing list -- slurm-users@lists.schedmd.com
>> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com