I like it; however, it was working before without a slurm.conf in /etc/slurm.
Plus, the environment variable SLURM_CONF is pointing to the correct slurm.conf file (the one in /cm/...). Wouldn't Slurm pick up that one?

Thanks!

Jeff

On Fri, Apr 19, 2024 at 1:11 PM Brian Andrus via slurm-users <slurm-users@lists.schedmd.com> wrote:

> This is because you have no slurm.conf in /etc/slurm, so it is trying
> 'configless', which queries DNS to find out where to get the config. It is
> failing because you do not have DNS configured to tell nodes where to ask
> about the config.
>
> Simple solution: put a copy of slurm.conf in /etc/slurm/ on the node(s).
>
> Brian Andrus
>
> On 4/19/2024 9:56 AM, Jeffrey Layton via slurm-users wrote:
>
> Good afternoon,
>
> I'm working on a cluster of NVIDIA DGX A100s that is using BCM 10 (Base
> Command Manager, which is based on Bright Cluster Manager). I ran into an
> error and only just learned that Slurm and Weka don't get along (presumably
> because Weka pins their client threads to cores). I read through their
> documentation here:
> https://docs.weka.io/best-practice-guides/weka-and-slurm-integration#heading-h.4d34og8
>
> I thought I set everything correctly, but when I try to restart the slurm
> server I get the following:
>
> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error: fetch_config: DNS SRV lookup failed
> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error: _establish_configuration: failed to load configs
> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error: slurmd initialization failed
> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: fetch_config: DNS SRV lookup failed
> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: _establish_configuration: failed to load configs
> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: slurmd initialization failed
> Apr 19 05:29:39 bcm10-headnode systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
> Apr 19 05:29:39 bcm10-headnode systemd[1]: slurmd.service: Failed with result 'exit-code'.
>
> Has anyone encountered this?
>
> I read this is usually associated with configless Slurm, but I don't know
> how Slurm is built in BCM. slurm.conf is located in
> /cm/shared/apps/slurm/var/etc/slurm and this is what I edited. The
> environment variables for Slurm are set correctly so it points to this
> slurm.conf file.
>
> One thing that I did not do was tell Slurm which cores Weka was using. I
> can't seem to figure out the syntax for this. Can someone share the changes
> they made to slurm.conf?
>
> Thanks!
>
> Jeff
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
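On the question at the top of this message (whether slurmd would pick up SLURM_CONF), here is a rough Python sketch of the lookup order as I understand it from the Slurm configless documentation: an explicit --conf-server option, then -f, then SLURM_CONF, then the default slurm.conf path, and only then the DNS SRV lookup that shows up in the log. This is a model for reasoning about the problem, not Slurm's actual code. The function name guess_config_source and the /etc/slurm/slurm.conf default are my assumptions (the real default is compiled in and depends on the build), and the step where slurmd checks for a previously cached config is left out.

```python
import os
from typing import Callable, Mapping, Sequence

# Assumed default location; the real compiled-in path depends on the build.
DEFAULT_CONF = "/etc/slurm/slurm.conf"


def guess_config_source(
    argv: Sequence[str],
    env: Mapping[str, str],
    default_conf: str = DEFAULT_CONF,
    exists: Callable[[str], bool] = os.path.exists,
) -> str:
    """Return a label for where slurmd would probably look first (hypothetical helper)."""
    args = list(argv)

    # 1. An explicit --conf-server option requests configless mode directly.
    if any(a == "--conf-server" or a.startswith("--conf-server=") for a in args):
        return "conf-server"

    # 2. An explicit -f <file> option.
    if "-f" in args:
        i = args.index("-f")
        if i + 1 < len(args):
            return "file:" + args[i + 1]

    # 3. SLURM_CONF, but only if it is set in the daemon's own environment.
    conf = env.get("SLURM_CONF")
    if conf:
        return "file:" + conf

    # 4. The default slurm.conf, if it exists on this node.
    if exists(default_conf):
        return "file:" + default_conf

    # 5. Nothing found locally: fall back to the DNS SRV lookup.
    return "dns-srv"


if __name__ == "__main__":
    # Empty environment and no default file: falls through to DNS SRV.
    print(guess_config_source(["slurmd"], {}, exists=lambda p: False))
```

The env argument is the part worth checking here. If slurmd is started by systemd, it sees the environment defined for the unit, which is not necessarily the one in an interactive shell. So SLURM_CONF being set in a login session does not by itself show that the daemon sees it.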