I would double-check where you are setting SLURM_CONF then. It is acting as if it is not set (typo maybe?)

It should be set in /etc/default/slurmd (but could be /etc/sysconfig/slurmd, depending on the distribution).

Also check the actual command that ends up being run to start slurmd. If anyone has changed the .service file or added an override file, that will affect things.
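
A minimal sketch of that check in Python, assuming the environment files live at the two paths commonly named by the packaged slurmd unit (adjust the list to whatever EnvironmentFile= lines your unit and any override actually contain). The helper name find_slurm_conf is hypothetical. It only looks for a plain SLURM_CONF= assignment and reports whether the file it points to exists.

```python
import os

# Assumed locations; adjust to the EnvironmentFile= lines that your
# slurmd unit (and any override/drop-in file) actually names.
CANDIDATE_ENV_FILES = ("/etc/default/slurmd", "/etc/sysconfig/slurmd")


def find_slurm_conf(paths=CANDIDATE_ENV_FILES):
    """Return (env_file, value, target_exists) for each SLURM_CONF line found."""
    hits = []
    for path in paths:
        try:
            with open(path, encoding="utf-8") as fh:
                lines = fh.read().splitlines()
        except OSError:
            continue  # file missing or unreadable: nothing is set here
        for line in lines:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            name, sep, value = line.partition("=")
            if sep and name.strip() == "SLURM_CONF":
                # Drop optional surrounding quotes from the value.
                value = value.strip().strip("\"'")
                hits.append((path, value, os.path.isfile(value)))
    return hits


if __name__ == "__main__":
    found = find_slurm_conf()
    if not found:
        print("SLURM_CONF is not set in any of the candidate files")
    for env_file, value, exists in found:
        status = "exists" if exists else "MISSING"
        print(f"{env_file}: SLURM_CONF={value} ({status})")
```

A misspelled variable name (for example SLURM_CONFIG) produces no hit at all, which is the "acting as if it is not set" case.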

Brian Andrus


On 4/19/2024 10:15 AM, Jeffrey Layton wrote:
I like it; however, it was working before without a slurm.conf in /etc/slurm.

Plus the environment variable SLURM_CONF is pointing to the correct slurm.conf file (the one in /cm/...). Wouldn't Slurm pick up that one?

Thanks!

Jeff


On Fri, Apr 19, 2024 at 1:11 PM Brian Andrus via slurm-users <slurm-users@lists.schedmd.com> wrote:

    This is because you have no slurm.conf in /etc/slurm, so it is
    trying 'configless', which queries DNS to find out where to get the
    config. It is failing because you do not have DNS configured to
    tell nodes where to ask about the config.

    Simple solution: put a copy of slurm.conf in /etc/slurm/ on the
    node(s).
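
For reference, the fallback described above matches the lookup order given in the Slurm configless documentation, as far as I understand it: the --conf-server option, then the -f option, then the SLURM_CONF environment variable, then the default slurm.conf path, and only then DNS SRV records. A rough Python sketch of that order follows. The function and its parameters are illustrative and are not Slurm's own code, and the default path is an assumption because the real one is fixed when Slurm is built.

```python
import os


def pick_config_source(conf_server=None, config_file=None, environ=None,
                       default_conf="/etc/slurm/slurm.conf"):
    """Return (source, detail) naming where slurmd would look first.

    Mirrors the documented precedence only; it does not contact anything.
    default_conf is an assumption: the real default is set at build time.
    """
    environ = os.environ if environ is None else environ
    if conf_server:                        # slurmd --conf-server host[:port]
        return ("conf-server", conf_server)
    if config_file:                        # slurmd -f /path/to/slurm.conf
        return ("file-option", config_file)
    if environ.get("SLURM_CONF"):          # environment of the slurmd process
        return ("SLURM_CONF", environ["SLURM_CONF"])
    if os.path.isfile(default_conf):       # default slurm.conf location
        return ("default-file", default_conf)
    return ("dns-srv", "_slurmctld._tcp")  # configless lookup via DNS SRV


if __name__ == "__main__":
    print(pick_config_source())
```

Note that the environment that matters is the one systemd hands to slurmd, not the one in a login shell. If the variable is exported in a shell profile but missing from the unit's environment, the daemon still falls through to the DNS SRV step.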

    Brian Andrus

    On 4/19/2024 9:56 AM, Jeffrey Layton via slurm-users wrote:
    Good afternoon,

    I'm working on a cluster of NVIDIA DGX A100's that is using BCM
    10 (Base Command Manager which is based on Bright Cluster
    Manager). I ran into an error and only just learned that Slurm
    and Weka don't get along (presumably because Weka pins their
    client threads to cores). I read through their documentation
    here:
    https://docs.weka.io/best-practice-guides/weka-and-slurm-integration#heading-h.4d34og8

    I thought I set everything correctly, but when I try to restart
    the slurm server I get the following:

    Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error:
    resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
    Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error:
    fetch_config: DNS SRV lookup failed
    Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error:
    _establish_configuration: failed to load configs
    Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error:
    slurmd initialization failed
    Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error:
    resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
    Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error:
    fetch_config: DNS SRV lookup failed
    Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error:
    _establish_configuration: failed to load configs
    Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: slurmd
    initialization failed
    Apr 19 05:29:39 bcm10-headnode systemd[1]: slurmd.service: Main
    process exited, code=exited, status=1/FAILURE
    Apr 19 05:29:39 bcm10-headnode systemd[1]: slurmd.service: Failed
    with result 'exit-code'.

    Has anyone encountered this?

    I read this is usually associated with configless Slurm, but I
    don't know how Slurm is built in BCM. slurm.conf is located in
    /cm/shared/apps/slurm/var/etc/slurm and this is what I edited.
    The environment variables for Slurm are set correctly so it
    points to this slurm.conf file.

    One thing that I did not do was tell Slurm which cores Weka was
    using. I can't seem to figure out the syntax for this. Can someone
    share the changes they made to slurm.conf?

    Thanks!

    Jeff



-- slurm-users mailing list -- slurm-users@lists.schedmd.com
    To unsubscribe send an email to slurm-users-le...@lists.schedmd.com