Oh, yes! Sorry, I mixed up your email with Alpha Experiment's emails.

Ahmet M.


On 15.12.2020 21:59, Avery Grieve wrote:
Hi Ahmet,

Thank you for your suggestion. I assume you're talking about the SlurmctldHost field in the slurm.conf file? If so, I've got that variable defined as a hostname, not localhost.
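
For reference, a minimal sketch of the line I mean in slurm.conf, with an illustrative hostname (firemaster is made up, not my actual machine name):

SlurmctldHost=firemaster    # newer form of the deprecated ControlMachine= parameter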

Thanks,
Avery

On Tue, Dec 15, 2020, 1:51 PM mercan <ahmet.mer...@uhem.itu.edu.tr> wrote:

    Hi;

    I don't know whether this is the problem, but I think setting
    "ControlMachine=localhost" and not giving the Slurm master node a
    real hostname are not good decisions. How are the compute nodes
    supposed to work out the IP address of the Slurm master node from
    "localhost"? Also, I suggest avoiding capital letters in anything
    related to Slurm.
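
    As a sketch of what I mean (the names and addresses below are only
    examples): give the master a real hostname that every node can
    resolve, for example from /etc/hosts, and use that name in
    slurm.conf:

        # /etc/hosts on every node
        192.168.1.10   slurmmaster
        192.168.1.11   firenode1

        # slurm.conf
        SlurmctldHost=slurmmaster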

    Ahmet M.


    On 15.12.2020 21:15, Avery Grieve wrote:
    > I changed my .service file to write to a log. The slurm daemons are
    > running (manual start) on the compute nodes. I get this on startup
    > with the service enabled:
    >
    > [2020-12-15T18:09:06.412] slurmctld version 20.11.1 started on cluster cluster
    > [2020-12-15T18:09:06.539] No memory enforcing mechanism configured.
    > [2020-12-15T18:09:06.572] error: get_addr_info: getaddrinfo() failed: Name or service not known
    > [2020-12-15T18:09:06.573] error: slurm_set_addr: Unable to resolve "FireNode1"
    > [2020-12-15T18:09:06.573] error: slurm_get_port: Address family '0' not supported
    > [2020-12-15T18:09:06.573] error: _set_slurmd_addr: failure on FireNode1
    > [2020-12-15T18:09:06.573] error: get_addr_info: getaddrinfo() failed: Name or service not known
    > [2020-12-15T18:09:06.573] error: slurm_set_addr: Unable to resolve "FireNode2"
    > [2020-12-15T18:09:06.573] error: slurm_get_port: Address family '0' not supported
    > [2020-12-15T18:09:06.573] error: _set_slurmd_addr: failure on FireNode2
    > [2020-12-15T18:09:06.573] error: get_addr_info: getaddrinfo() failed: Name or service not known
    > [2020-12-15T18:09:06.573] error: slurm_set_addr: Unable to resolve "FireNode3"
    > [2020-12-15T18:09:06.573] error: slurm_get_port: Address family '0' not supported
    > [2020-12-15T18:09:06.573] error: _set_slurmd_addr: failure on FireNode3
    > [2020-12-15T18:09:06.578] Recovered state of 3 nodes
    > [2020-12-15T18:09:06.579] Recovered information about 0 jobs
    > [2020-12-15T18:09:06.582] Recovered state of 0 reservations
    > [2020-12-15T18:09:06.582] read_slurm_conf: backup_controller not specified
    > [2020-12-15T18:09:06.583] Running as primary controller
    > [2020-12-15T18:09:06.592] No parameter for mcs plugin, default values set
    > [2020-12-15T18:09:06.592] mcs: MCSParameters = (null). ondemand set.
    > [2020-12-15T18:09:06.595] error: get_addr_info: getaddrinfo() failed: Name or service not known
    > [2020-12-15T18:09:06.595] error: slurm_set_addr: Unable to resolve "(null)"
    > [2020-12-15T18:09:06.595] error: slurm_set_port: attempting to set port without address family
    > [2020-12-15T18:09:06.603] error: Error creating slurm stream socket: Address family not supported by protocol
    > [2020-12-15T18:09:06.603] fatal: slurm_init_msg_engine_port error Address family not supported by protocol
    >
    > The main errors seem to be issues resolving host names and not being
    > able to set the port. My /etc/hosts file defines the FireNode[1-3]
    > host IPs and does not contain any IPv6 addresses.
    >
    > My service file includes a clause for "after network-online.target"
    > as well.
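    >
    > For what it's worth, the ordering stanza in my unit file looks
    > roughly like this (a sketch; if I read the systemd docs right,
    > Wants= is needed alongside After=, because After= only orders
    > units and does not by itself pull network-online.target in):
    >
    > # e.g. in an override.conf drop-in for slurmctld.service
    > [Unit]
    > Wants=network-online.target
    > After=network-online.target
    >
    > I believe network-online.target also only really waits for the
    > network when a wait-online service (NetworkManager-wait-online
    > on Fedora) is enabled.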
    >
    > Now, I start the daemon with "systemctl start slurmctld" and end up
    > with the following log:
    >
    > [2020-12-15T18:14:03.448] slurmctld version 20.11.1 started on cluster cluster
    > [2020-12-15T18:14:03.456] No memory enforcing mechanism configured.
    > [2020-12-15T18:14:03.465] Recovered state of 3 nodes
    > [2020-12-15T18:14:03.465] Recovered information about 0 jobs
    > [2020-12-15T18:14:03.465] Recovered state of 0 reservations
    > [2020-12-15T18:14:03.466] read_slurm_conf: backup_controller not specified
    > [2020-12-15T18:14:03.466] Running as primary controller
    > [2020-12-15T18:14:03.466] No parameter for mcs plugin, default values set
    > [2020-12-15T18:14:03.466] mcs: MCSParameters = (null). ondemand set.
    >
    > As you can see, it starts up fine. It seems like something is wrong
    > with the network stack configuration during the initial startup.
    > I'm not really sure where to look to begin troubleshooting these
    > errors. A bit of googling hasn't revealed much either, unfortunately.
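    >
    > A couple of generic checks that might narrow this down (standard
    > glibc/systemd tools, nothing Slurm-specific):
    >
    > getent hosts FireNode1    # exercises the same NSS lookup path getaddrinfo() uses
    > systemd-analyze critical-chain slurmctld.service    # shows what the unit waited for at boot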
    >
    > Any advice?
    >
    > ~Avery Grieve
    > They/Them/Theirs please!
    > University of Michigan
    >
    >
    > On Tue, Dec 15, 2020 at 11:53 AM Avery Grieve <agri...@umich.edu> wrote:
    >
    >     Maybe a silly question, but where do you find the daemon logs or
    >     specify their location?
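    >
    >     (As I understand the docs: slurmctld writes to the file named
    >     by SlurmctldLogFile in slurm.conf and slurmd to SlurmdLogFile;
    >     if those are unset, the daemons log to syslog, so
    >     "journalctl -u slurmctld" should show the messages.)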
    >
    >     ~Avery Grieve
    >     They/Them/Theirs please!
    >     University of Michigan
    >
    >
    >     On Mon, Dec 14, 2020 at 7:22 PM Alpha Experiment
    >     <projectalpha...@gmail.com> wrote:
    >
    >         Hi,
    >
    >         I am trying to run Slurm on Fedora 33. Upon boot, the slurmd
    >         daemon runs correctly; however, the slurmctld daemon
    >         always fails.
    >         [admin@localhost ~]$ systemctl status slurmd.service
    >         ● slurmd.service - Slurm node daemon
    >              Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: disabled)
    >              Active: active (running) since Mon 2020-12-14 16:02:18 PST; 11min ago
    >            Main PID: 2363 (slurmd)
    >               Tasks: 2
    >              Memory: 3.4M
    >                 CPU: 211ms
    >              CGroup: /system.slice/slurmd.service
    >                      └─2363 /usr/local/sbin/slurmd -D
    >         Dec 14 16:02:18 localhost.localdomain systemd[1]: Started Slurm node daemon.
    >         [admin@localhost ~]$ systemctl status slurmctld.service
    >         ● slurmctld.service - Slurm controller daemon
    >              Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
    >             Drop-In: /etc/systemd/system/slurmctld.service.d
    >                      └─override.conf
    >              Active: failed (Result: exit-code) since Mon 2020-12-14 16:02:12 PST; 11min ago
    >             Process: 1972 ExecStart=/usr/local/sbin/slurmctld -D $SLURMCTLD_OPTIONS (code=exited, status=1/FAILURE)
    >            Main PID: 1972 (code=exited, status=1/FAILURE)
    >                 CPU: 21ms
    >         Dec 14 16:02:12 localhost.localdomain systemd[1]: Started Slurm controller daemon.
    >         Dec 14 16:02:12 localhost.localdomain systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
    >         Dec 14 16:02:12 localhost.localdomain systemd[1]: slurmctld.service: Failed with result 'exit-code'.
    >
    >         The slurmctld log is as follows:
    >         [2020-12-14T16:02:12.731] slurmctld version 20.11.1 started on cluster cluster
    >         [2020-12-14T16:02:12.739] No memory enforcing mechanism configured.
    >         [2020-12-14T16:02:12.772] error: get_addr_info: getaddrinfo() failed: Name or service not known
    >         [2020-12-14T16:02:12.772] error: slurm_set_addr: Unable to resolve "localhost"
    >         [2020-12-14T16:02:12.772] error: slurm_get_port: Address family '0' not supported
    >         [2020-12-14T16:02:12.772] error: _set_slurmd_addr: failure on localhost
    >         [2020-12-14T16:02:12.772] Recovered state of 1 nodes
    >         [2020-12-14T16:02:12.772] Recovered information about 0 jobs
    >         [2020-12-14T16:02:12.772] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
    >         [2020-12-14T16:02:12.779] Recovered state of 0 reservations
    >         [2020-12-14T16:02:12.779] read_slurm_conf: backup_controller not specified
    >         [2020-12-14T16:02:12.779] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
    >         [2020-12-14T16:02:12.779] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
    >         [2020-12-14T16:02:12.779] Running as primary controller
    >         [2020-12-14T16:02:12.780] No parameter for mcs plugin, default values set
    >         [2020-12-14T16:02:12.780] mcs: MCSParameters = (null). ondemand set.
    >         [2020-12-14T16:02:12.780] error: get_addr_info: getaddrinfo() failed: Name or service not known
    >         [2020-12-14T16:02:12.780] error: slurm_set_addr: Unable to resolve "(null)"
    >         [2020-12-14T16:02:12.780] error: slurm_set_port: attempting to set port without address family
    >         [2020-12-14T16:02:12.782] error: Error creating slurm stream socket: Address family not supported by protocol
    >         [2020-12-14T16:02:12.782] fatal: slurm_init_msg_engine_port error Address family not supported by protocol
    >
    >         Strangely, the daemon works fine when it is restarted. After running
    >         systemctl restart slurmctld.service
    >
    >         the service status is
    >         [admin@localhost ~]$ systemctl status slurmctld.service
    >         ● slurmctld.service - Slurm controller daemon
    >              Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
    >             Drop-In: /etc/systemd/system/slurmctld.service.d
    >                      └─override.conf
    >              Active: active (running) since Mon 2020-12-14 16:14:24 PST; 3s ago
    >            Main PID: 2815 (slurmctld)
    >               Tasks: 7
    >              Memory: 1.9M
    >                 CPU: 15ms
    >              CGroup: /system.slice/slurmctld.service
    >                      └─2815 /usr/local/sbin/slurmctld -D
    >         Dec 14 16:14:24 localhost.localdomain systemd[1]: Started Slurm controller daemon.
    >
    >         Could anyone point me towards how to fix this? I expect it's
    >         just an issue with my configuration file, which I've copied
    >         below for reference.
    >         # slurm.conf file generated by configurator easy.html.
    >         # Put this file on all nodes of your cluster.
    >         # See the slurm.conf man page for more information.
    >         #
    >         #SlurmctldHost=localhost
    >         ControlMachine=localhost
    >         #
    >         #MailProg=/bin/mail
    >         MpiDefault=none
    >         #MpiParams=ports=#-#
    >         ProctrackType=proctrack/cgroup
    >         ReturnToService=1
    >         SlurmctldPidFile=/home/slurm/run/slurmctld.pid
    >         #SlurmctldPort=6817
    >         SlurmdPidFile=/home/slurm/run/slurmd.pid
    >         #SlurmdPort=6818
    >         SlurmdSpoolDir=/var/spool/slurm/slurmd/
    >         SlurmUser=slurm
    >         #SlurmdUser=root
    >         StateSaveLocation=/home/slurm/spool/
    >         SwitchType=switch/none
    >         TaskPlugin=task/affinity
    >         #
    >         #
    >         # TIMERS
    >         #KillWait=30
    >         #MinJobAge=300
    >         #SlurmctldTimeout=120
    >         #SlurmdTimeout=300
    >         #
    >         #
    >         # SCHEDULING
    >         SchedulerType=sched/backfill
    >         SelectType=select/cons_tres
    >         SelectTypeParameters=CR_Core
    >         #
    >         #
    >         # LOGGING AND ACCOUNTING
    >         AccountingStorageType=accounting_storage/none
    >         ClusterName=cluster
    >         #JobAcctGatherFrequency=30
    >         JobAcctGatherType=jobacct_gather/none
    >         #SlurmctldDebug=info
    >         SlurmctldLogFile=/home/slurm/log/slurmctld.log
    >         #SlurmdDebug=info
    >         #SlurmdLogFile=
    >         #
    >         #
    >         # COMPUTE NODES
    >         NodeName=localhost CPUs=128 RealMemory=257682 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN
    >         PartitionName=full Nodes=localhost Default=YES MaxTime=INFINITE State=UP
    >
    >         Thanks!
    >         -John
    >

