> It seems like the pidfile in systemd and slurm.conf are different. Check
> if they are the same and if not adjust the slurm.conf pid files. That
> should prevent systemd from killing slurm.

Ehm, sorry, how can I do this?
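Is it just a matter of comparing the PIDFile= lines in the unit files with the
SlurmctldPidFile / SlurmdPidFile entries in slurm.conf? I guess something like
this (the slurmctld unit path is the one systemctl reports below; the slurmd
one is only my guess, and on the master only the first may exist):

  grep -i pidfile /lib/systemd/system/slurmctld.service /lib/systemd/system/slurmd.service
  grep -i pidfile /etc/slurm-llnl/slurm.conf

and then editing the slurm.conf entries so they match whatever the unit files
say?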
> On Mon, 15 Jan 2018, 18:24 Elisabetta Falivene <e.faliv...@ilabroma.com> wrote:
>
>> The deeper I go into the problem, the worse it seems... but maybe I'm a
>> step closer to the solution.
>>
>> I discovered that munge was disabled on the nodes (my fault, Gennaro
>> pointed out the problem before, but I enabled it back only on the master).
>> Btw, it's very strange that the wheezy->jessie upgrade disabled munge on
>> all the nodes and the master...
>>
>> Unfortunately, re-enabling munge on the nodes didn't make slurmd start
>> again.
>>
>> Maybe filling in this setting could give me some info about the problem?
>> #SlurmdLogFile=
>>
>> Thank you very much for your help. It is very precious to me.
>> betta
>>
>> Ps: some tests I made ->
>>
>> Running on the nodes
>>
>> slurmd -Dvvv
>>
>> returns
>>
>> slurmd: debug2: hwloc_topology_init
>> slurmd: debug2: hwloc_topology_load
>> slurmd: Considering each NUMA node as a socket
>> slurmd: debug: CPUs:16 Boards:1 Sockets:4 CoresPerSocket:4 ThreadsPerCore:1
>> slurmd: Node configuration differs from hardware: CPUs=16:16(hw) Boards=1:1(hw) SocketsPerBoard=16:4(hw) CoresPerSocket=1:4(hw) ThreadsPerCore=1:1(hw)
>> slurmd: topology NONE plugin loaded
>> slurmd: Gathering cpu frequency information for 16 cpus
>> slurmd: debug: Reading cgroup.conf file /etc/slurm-llnl/cgroup.conf
>> slurmd: debug2: hwloc_topology_init
>> slurmd: debug2: hwloc_topology_load
>> slurmd: Considering each NUMA node as a socket
>> slurmd: debug: CPUs:16 Boards:1 Sockets:4 CoresPerSocket:4 ThreadsPerCore:1
>> slurmd: debug: Reading cgroup.conf file /etc/slurm-llnl/cgroup.conf
>> slurmd: debug: task/cgroup: now constraining jobs allocated cores
>> slurmd: task/cgroup: loaded
>> slurmd: auth plugin for Munge (http://code.google.com/p/munge/) loaded
>> slurmd: debug: spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
>> slurmd: Munge cryptographic signature plugin loaded
>> slurmd: Warning: Core limit is only 0 KB
>> slurmd: slurmd version 14.03.9 started
>> slurmd: Job accounting gather LINUX plugin loaded
>> slurmd: debug: job_container none plugin loaded
>> slurmd: switch NONE plugin loaded
>> slurmd: slurmd started on Mon, 15 Jan 2018 18:07:17 +0100
>> slurmd: CPUs=16 Boards=1 Sockets=16 Cores=1 Threads=1 Memory=15999 TmpDisk=40189 Uptime=1254
>> slurmd: AcctGatherEnergy NONE plugin loaded
>> slurmd: AcctGatherProfile NONE plugin loaded
>> slurmd: AcctGatherInfiniband NONE plugin loaded
>> slurmd: AcctGatherFilesystem NONE plugin loaded
>> slurmd: debug2: No acct_gather.conf file (/etc/slurm-llnl/acct_gather.conf)
>> slurmd: debug2: _slurm_connect failed: Connection refused
>> slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
>> slurmd: debug: Failed to contact primary controller: Connection refused
>> slurmd: debug2: _slurm_connect failed: Connection refused
>> slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
>> slurmd: debug: Failed to contact primary controller: Connection refused
>> slurmd: debug2: _slurm_connect failed: Connection refused
>> slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
>> slurmd: debug: Failed to contact primary controller: Connection refused
>> slurmd: debug2: _slurm_connect failed: Connection refused
>> slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
>> slurmd: debug: Failed to contact primary controller: Connection refused
>> slurmd: debug2: _slurm_connect failed: Connection refused
>> slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
>> slurmd: debug: Failed to contact primary controller: Connection refused
>> ^Cslurmd: got shutdown request
>> slurmd: waiting on 1 active threads
>> slurmd: debug2: _slurm_connect failed: Connection refused
>> slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
>> slurmd: debug: Failed to contact primary controller: Connection refused
>> slurmd: debug2: _slurm_connect failed: Connection refused
>> slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
>> slurmd: debug: Failed to contact primary controller: Connection refused
>> slurmd: debug2: _slurm_connect failed: Connection refused
>> slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
>> slurmd: debug: Failed to contact primary controller: Connection refused
>> ^C^C^C^Cslurmd: debug2: _slurm_connect failed: Connection refused
>> slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
>> slurmd: debug: Failed to contact primary controller: Connection refused
>> slurmd: debug2: _slurm_connect failed: Connection refused
>> slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
>> slurmd: debug: Failed to contact primary controller: Connection refused
>> slurmd: error: Unable to register: Unable to contact slurm controller (connect failure)
>> slurmd: debug: Unable to register with slurm controller, retrying
>> slurmd: all threads complete
>> slurmd: Consumable Resources (CR) Node Selection plugin shutting down ...
>> slurmd: Munge cryptographic signature plugin unloaded
>> slurmd: Slurmd shutdown completing
>>
>> which maybe is not as bad as it seems, for it may only point out that
>> slurm is not up on the master, isn't it?
>>
>> On the master, running
>>
>> service slurmctld restart
>>
>> returns
>>
>> Job for slurmctld.service failed. See 'systemctl status slurmctld.service' and 'journalctl -xn' for details.
>>
>> and
>>
>> service slurmctld status
>>
>> returns
>>
>> slurmctld.service - Slurm controller daemon
>>    Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled)
>>    Active: failed (Result: timeout) since Mon 2018-01-15 18:11:20 CET; 44s ago
>>   Process: 2223 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
>>
>>  slurmctld[2225]: cons_res: select_p_reconfigure
>>  slurmctld[2225]: cons_res: select_p_node_init
>>  slurmctld[2225]: cons_res: preparing for 1 partitions
>>  slurmctld[2225]: Running as primary controller
>>  slurmctld[2225]: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
>>  systemd[1]: slurmctld.service start operation timed out. Terminating.
>>  slurmctld[2225]: Terminate signal (SIGINT or SIGTERM) received
>>  slurmctld[2225]: Saving all slurm state
>>  systemd[1]: Failed to start Slurm controller daemon.
>>  systemd[1]: Unit slurmctld.service entered failed state.
>>
>> and
>>
>> journalctl -xn
>>
>> returns no visible error
>>
>> -- Logs begin at Mon 2018-01-15 18:04:38 CET, end at Mon 2018-01-15 18:17:33 CET. --
>> Jan 15 18:16:23 anyone.phys.uniroma1.it slurmctld[2286]: Saving all slurm state
>> Jan 15 18:16:23 anyone.phys.uniroma1.it systemd[1]: Failed to start Slurm controller daemon.
>> -- Subject: Unit slurmctld.service has failed
>> -- Defined-By: systemd
>> -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
>> --
>> -- Unit slurmctld.service has failed.
>> --
>> -- The result is failed.
>>  systemd[1]: Unit slurmctld.service entered failed state.
>>  CRON[2312]: pam_unix(cron:session): session opened for user root by (uid=0)
>> CRON[2313]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
>> CRON[2312]: pam_unix(cron:session): session closed for user root
>> dhcpd[1538]: DHCPREQUEST for 192.168.1.101 from c8:60:00:32:c6:c4 via eth1
>> dhcpd[1538]: DHCPACK on 192.168.1.101 to c8:60:00:32:c6:c4 via eth1
>> dhcpd[1538]: DHCPREQUEST for 192.168.1.102 from bc:ae:c5:12:97:75 via eth1
>> dhcpd[1538]: DHCPACK on 192.168.1.102 to bc:ae:c5:12:97:75 via eth1
>>
>> 2018-01-15 16:56 GMT+01:00 Carlos Fenoy <mini...@gmail.com>:
>>
>>> Hi,
>>>
>>> you cannot start slurmd on the headnode. Try running the same
>>> command on the compute nodes and check the output. If there is any
>>> issue it should display the reason.
>>>
>>> Regards,
>>> Carlos
>>>
>>> On Mon, Jan 15, 2018 at 4:50 PM, Elisabetta Falivene <e.faliv...@ilabroma.com> wrote:
>>>
>>>> On the headnode. (I'm also noticing, and it seems worth mentioning since
>>>> maybe the problem is the same, that even ldap is not working as expected,
>>>> giving an "invalid credential (49)" message. The update to jessie must
>>>> have touched something that is affecting all my software sanity :D )
>>>>
>>>> Here is my slurm.conf.
>>>>
>>>> # slurm.conf file generated by configurator.html.
>>>> # Put this file on all nodes of your cluster.
>>>> # See the slurm.conf man page for more information.
>>>> #
>>>> ControlMachine=anyone
>>>> ControlAddr=master
>>>> #BackupController=
>>>> #BackupAddr=
>>>> #
>>>> AuthType=auth/munge
>>>> CacheGroups=0
>>>> #CheckpointType=checkpoint/none
>>>> CryptoType=crypto/munge
>>>> #DisableRootJobs=NO
>>>> #EnforcePartLimits=NO
>>>> #Epilog=
>>>> #EpilogSlurmctld=
>>>> #FirstJobId=1
>>>> #MaxJobId=999999
>>>> #GresTypes=
>>>> #GroupUpdateForce=0
>>>> #GroupUpdateTime=600
>>>> #JobCheckpointDir=/var/slurm/checkpoint
>>>> #JobCredentialPrivateKey=
>>>> #JobCredentialPublicCertificate=
>>>> #JobFileAppend=0
>>>> #JobRequeue=1
>>>> #JobSubmitPlugins=1
>>>> #KillOnBadExit=0
>>>> #Licenses=foo*4,bar
>>>> #MailProg=/bin/mail
>>>> #MaxJobCount=5000
>>>> #MaxStepCount=40000
>>>> #MaxTasksPerNode=128
>>>> MpiDefault=openmpi
>>>> MpiParams=ports=12000-12999
>>>> #PluginDir=
>>>> #PlugStackConfig=
>>>> #PrivateData=jobs
>>>> ProctrackType=proctrack/cgroup
>>>> #Prolog=
>>>> #PrologSlurmctld=
>>>> #PropagatePrioProcess=0
>>>> #PropagateResourceLimits=
>>>> #PropagateResourceLimitsExcept=
>>>> ReturnToService=2
>>>> #SallocDefaultCommand=
>>>> SlurmctldPidFile=/var/run/slurmctld.pid
>>>> SlurmctldPort=6817
>>>> SlurmdPidFile=/var/run/slurmd.pid
>>>> SlurmdPort=6818
>>>> SlurmdSpoolDir=/tmp/slurmd
>>>> SlurmUser=slurm
>>>> #SlurmdUser=root
>>>> #SrunEpilog=
>>>> #SrunProlog=
>>>> StateSaveLocation=/tmp
>>>> SwitchType=switch/none
>>>> #TaskEpilog=
>>>> TaskPlugin=task/cgroup
>>>> #TaskPluginParam=
>>>> #TaskProlog=
>>>> #TopologyPlugin=topology/tree
>>>> #TmpFs=/tmp
>>>> #TrackWCKey=no
>>>> #TreeWidth=
>>>> #UnkillableStepProgram=
>>>> #UsePAM=0
>>>> #
>>>> #
>>>> # TIMERS
>>>> #BatchStartTimeout=10
>>>> #CompleteWait=0
>>>> #EpilogMsgTime=2000
>>>> #GetEnvTimeout=2
>>>> #HealthCheckInterval=0
>>>> #HealthCheckProgram=
>>>> InactiveLimit=0
>>>> KillWait=60
>>>> #MessageTimeout=10
>>>> #ResvOverRun=0
>>>> MinJobAge=43200
>>>> #OverTimeLimit=0
>>>> SlurmctldTimeout=600
>>>> SlurmdTimeout=600
>>>> #UnkillableStepTimeout=60
>>>> #VSizeFactor=0
>>>> Waittime=0
>>>> #
>>>> #
>>>> # SCHEDULING
>>>> DefMemPerCPU=1000
>>>> FastSchedule=1
>>>> #MaxMemPerCPU=0
>>>> #SchedulerRootFilter=1
>>>> #SchedulerTimeSlice=30
>>>> SchedulerType=sched/backfill
>>>> #SchedulerPort=
>>>> SelectType=select/cons_res
>>>> SelectTypeParameters=CR_CPU_Memory
>>>> #
>>>> #
>>>> # JOB PRIORITY
>>>> #PriorityType=priority/basic
>>>> #PriorityDecayHalfLife=
>>>> #PriorityCalcPeriod=
>>>> #PriorityFavorSmall=
>>>> #PriorityMaxAge=
>>>> #PriorityUsageResetPeriod=
>>>> #PriorityWeightAge=
>>>> #PriorityWeightFairshare=
>>>> #PriorityWeightJobSize=
>>>> #PriorityWeightPartition=
>>>> #PriorityWeightQOS=
>>>> #
>>>> #
>>>> # LOGGING AND ACCOUNTING
>>>> #AccountingStorageEnforce=0
>>>> #AccountingStorageHost=
>>>> AccountingStorageLoc=/var/log/slurm-llnl/AccountingStorage.log
>>>> #AccountingStoragePass=
>>>> #AccountingStoragePort=
>>>> AccountingStorageType=accounting_storage/filetxt
>>>> #AccountingStorageUser=
>>>> AccountingStoreJobComment=YES
>>>> ClusterName=cluster
>>>> #DebugFlags=
>>>> #JobCompHost=
>>>> JobCompLoc=/var/log/slurm-llnl/JobComp.log
>>>> #JobCompPass=
>>>> #JobCompPort=
>>>> JobCompType=jobcomp/filetxt
>>>> #JobCompUser=
>>>> JobAcctGatherFrequency=60
>>>> JobAcctGatherType=jobacct_gather/linux
>>>> SlurmctldDebug=3
>>>> #SlurmctldLogFile=
>>>> SlurmdDebug=3
>>>> #SlurmdLogFile=
>>>> #SlurmSchedLogFile=
>>>> #SlurmSchedLogLevel=
>>>> #
>>>> #
>>>> # POWER SAVE SUPPORT FOR IDLE NODES (optional)
>>>> #SuspendProgram=
>>>> #ResumeProgram=
>>>> #SuspendTimeout=
>>>> #ResumeTimeout=
>>>> #ResumeRate=
>>>> #SuspendExcNodes=
>>>> #SuspendExcParts=
>>>> #SuspendRate=
>>>> #SuspendTime=
>>>> #
>>>> #
>>>> # COMPUTE NODES
>>>> NodeName=node[01-08] CPUs=16 RealMemory=16000 State=UNKNOWN
>>>> PartitionName=batch Nodes=node[01-08] Default=YES MaxTime=INFINITE State=UP
>>>>
>>>>
>>>> 2018-01-15 16:43 GMT+01:00 Carlos Fenoy <mini...@gmail.com>:
>>>>
>>>>> Are you trying to start slurmd on the headnode or a compute node?
>>>>>
>>>>> Can you provide the slurm.conf file?
>>>>>
>>>>> Regards,
>>>>> Carlos
>>>>>
>>>>> On Mon, Jan 15, 2018 at 4:30 PM, Elisabetta Falivene <e.faliv...@ilabroma.com> wrote:
>>>>>
>>>>>> slurmd -Dvvv says
>>>>>>
>>>>>> slurmd: fatal: Unable to determine this slurmd's NodeName
>>>>>>
>>>>>> b
>>>>>>
>>>>>> 2018-01-15 15:58 GMT+01:00 Douglas Jacobsen <dmjacob...@lbl.gov>:
>>>>>>
>>>>>>> The fact that sinfo is responding shows that at least slurmctld is
>>>>>>> running. Slurmd, on the other hand, is not. Please also get the output
>>>>>>> of the slurmd log or of running "slurmd -Dvvv".
>>>>>>>
>>>>>>> On Jan 15, 2018 06:42, "Elisabetta Falivene" <e.faliv...@ilabroma.com> wrote:
>>>>>>>
>>>>>>>> > Anyway I suggest to update the operating system to stretch and
>>>>>>>> > fix your configuration under a more recent version of slurm.
>>>>>>>>
>>>>>>>> I think I'll soon arrive at that :)
>>>>>>>> b
>>>>>>>>
>>>>>>>> 2018-01-15 14:08 GMT+01:00 Gennaro Oliva <oliv...@na.icar.cnr.it>:
>>>>>>>>
>>>>>>>>> Ciao Elisabetta,
>>>>>>>>>
>>>>>>>>> On Mon, Jan 15, 2018 at 01:13:27PM +0100, Elisabetta Falivene wrote:
>>>>>>>>> > Error messages are not much helping me in guessing what is going
>>>>>>>>> > on. What should I check to find out what is failing?
>>>>>>>>>
>>>>>>>>> Check slurmctld.log and slurmd.log; you can find them under
>>>>>>>>> /var/log/slurm-llnl
>>>>>>>>>
>>>>>>>>> > PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
>>>>>>>>> > batch up infinite 8 unk* node[01-08]
>>>>>>>>> >
>>>>>>>>> > Running
>>>>>>>>> >
>>>>>>>>> > systemctl status slurmctld.service
>>>>>>>>> >
>>>>>>>>> > returns
>>>>>>>>> >
>>>>>>>>> > slurmctld.service - Slurm controller daemon
>>>>>>>>> >    Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled)
>>>>>>>>> >    Active: failed (Result: timeout) since Mon 2018-01-15 13:03:39 CET; 41s ago
>>>>>>>>> >   Process: 2098 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
>>>>>>>>> >
>>>>>>>>> >  slurmctld[2100]: cons_res: select_p_reconfigure
>>>>>>>>> >  slurmctld[2100]: cons_res: select_p_node_init
>>>>>>>>> >  slurmctld[2100]: cons_res: preparing for 1 partitions
>>>>>>>>> >  slurmctld[2100]: Running as primary controller
>>>>>>>>> >  slurmctld[2100]: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
>>>>>>>>> >  slurmctld.service start operation timed out. Terminating.
>>>>>>>>> > Terminate signal (SIGINT or SIGTERM) received
>>>>>>>>> >  slurmctld[2100]: Saving all slurm state
>>>>>>>>> >  Failed to start Slurm controller daemon.
>>>>>>>>> >  Unit slurmctld.service entered failed state.
>>>>>>>>>
>>>>>>>>> Do you have a backup controller?
>>>>>>>>> Check your slurm.conf under:
>>>>>>>>> /etc/slurm-llnl
>>>>>>>>>
>>>>>>>>> Anyway, I suggest updating the operating system to stretch and fixing
>>>>>>>>> your configuration under a more recent version of slurm.
>>>>>>>>> Best regards
>>>>>>>>> --
>>>>>>>>> Gennaro Oliva
>>>>>
>>>>> --
>>>>> --
>>>>> Carles Fenoy
>>>
>>> --
>>> --
>>> Carles Fenoy
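
PS: two more guesses from re-reading the output above, in case they matter.
The slurmd -Dvvv log complains "Node configuration differs from hardware", so
maybe my NodeName line should describe the real topology slurmd detected,
something like this (only a guess, not tested):

  NodeName=node[01-08] CPUs=16 Sockets=4 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=16000 State=UNKNOWN

And about the logs, I suppose I could uncomment the log settings and point
them at the directory Gennaro mentioned, e.g.:

  SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
  SlurmdLogFile=/var/log/slurm-llnl/slurmd.log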