> It seems like the pidfile in systemd and slurm.conf are different. Check
> if they are the same and if not adjust the slurm.conf pid files. That
> should prevent systemd from killing slurm.

Ehm, sorry, how can I do this?
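Is it just a matter of comparing the PIDFile= lines in the unit files with the
SlurmctldPidFile / SlurmdPidFile entries in slurm.conf? I guess something like
this (the slurmctld unit path is the one systemctl reports below; the slurmd
one is only my guess, and on the master only the first may exist):

  grep -i pidfile /lib/systemd/system/slurmctld.service /lib/systemd/system/slurmd.service
  grep -i pidfile /etc/slurm-llnl/slurm.conf

and then editing the slurm.conf entries so they match whatever the unit files
say?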
> On Mon, 15 Jan 2018, 18:24 Elisabetta Falivene <e.faliv...@ilabroma.com> wrote:
>
>> The deeper I go into the problem, the worse it seems... but maybe I'm a
>> step closer to the solution.
>>
>> I discovered that munge was disabled on the nodes (my fault, Gennaro
>> pointed out the problem before, but I enabled it back only on the master).
>> Btw, it's very strange that the wheezy->jessie upgrade disabled munge on
>> all the nodes and the master...
>>
>> Unfortunately, re-enabling munge on the nodes didn't make slurmd start
>> again.
>>
>> Maybe filling in this setting could give me some info about the problem?
>> #SlurmdLogFile=
>>
>> Thank you very much for your help. It is very precious to me.
>> betta
>>
>> Ps: some tests I made ->
>>
>> Running on the nodes
>>
>> slurmd -Dvvv
>>
>> returns
>>
>> slurmd: debug2: hwloc_topology_init
>> slurmd: debug2: hwloc_topology_load
>> slurmd: Considering each NUMA node as a socket
>> slurmd: debug: CPUs:16 Boards:1 Sockets:4 CoresPerSocket:4 ThreadsPerCore:1
>> slurmd: Node configuration differs from hardware: CPUs=16:16(hw) Boards=1:1(hw) SocketsPerBoard=16:4(hw) CoresPerSocket=1:4(hw) ThreadsPerCore=1:1(hw)
>> slurmd: topology NONE plugin loaded
>> slurmd: Gathering cpu frequency information for 16 cpus
>> slurmd: debug: Reading cgroup.conf file /etc/slurm-llnl/cgroup.conf
>> slurmd: debug2: hwloc_topology_init
>> slurmd: debug2: hwloc_topology_load
>> slurmd: Considering each NUMA node as a socket
>> slurmd: debug: CPUs:16 Boards:1 Sockets:4 CoresPerSocket:4 ThreadsPerCore:1
>> slurmd: debug: Reading cgroup.conf file /etc/slurm-llnl/cgroup.conf
>> slurmd: debug: task/cgroup: now constraining jobs allocated cores
>> slurmd: task/cgroup: loaded
>> slurmd: auth plugin for Munge (http://code.google.com/p/munge/) loaded
>> slurmd: debug: spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
>> slurmd: Munge cryptographic signature plugin loaded
>> slurmd: Warning: Core limit is only 0 KB
>> slurmd: slurmd version 14.03.9 started
>> slurmd: Job accounting gather LINUX plugin loaded
>> slurmd: debug: job_container none plugin loaded
>> slurmd: switch NONE plugin loaded
>> slurmd: slurmd started on Mon, 15 Jan 2018 18:07:17 +0100
>> slurmd: CPUs=16 Boards=1 Sockets=16 Cores=1 Threads=1 Memory=15999 TmpDisk=40189 Uptime=1254
>> slurmd: AcctGatherEnergy NONE plugin loaded
>> slurmd: AcctGatherProfile NONE plugin loaded
>> slurmd: AcctGatherInfiniband NONE plugin loaded
>> slurmd: AcctGatherFilesystem NONE plugin loaded
>> slurmd: debug2: No acct_gather.conf file (/etc/slurm-llnl/acct_gather.conf)
>> slurmd: debug2: _slurm_connect failed: Connection refused
>> slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
>> slurmd: debug: Failed to contact primary controller: Connection refused
>> slurmd: debug2: _slurm_connect failed: Connection refused
>> slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
>> slurmd: debug: Failed to contact primary controller: Connection refused
>> slurmd: debug2: _slurm_connect failed: Connection refused
>> slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
>> slurmd: debug: Failed to contact primary controller: Connection refused
>> slurmd: debug2: _slurm_connect failed: Connection refused
>> slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
>> slurmd: debug: Failed to contact primary controller: Connection refused
>> slurmd: debug2: _slurm_connect failed: Connection refused
>> slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
>> slurmd: debug: Failed to contact primary controller: Connection refused
>> ^Cslurmd: got shutdown request
>> slurmd: waiting on 1 active threads
>> slurmd: debug2: _slurm_connect failed: Connection refused
>> slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
>> slurmd: debug: Failed to contact primary controller: Connection refused
>> slurmd: debug2: _slurm_connect failed: Connection refused
>> slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
>> slurmd: debug: Failed to contact primary controller: Connection refused
>> slurmd: debug2: _slurm_connect failed: Connection refused
>> slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
>> slurmd: debug: Failed to contact primary controller: Connection refused
>> ^C^C^C^Cslurmd: debug2: _slurm_connect failed: Connection refused
>> slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
>> slurmd: debug: Failed to contact primary controller: Connection refused
>> slurmd: debug2: _slurm_connect failed: Connection refused
>> slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
>> slurmd: debug: Failed to contact primary controller: Connection refused
>> slurmd: error: Unable to register: Unable to contact slurm controller (connect failure)
>> slurmd: debug: Unable to register with slurm controller, retrying
>> slurmd: all threads complete
>> slurmd: Consumable Resources (CR) Node Selection plugin shutting down ...
>> slurmd: Munge cryptographic signature plugin unloaded
>> slurmd: Slurmd shutdown completing
>>
>> which maybe is not as bad as it seems, for it may only point out that
>> slurm is not up on the master, isn't it?
>>
>> On the master, running
>>
>> service slurmctld restart
>>
>> returns
>>
>> Job for slurmctld.service failed. See 'systemctl status slurmctld.service' and 'journalctl -xn' for details.
>>
>> and
>>
>> service slurmctld status
>>
>> returns
>>
>> slurmctld.service - Slurm controller daemon
>>    Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled)
>>    Active: failed (Result: timeout) since Mon 2018-01-15 18:11:20 CET; 44s ago
>>   Process: 2223 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
>>
>>  slurmctld[2225]: cons_res: select_p_reconfigure
>>  slurmctld[2225]: cons_res: select_p_node_init
>>  slurmctld[2225]: cons_res: preparing for 1 partitions
>>  slurmctld[2225]: Running as primary controller
>>  slurmctld[2225]: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
>>  systemd[1]: slurmctld.service start operation timed out. Terminating.
>>  slurmctld[2225]: Terminate signal (SIGINT or SIGTERM) received
>>  slurmctld[2225]: Saving all slurm state
>>  systemd[1]: Failed to start Slurm controller daemon.
>>  systemd[1]: Unit slurmctld.service entered failed state.
>>
>> and
>>
>> journalctl -xn
>>
>> returns no visible error
>>
>> -- Logs begin at Mon 2018-01-15 18:04:38 CET, end at Mon 2018-01-15 18:17:33 CET. --
>> Jan 15 18:16:23 anyone.phys.uniroma1.it slurmctld[2286]: Saving all slurm state
>> Jan 15 18:16:23 anyone.phys.uniroma1.it systemd[1]: Failed to start Slurm controller daemon.
>> -- Subject: Unit slurmctld.service has failed
>> -- Defined-By: systemd
>> -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
>> --
>> -- Unit slurmctld.service has failed.
>> --
>> -- The result is failed.
>>  systemd[1]: Unit slurmctld.service entered failed state.
>>  CRON[2312]: pam_unix(cron:session): session opened for user root by (uid=0)
>> CRON[2313]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
>> CRON[2312]: pam_unix(cron:session): session closed for user root
>> dhcpd[1538]: DHCPREQUEST for 192.168.1.101 from c8:60:00:32:c6:c4 via eth1
>> dhcpd[1538]: DHCPACK on 192.168.1.101 to c8:60:00:32:c6:c4 via eth1
>> dhcpd[1538]: DHCPREQUEST for 192.168.1.102 from bc:ae:c5:12:97:75 via eth1
>> dhcpd[1538]: DHCPACK on 192.168.1.102 to bc:ae:c5:12:97:75 via eth1
>>
>> 2018-01-15 16:56 GMT+01:00 Carlos Fenoy <mini...@gmail.com>:
>>
>>> Hi,
>>>
>>> you cannot start slurmd on the headnode. Try running the same
>>> command on the compute nodes and check the output. If there is any
>>> issue it should display the reason.
>>>
>>> Regards,
>>> Carlos
>>>
>>> On Mon, Jan 15, 2018 at 4:50 PM, Elisabetta Falivene <e.faliv...@ilabroma.com> wrote:
>>>
>>>> On the headnode. (I'm also noticing, and it seems worth mentioning since
>>>> maybe the problem is the same, that even ldap is not working as expected,
>>>> giving an "invalid credential (49)" message. The update to jessie must
>>>> have touched something that is affecting all my software sanity :D )
>>>>
>>>> Here is my slurm.conf.
>>>>
>>>> # slurm.conf file generated by configurator.html.
>>>> # Put this file on all nodes of your cluster.
>>>> # See the slurm.conf man page for more information.
>>>> #
>>>> ControlMachine=anyone
>>>> ControlAddr=master
>>>> #BackupController=
>>>> #BackupAddr=
>>>> #
>>>> AuthType=auth/munge
>>>> CacheGroups=0
>>>> #CheckpointType=checkpoint/none
>>>> CryptoType=crypto/munge
>>>> #DisableRootJobs=NO
>>>> #EnforcePartLimits=NO
>>>> #Epilog=
>>>> #EpilogSlurmctld=
>>>> #FirstJobId=1
>>>> #MaxJobId=999999
>>>> #GresTypes=
>>>> #GroupUpdateForce=0
>>>> #GroupUpdateTime=600
>>>> #JobCheckpointDir=/var/slurm/checkpoint
>>>> #JobCredentialPrivateKey=
>>>> #JobCredentialPublicCertificate=
>>>> #JobFileAppend=0
>>>> #JobRequeue=1
>>>> #JobSubmitPlugins=1
>>>> #KillOnBadExit=0
>>>> #Licenses=foo*4,bar
>>>> #MailProg=/bin/mail
>>>> #MaxJobCount=5000
>>>> #MaxStepCount=40000
>>>> #MaxTasksPerNode=128
>>>> MpiDefault=openmpi
>>>> MpiParams=ports=12000-12999
>>>> #PluginDir=
>>>> #PlugStackConfig=
>>>> #PrivateData=jobs
>>>> ProctrackType=proctrack/cgroup
>>>> #Prolog=
>>>> #PrologSlurmctld=
>>>> #PropagatePrioProcess=0
>>>> #PropagateResourceLimits=
>>>> #PropagateResourceLimitsExcept=
>>>> ReturnToService=2
>>>> #SallocDefaultCommand=
>>>> SlurmctldPidFile=/var/run/slurmctld.pid
>>>> SlurmctldPort=6817
>>>> SlurmdPidFile=/var/run/slurmd.pid
>>>> SlurmdPort=6818
>>>> SlurmdSpoolDir=/tmp/slurmd
>>>> SlurmUser=slurm
>>>> #SlurmdUser=root
>>>> #SrunEpilog=
>>>> #SrunProlog=
>>>> StateSaveLocation=/tmp
>>>> SwitchType=switch/none
>>>> #TaskEpilog=
>>>> TaskPlugin=task/cgroup
>>>> #TaskPluginParam=
>>>> #TaskProlog=
>>>> #TopologyPlugin=topology/tree
>>>> #TmpFs=/tmp
>>>> #TrackWCKey=no
>>>> #TreeWidth=
>>>> #UnkillableStepProgram=
>>>> #UsePAM=0
>>>> #
>>>> #
>>>> # TIMERS
>>>> #BatchStartTimeout=10
>>>> #CompleteWait=0
>>>> #EpilogMsgTime=2000
>>>> #GetEnvTimeout=2
>>>> #HealthCheckInterval=0
>>>> #HealthCheckProgram=
>>>> InactiveLimit=0
>>>> KillWait=60
>>>> #MessageTimeout=10
>>>> #ResvOverRun=0
>>>> MinJobAge=43200
>>>> #OverTimeLimit=0
>>>> SlurmctldTimeout=600
>>>> SlurmdTimeout=600
>>>> #UnkillableStepTimeout=60
>>>> #VSizeFactor=0
>>>> Waittime=0
>>>> #
>>>> #
>>>> # SCHEDULING
>>>> DefMemPerCPU=1000
>>>> FastSchedule=1
>>>> #MaxMemPerCPU=0
>>>> #SchedulerRootFilter=1
>>>> #SchedulerTimeSlice=30
>>>> SchedulerType=sched/backfill
>>>> #SchedulerPort=
>>>> SelectType=select/cons_res
>>>> SelectTypeParameters=CR_CPU_Memory
>>>> #
>>>> #
>>>> # JOB PRIORITY
>>>> #PriorityType=priority/basic
>>>> #PriorityDecayHalfLife=
>>>> #PriorityCalcPeriod=
>>>> #PriorityFavorSmall=
>>>> #PriorityMaxAge=
>>>> #PriorityUsageResetPeriod=
>>>> #PriorityWeightAge=
>>>> #PriorityWeightFairshare=
>>>> #PriorityWeightJobSize=
>>>> #PriorityWeightPartition=
>>>> #PriorityWeightQOS=
>>>> #
>>>> #
>>>> # LOGGING AND ACCOUNTING
>>>> #AccountingStorageEnforce=0
>>>> #AccountingStorageHost=
>>>> AccountingStorageLoc=/var/log/slurm-llnl/AccountingStorage.log
>>>> #AccountingStoragePass=
>>>> #AccountingStoragePort=
>>>> AccountingStorageType=accounting_storage/filetxt
>>>> #AccountingStorageUser=
>>>> AccountingStoreJobComment=YES
>>>> ClusterName=cluster
>>>> #DebugFlags=
>>>> #JobCompHost=
>>>> JobCompLoc=/var/log/slurm-llnl/JobComp.log
>>>> #JobCompPass=
>>>> #JobCompPort=
>>>> JobCompType=jobcomp/filetxt
>>>> #JobCompUser=
>>>> JobAcctGatherFrequency=60
>>>> JobAcctGatherType=jobacct_gather/linux
>>>> SlurmctldDebug=3
>>>> #SlurmctldLogFile=
>>>> SlurmdDebug=3
>>>> #SlurmdLogFile=
>>>> #SlurmSchedLogFile=
>>>> #SlurmSchedLogLevel=
>>>> #
>>>> #
>>>> # POWER SAVE SUPPORT FOR IDLE NODES (optional)
>>>> #SuspendProgram=
>>>> #ResumeProgram=
>>>> #SuspendTimeout=
>>>> #ResumeTimeout=
>>>> #ResumeRate=
>>>> #SuspendExcNodes=
>>>> #SuspendExcParts=
>>>> #SuspendRate=
>>>> #SuspendTime=
>>>> #
>>>> #
>>>> # COMPUTE NODES
>>>> NodeName=node[01-08] CPUs=16 RealMemory=16000 State=UNKNOWN
>>>> PartitionName=batch Nodes=node[01-08] Default=YES MaxTime=INFINITE State=UP
>>>>
>>>>
>>>> 2018-01-15 16:43 GMT+01:00 Carlos Fenoy <mini...@gmail.com>:
>>>>
>>>>> Are you trying to start slurmd on the headnode or a compute node?
>>>>>
>>>>> Can you provide the slurm.conf file?
>>>>>
>>>>> Regards,
>>>>> Carlos
>>>>>
>>>>> On Mon, Jan 15, 2018 at 4:30 PM, Elisabetta Falivene <e.faliv...@ilabroma.com> wrote:
>>>>>
>>>>>> slurmd -Dvvv says
>>>>>>
>>>>>> slurmd: fatal: Unable to determine this slurmd's NodeName
>>>>>>
>>>>>> b
>>>>>>
>>>>>> 2018-01-15 15:58 GMT+01:00 Douglas Jacobsen <dmjacob...@lbl.gov>:
>>>>>>
>>>>>>> The fact that sinfo is responding shows that at least slurmctld is
>>>>>>> running. Slurmd, on the other hand, is not. Please also get the output
>>>>>>> of the slurmd log or of running "slurmd -Dvvv".
>>>>>>>
>>>>>>> On Jan 15, 2018 06:42, "Elisabetta Falivene" <e.faliv...@ilabroma.com> wrote:
>>>>>>>
>>>>>>>> > Anyway I suggest to update the operating system to stretch and
>>>>>>>> > fix your configuration under a more recent version of slurm.
>>>>>>>>
>>>>>>>> I think I'll soon arrive at that :)
>>>>>>>> b
>>>>>>>>
>>>>>>>> 2018-01-15 14:08 GMT+01:00 Gennaro Oliva <oliv...@na.icar.cnr.it>:
>>>>>>>>
>>>>>>>>> Ciao Elisabetta,
>>>>>>>>>
>>>>>>>>> On Mon, Jan 15, 2018 at 01:13:27PM +0100, Elisabetta Falivene wrote:
>>>>>>>>> > Error messages are not much helping me in guessing what is going
>>>>>>>>> > on. What should I check to find out what is failing?
>>>>>>>>>
>>>>>>>>> Check slurmctld.log and slurmd.log; you can find them under
>>>>>>>>> /var/log/slurm-llnl
>>>>>>>>>
>>>>>>>>> > PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
>>>>>>>>> > batch up infinite 8 unk* node[01-08]
>>>>>>>>> >
>>>>>>>>> > Running
>>>>>>>>> >
>>>>>>>>> > systemctl status slurmctld.service
>>>>>>>>> >
>>>>>>>>> > returns
>>>>>>>>> >
>>>>>>>>> > slurmctld.service - Slurm controller daemon
>>>>>>>>> >    Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled)
>>>>>>>>> >    Active: failed (Result: timeout) since Mon 2018-01-15 13:03:39 CET; 41s ago
>>>>>>>>> >   Process: 2098 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
>>>>>>>>> >
>>>>>>>>> >  slurmctld[2100]: cons_res: select_p_reconfigure
>>>>>>>>> >  slurmctld[2100]: cons_res: select_p_node_init
>>>>>>>>> >  slurmctld[2100]: cons_res: preparing for 1 partitions
>>>>>>>>> >  slurmctld[2100]: Running as primary controller
>>>>>>>>> >  slurmctld[2100]: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
>>>>>>>>> >  slurmctld.service start operation timed out. Terminating.
>>>>>>>>> > Terminate signal (SIGINT or SIGTERM) received
>>>>>>>>> >  slurmctld[2100]: Saving all slurm state
>>>>>>>>> >  Failed to start Slurm controller daemon.
>>>>>>>>> >  Unit slurmctld.service entered failed state.
>>>>>>>>>
>>>>>>>>> Do you have a backup controller?
>>>>>>>>> Check your slurm.conf under:
>>>>>>>>> /etc/slurm-llnl
>>>>>>>>>
>>>>>>>>> Anyway, I suggest updating the operating system to stretch and fixing
>>>>>>>>> your configuration under a more recent version of slurm.
>>>>>>>>> Best regards
>>>>>>>>> --
>>>>>>>>> Gennaro Oliva
>>>>>
>>>>> --
>>>>> --
>>>>> Carles Fenoy
>>>
>>> --
>>> --
>>> Carles Fenoy
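
PS: two more guesses from re-reading the output above, in case they matter.
The slurmd -Dvvv log complains "Node configuration differs from hardware", so
maybe my NodeName line should describe the real topology slurmd detected,
something like this (only a guess, not tested):

  NodeName=node[01-08] CPUs=16 Sockets=4 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=16000 State=UNKNOWN

And about the logs, I suppose I could uncomment the log settings and point
them at the directory Gennaro mentioned, e.g.:

  SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
  SlurmdLogFile=/var/log/slurm-llnl/slurmd.log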