The deeper I go into the problem, the worse it seems... but maybe I'm a step closer to the solution.
I discovered that munge was disabled on the nodes (my fault: Gennaro pointed out the problem earlier, but I had re-enabled it only on the master). By the way, it's very strange that the wheezy->jessie upgrade disabled munge on all the nodes and on the master... Unfortunately, re-enabling munge on the nodes didn't make slurmd start again. Maybe filling in this setting could give me some more information about the problem (see the sketch after the logs below)?

#SlurmdLogFile=

Thank you very much for your help. It is very precious to me.
betta

PS: some tests I made. Running slurmd -Dvvv on the nodes returns:

slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: Considering each NUMA node as a socket
slurmd: debug: CPUs:16 Boards:1 Sockets:4 CoresPerSocket:4 ThreadsPerCore:1
slurmd: Node configuration differs from hardware: CPUs=16:16(hw) Boards=1:1(hw) SocketsPerBoard=16:4(hw) CoresPerSocket=1:4(hw) ThreadsPerCore=1:1(hw)
slurmd: topology NONE plugin loaded
slurmd: Gathering cpu frequency information for 16 cpus
slurmd: debug: Reading cgroup.conf file /etc/slurm-llnl/cgroup.conf
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: Considering each NUMA node as a socket
slurmd: debug: CPUs:16 Boards:1 Sockets:4 CoresPerSocket:4 ThreadsPerCore:1
slurmd: debug: Reading cgroup.conf file /etc/slurm-llnl/cgroup.conf
slurmd: debug: task/cgroup: now constraining jobs allocated cores
slurmd: task/cgroup: loaded
slurmd: auth plugin for Munge (http://code.google.com/p/munge/) loaded
slurmd: debug: spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
slurmd: Munge cryptographic signature plugin loaded
slurmd: Warning: Core limit is only 0 KB
slurmd: slurmd version 14.03.9 started
slurmd: Job accounting gather LINUX plugin loaded
slurmd: debug: job_container none plugin loaded
slurmd: switch NONE plugin loaded
slurmd: slurmd started on Mon, 15 Jan 2018 18:07:17 +0100
slurmd: CPUs=16 Boards=1 Sockets=16 Cores=1 Threads=1 Memory=15999 TmpDisk=40189 Uptime=1254
slurmd: AcctGatherEnergy NONE plugin loaded
slurmd: AcctGatherProfile NONE plugin loaded
slurmd: AcctGatherInfiniband NONE plugin loaded
slurmd: AcctGatherFilesystem NONE plugin loaded
slurmd: debug2: No acct_gather.conf file (/etc/slurm-llnl/acct_gather.conf)
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug: Failed to contact primary controller: Connection refused
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug: Failed to contact primary controller: Connection refused
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug: Failed to contact primary controller: Connection refused
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug: Failed to contact primary controller: Connection refused
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug: Failed to contact primary controller: Connection refused
^Cslurmd: got shutdown request
slurmd: waiting on 1 active threads
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug: Failed to contact primary controller: Connection refused
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug: Failed to contact primary controller: Connection refused
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug: Failed to contact primary controller: Connection refused
^C^C^C^Cslurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug: Failed to contact primary controller: Connection refused
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug: Failed to contact primary controller: Connection refused
slurmd: error: Unable to register: Unable to contact slurm controller (connect failure)
slurmd: debug: Unable to register with slurm controller, retrying
slurmd: all threads complete
slurmd: Consumable Resources (CR) Node Selection plugin shutting down ...
slurmd: Munge cryptographic signature plugin unloaded
slurmd: Slurmd shutdown completing

which is maybe not as bad as it seems, since it may only indicate that Slurm is not up on the master, isn't it?

On the master, running

service slurmctld restart

returns

Job for slurmctld.service failed. See 'systemctl status slurmctld.service' and 'journalctl -xn' for details.

and

service slurmctld status

returns

slurmctld.service - Slurm controller daemon
   Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled)
   Active: failed (Result: timeout) since Mon 2018-01-15 18:11:20 CET; 44s ago
  Process: 2223 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)

slurmctld[2225]: cons_res: select_p_reconfigure
slurmctld[2225]: cons_res: select_p_node_init
slurmctld[2225]: cons_res: preparing for 1 partitions
slurmctld[2225]: Running as primary controller
slurmctld[2225]: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
systemd[1]: slurmctld.service start operation timed out. Terminating.
slurmctld[2225]: Terminate signal (SIGINT or SIGTERM) received
slurmctld[2225]: Saving all slurm state
systemd[1]: Failed to start Slurm controller daemon.
systemd[1]: Unit slurmctld.service entered failed state.

and journalctl -xn returns no visible error:

-- Logs begin at Mon 2018-01-15 18:04:38 CET, end at Mon 2018-01-15 18:17:33 CET. --
Jan 15 18:16:23 anyone.phys.uniroma1.it slurmctld[2286]: Saving all slurm state
Jan 15 18:16:23 anyone.phys.uniroma1.it systemd[1]: Failed to start Slurm controller daemon.
-- Subject: Unit slurmctld.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit slurmctld.service has failed.
--
-- The result is failed.
systemd[1]: Unit slurmctld.service entered failed state.
CRON[2312]: pam_unix(cron:session): session opened for user root by (uid=0)
CRON[2313]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
CRON[2312]: pam_unix(cron:session): session closed for user root
dhcpd[1538]: DHCPREQUEST for 192.168.1.101 from c8:60:00:32:c6:c4 via eth1
dhcpd[1538]: DHCPACK on 192.168.1.101 to c8:60:00:32:c6:c4 via eth1
dhcpd[1538]: DHCPREQUEST for 192.168.1.102 from bc:ae:c5:12:97:75 via eth1
dhcpd[1538]: DHCPACK on 192.168.1.102 to bc:ae:c5:12:97:75 via eth1
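PPS: in case it helps, this is only a sketch of what I am thinking of adding to /etc/slurm-llnl/slurm.conf to get some logs once the daemons start; the file names are my own guess, I just put them under the /var/log/slurm-llnl directory Gennaro mentioned:

    SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
    SlurmdLogFile=/var/log/slurm-llnl/slurmd.log

and maybe I will also raise SlurmctldDebug/SlurmdDebug from 3 to something higher while debugging. For the "Node configuration differs from hardware" warning, I guess the node line should describe the layout slurmd detects (Sockets:4 CoresPerSocket:4 ThreadsPerCore:1), something like:

    NodeName=node[01-08] CPUs=16 Sockets=4 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=16000 State=UNKNOWN

but please correct me if that is wrong. I will also try running slurmctld -Dvvv in the foreground on the master, like slurmd -Dvvv on the nodes, to see where it stops before systemd times out.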
2018-01-15 16:56 GMT+01:00 Carlos Fenoy <mini...@gmail.com>:

> Hi,
>
> you cannot start slurmd on the headnode. Try running the same command
> on the compute nodes and check the output. If there is any issue it should
> display the reason.
>
> Regards,
> Carlos
>
> On Mon, Jan 15, 2018 at 4:50 PM, Elisabetta Falivene <e.faliv...@ilabroma.com> wrote:
>
>> On the headnode. (I'm also noticing, and it seems worth mentioning since maybe
>> the problem is the same, that even LDAP is not working as expected, giving the
>> message "invalid credential (49)", which is the kind of message you get with this
>> type of problem. The upgrade to jessie must have touched something that is
>> affecting the sanity of all my software :D )
>>
>> Here is my slurm.conf.
>>
>> # slurm.conf file generated by configurator.html.
>> # Put this file on all nodes of your cluster.
>> # See the slurm.conf man page for more information.
>> #
>> ControlMachine=anyone
>> ControlAddr=master
>> #BackupController=
>> #BackupAddr=
>> #
>> AuthType=auth/munge
>> CacheGroups=0
>> #CheckpointType=checkpoint/none
>> CryptoType=crypto/munge
>> #DisableRootJobs=NO
>> #EnforcePartLimits=NO
>> #Epilog=
>> #EpilogSlurmctld=
>> #FirstJobId=1
>> #MaxJobId=999999
>> #GresTypes=
>> #GroupUpdateForce=0
>> #GroupUpdateTime=600
>> #JobCheckpointDir=/var/slurm/checkpoint
>> #JobCredentialPrivateKey=
>> #JobCredentialPublicCertificate=
>> #JobFileAppend=0
>> #JobRequeue=1
>> #JobSubmitPlugins=1
>> #KillOnBadExit=0
>> #Licenses=foo*4,bar
>> #MailProg=/bin/mail
>> #MaxJobCount=5000
>> #MaxStepCount=40000
>> #MaxTasksPerNode=128
>> MpiDefault=openmpi
>> MpiParams=ports=12000-12999
>> #PluginDir=
>> #PlugStackConfig=
>> #PrivateData=jobs
>> ProctrackType=proctrack/cgroup
>> #Prolog=
>> #PrologSlurmctld=
>> #PropagatePrioProcess=0
>> #PropagateResourceLimits=
>> #PropagateResourceLimitsExcept=
>> ReturnToService=2
>> #SallocDefaultCommand=
>> SlurmctldPidFile=/var/run/slurmctld.pid
>> SlurmctldPort=6817
>> SlurmdPidFile=/var/run/slurmd.pid
>> SlurmdPort=6818
>> SlurmdSpoolDir=/tmp/slurmd
>> SlurmUser=slurm
>> #SlurmdUser=root
>> #SrunEpilog=
>> #SrunProlog=
>> StateSaveLocation=/tmp
>> SwitchType=switch/none
>> #TaskEpilog=
>> TaskPlugin=task/cgroup
>> #TaskPluginParam=
>> #TaskProlog=
>> #TopologyPlugin=topology/tree
>> #TmpFs=/tmp
>> #TrackWCKey=no
>> #TreeWidth=
>> #UnkillableStepProgram=
>> #UsePAM=0
>> #
>> #
>> # TIMERS
>> #BatchStartTimeout=10
>> #CompleteWait=0
>> #EpilogMsgTime=2000
>> #GetEnvTimeout=2
>> #HealthCheckInterval=0
>> #HealthCheckProgram=
>> InactiveLimit=0
>> KillWait=60
>> #MessageTimeout=10
>> #ResvOverRun=0
>> MinJobAge=43200
>> #OverTimeLimit=0
>> SlurmctldTimeout=600
>> SlurmdTimeout=600
>> #UnkillableStepTimeout=60
>> #VSizeFactor=0
>> Waittime=0
>> #
>> #
>> # SCHEDULING
>> DefMemPerCPU=1000
>> FastSchedule=1
>> #MaxMemPerCPU=0
>> #SchedulerRootFilter=1
>> #SchedulerTimeSlice=30
>> SchedulerType=sched/backfill
>> #SchedulerPort=
>> SelectType=select/cons_res
>> SelectTypeParameters=CR_CPU_Memory
>> #
>> #
>> # JOB PRIORITY
>> #PriorityType=priority/basic
>> #PriorityDecayHalfLife=
>> #PriorityCalcPeriod=
>> #PriorityFavorSmall=
>> #PriorityMaxAge=
>> #PriorityUsageResetPeriod=
>> #PriorityWeightAge=
>> #PriorityWeightFairshare=
>> #PriorityWeightJobSize=
>> #PriorityWeightPartition=
>> #PriorityWeightQOS=
>> #
>> #
>> # LOGGING AND ACCOUNTING
>> #AccountingStorageEnforce=0
>> #AccountingStorageHost=
>> AccountingStorageLoc=/var/log/slurm-llnl/AccountingStorage.log
>> #AccountingStoragePass=
>> #AccountingStoragePort=
>> AccountingStorageType=accounting_storage/filetxt
>> #AccountingStorageUser=
>> AccountingStoreJobComment=YES
>> ClusterName=cluster
>> #DebugFlags=
>> #JobCompHost=
>> JobCompLoc=/var/log/slurm-llnl/JobComp.log
>> #JobCompPass=
>> #JobCompPort=
>> JobCompType=jobcomp/filetxt
>> #JobCompUser=
>> JobAcctGatherFrequency=60
>> JobAcctGatherType=jobacct_gather/linux
>> SlurmctldDebug=3
>> #SlurmctldLogFile=
>> SlurmdDebug=3
>> #SlurmdLogFile=
>> #SlurmSchedLogFile=
>> #SlurmSchedLogLevel=
>> #
>> #
>> # POWER SAVE SUPPORT FOR IDLE NODES (optional)
>> #SuspendProgram=
>> #ResumeProgram=
>> #SuspendTimeout=
>> #ResumeTimeout=
>> #ResumeRate=
>> #SuspendExcNodes=
>> #SuspendExcParts=
>> #SuspendRate=
>> #SuspendTime=
>> #
>> #
>> # COMPUTE NODES
>> NodeName=node[01-08] CPUs=16 RealMemory=16000 State=UNKNOWN
>> PartitionName=batch Nodes=node[01-08] Default=YES MaxTime=INFINITE State=UP
>>
>>
>> 2018-01-15 16:43 GMT+01:00 Carlos Fenoy <mini...@gmail.com>:
>>
>>> Are you trying to start slurmd on the headnode or on a compute node?
>>>
>>> Can you provide the slurm.conf file?
>>>
>>> Regards,
>>> Carlos
>>>
>>> On Mon, Jan 15, 2018 at 4:30 PM, Elisabetta Falivene <e.faliv...@ilabroma.com> wrote:
>>>
>>>> slurmd -Dvvv says
>>>>
>>>> slurmd: fatal: Unable to determine this slurmd's NodeName
>>>>
>>>> b
>>>>
>>>> 2018-01-15 15:58 GMT+01:00 Douglas Jacobsen <dmjacob...@lbl.gov>:
>>>>
>>>>> The fact that sinfo is responding shows that at least slurmctld is
>>>>> running. Slurmd, on the other hand, is not. Please also get the output of
>>>>> the slurmd log or of running "slurmd -Dvvv".
>>>>>
>>>>> On Jan 15, 2018 06:42, "Elisabetta Falivene" <e.faliv...@ilabroma.com> wrote:
>>>>>
>>>>>> > Anyway I suggest to update the operating system to stretch and fix your
>>>>>> > configuration under a more recent version of slurm.
>>>>>>
>>>>>> I think I'll soon arrive at that :)
>>>>>> b
>>>>>>
>>>>>> 2018-01-15 14:08 GMT+01:00 Gennaro Oliva <oliv...@na.icar.cnr.it>:
>>>>>>
>>>>>>> Ciao Elisabetta,
>>>>>>>
>>>>>>> On Mon, Jan 15, 2018 at 01:13:27PM +0100, Elisabetta Falivene wrote:
>>>>>>> > Error messages are not much helping me in guessing what is going on. What
>>>>>>> > should I check to get what is failing?
>>>>>>>
>>>>>>> check slurmctld.log and slurmd.log, you can find them under
>>>>>>> /var/log/slurm-llnl
>>>>>>>
>>>>>>> > PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST
>>>>>>> > batch      up     infinite   8      unk*   node[01-08]
>>>>>>> >
>>>>>>> > Running
>>>>>>> >
>>>>>>> > systemctl status slurmctld.service
>>>>>>> >
>>>>>>> > returns
>>>>>>> >
>>>>>>> > slurmctld.service - Slurm controller daemon
>>>>>>> >    Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled)
>>>>>>> >    Active: failed (Result: timeout) since Mon 2018-01-15 13:03:39 CET; 41s ago
>>>>>>> >   Process: 2098 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
>>>>>>> >
>>>>>>> > slurmctld[2100]: cons_res: select_p_reconfigure
>>>>>>> > slurmctld[2100]: cons_res: select_p_node_init
>>>>>>> > slurmctld[2100]: cons_res: preparing for 1 partitions
>>>>>>> > slurmctld[2100]: Running as primary controller
>>>>>>> > slurmctld[2100]: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
>>>>>>> > slurmctld.service start operation timed out. Terminating.
>>>>>>> > Terminate signal (SIGINT or SIGTERM) received
>>>>>>> > slurmctld[2100]: Saving all slurm state
>>>>>>> > Failed to start Slurm controller daemon.
>>>>>>> > Unit slurmctld.service entered failed state.
>>>>>>>
>>>>>>> Do you have a backup controller?
>>>>>>> Check your slurm.conf under:
>>>>>>> /etc/slurm-llnl
>>>>>>>
>>>>>>> Anyway I suggest to update the operating system to stretch and fix your
>>>>>>> configuration under a more recent version of slurm.
>>>>>>> Best regards
>>>>>>> --
>>>>>>> Gennaro Oliva
>>>>
>>>
>>> --
>>> --
>>> Carles Fenoy
>>
>
> --
> --
> Carles Fenoy