On the headnode. (I'm also noticing, and it seems worth mentioning in case the problem is the same, that even LDAP is not working as expected: it gives the message "invalid credentials (49)", which is the error reported for this kind of problem. The update to jessie must have touched something that is affecting the sanity of all my software :D )
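One check I can run on the LDAP side, to see whether it is really a credentials problem, is an explicit bind with ldapsearch (the bind DN, search base and uid below are only placeholders, not values from my setup):

ldapsearch -x -H ldap://localhost \
  -D "cn=admin,dc=example,dc=com" -W \
  -b "dc=example,dc=com" "(uid=testuser)"   # placeholder DN, base and uid

If this also fails with "ldap_bind: Invalid credentials (49)", the stored password/credentials are the problem rather than anything Slurm-related.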
Here is my slurm.conf.

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=anyone
ControlAddr=master
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=openmpi
MpiParams=ports=12000-12999
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
ReturnToService=2
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/tmp/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/tmp
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/cgroup
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFs=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=60
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=43200
#OverTimeLimit=0
SlurmctldTimeout=600
SlurmdTimeout=600
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
DefMemPerCPU=1000
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
#SchedulerPort=
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
#
#
# JOB PRIORITY
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
AccountingStorageLoc=/var/log/slurm-llnl/AccountingStorage.log
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/filetxt
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
JobCompLoc=/var/log/slurm-llnl/JobComp.log
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/filetxt
#JobCompUser=
JobAcctGatherFrequency=60
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=3
#SlurmctldLogFile=
SlurmdDebug=3
#SlurmdLogFile=
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=node[01-08] CPUs=16 RealMemory=16000 State=UNKNOWN
PartitionName=batch Nodes=node[01-08] Default=YES MaxTime=INFINITE State=UP

2018-01-15 16:43 GMT+01:00 Carlos Fenoy <mini...@gmail.com>:

> Are you trying to start the slurmd in the headnode or a compute node?
>
> Can you provide the slurm.conf file?
>
> Regards,
> Carlos
>
> On Mon, Jan 15, 2018 at 4:30 PM, Elisabetta Falivene <e.faliv...@ilabroma.com> wrote:
>
>> slurmd -Dvvv says
>>
>> slurmd: fatal: Unable to determine this slurmd's NodeName
>>
>> b
>>
>> 2018-01-15 15:58 GMT+01:00 Douglas Jacobsen <dmjacob...@lbl.gov>:
>>
>>> The fact that sinfo is responding shows that at least slurmctld is
>>> running. Slurmd, on the other hand, is not. Please also get the output
>>> of the slurmd log or of running "slurmd -Dvvv"
>>>
>>> On Jan 15, 2018 06:42, "Elisabetta Falivene" <e.faliv...@ilabroma.com> wrote:
>>>
>>>> > Anyway I suggest to update the operating system to stretch and fix your
>>>> > configuration under a more recent version of slurm.
>>>>
>>>> I think I'll soon get to that :)
>>>> b
>>>>
>>>> 2018-01-15 14:08 GMT+01:00 Gennaro Oliva <oliv...@na.icar.cnr.it>:
>>>>
>>>>> Ciao Elisabetta,
>>>>>
>>>>> On Mon, Jan 15, 2018 at 01:13:27PM +0100, Elisabetta Falivene wrote:
>>>>> > Error messages are not much helping me in guessing what is going on. What
>>>>> > should I check to get what is failing?
>>>>>
>>>>> check slurmctld.log and slurmd.log, you can find them under
>>>>> /var/log/slurm-llnl
>>>>>
>>>>> > PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>>>>> > batch        up   infinite      8   unk* node[01-08]
>>>>> >
>>>>> > Running
>>>>> > systemctl status slurmctld.service
>>>>> >
>>>>> > returns
>>>>> >
>>>>> > slurmctld.service - Slurm controller daemon
>>>>> >    Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled)
>>>>> >    Active: failed (Result: timeout) since Mon 2018-01-15 13:03:39 CET; 41s ago
>>>>> >   Process: 2098 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
>>>>> >
>>>>> > slurmctld[2100]: cons_res: select_p_reconfigure
>>>>> > slurmctld[2100]: cons_res: select_p_node_init
>>>>> > slurmctld[2100]: cons_res: preparing for 1 partitions
>>>>> > slurmctld[2100]: Running as primary controller
>>>>> > slurmctld[2100]: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
>>>>> > slurmctld.service start operation timed out. Terminating.
>>>>> > Terminate signal (SIGINT or SIGTERM) received
>>>>> > slurmctld[2100]: Saving all slurm state
>>>>> > Failed to start Slurm controller daemon.
>>>>> > Unit slurmctld.service entered failed state.
>>>>>
>>>>> Do you have a backup controller?
>>>>> Check your slurm.conf under:
>>>>> /etc/slurm-llnl
>>>>>
>>>>> Anyway I suggest to update the operating system to stretch and fix your
>>>>> configuration under a more recent version of slurm.
>>>>> Best regards
>>>>> --
>>>>> Gennaro Oliva
>
> --
> --
> Carles Fenoy
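A note that may help tie Carlos's question to the error above: slurmd fails with "fatal: Unable to determine this slurmd's NodeName" when the short hostname of the machine it runs on does not match any NodeName entry in slurm.conf. A quick way to compare the two (the hostnames below are only the ones appearing in the config above, so adjust as needed):

hostname -s                            # short hostname of this machine, e.g. "master" on the headnode
scontrol show hostnames 'node[01-08]'  # expands the NodeName list: node01 ... node08

The slurmctld/slurmd logs mentioned by Gennaro live under /var/log/slurm-llnl, so something like this shows the most recent errors:

tail -n 100 /var/log/slurm-llnl/slurmd.log /var/log/slurm-llnl/slurmctld.log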