slurmd -Dvvv says slurmd: fatal: Unable to determine this slurmd's NodeName
b

2018-01-15 15:58 GMT+01:00 Douglas Jacobsen <dmjacob...@lbl.gov>:
> The fact that sinfo is responding shows that at least slurmctld is
> running. Slurmd, on the other hand, is not. Please also get the output of
> the slurmd log, or of running "slurmd -Dvvv".
>
> On Jan 15, 2018 06:42, "Elisabetta Falivene" <e.faliv...@ilabroma.com> wrote:
>
>> > Anyway I suggest to update the operating system to stretch and fix your
>> > configuration under a more recent version of slurm.
>>
>> I think I'll soon arrive at that :)
>> b
>>
>> 2018-01-15 14:08 GMT+01:00 Gennaro Oliva <oliv...@na.icar.cnr.it>:
>>
>>> Ciao Elisabetta,
>>>
>>> On Mon, Jan 15, 2018 at 01:13:27PM +0100, Elisabetta Falivene wrote:
>>> > Error messages are not helping me much in guessing what is going on.
>>> > What should I check to find out what is failing?
>>>
>>> Check slurmctld.log and slurmd.log; you can find them under
>>> /var/log/slurm-llnl
>>>
>>> > PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST
>>> > batch      up     infinite       8  unk*   node[01-08]
>>> >
>>> > Running
>>> >     systemctl status slurmctld.service
>>> > returns
>>> >
>>> >     slurmctld.service - Slurm controller daemon
>>> >        Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled)
>>> >        Active: failed (Result: timeout) since Mon 2018-01-15 13:03:39 CET; 41s ago
>>> >       Process: 2098 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
>>> >
>>> >     slurmctld[2100]: cons_res: select_p_reconfigure
>>> >     slurmctld[2100]: cons_res: select_p_node_init
>>> >     slurmctld[2100]: cons_res: preparing for 1 partitions
>>> >     slurmctld[2100]: Running as primary controller
>>> >     slurmctld[2100]: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
>>> >     slurmctld.service start operation timed out. Terminating.
>>> >     Terminate signal (SIGINT or SIGTERM) received
>>> >     slurmctld[2100]: Saving all slurm state
>>> >     Failed to start Slurm controller daemon.
>>> >     Unit slurmctld.service entered failed state.
>>>
>>> Do you have a backup controller?
>>> Check your slurm.conf under:
>>> /etc/slurm-llnl
>>>
>>> Anyway I suggest to update the operating system to stretch and fix your
>>> configuration under a more recent version of slurm.
>>> Best regards
>>> --
>>> Gennaro Oliva
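[Editor's note] The fatal error at the top of the thread ("Unable to determine this slurmd's NodeName") typically means the node's short hostname does not match any NodeName entry in slurm.conf (under /etc/slurm-llnl on this Debian setup). A minimal sketch of that check, using a sample config fragment written to a temporary file (the file path is illustrative; the node list node[01-08] is taken from the sinfo output above):

```shell
#!/bin/sh
# Write a hypothetical slurm.conf fragment to a temp file for illustration;
# on the cluster in the thread the real file is /etc/slurm-llnl/slurm.conf.
cat > /tmp/slurm.conf.sample <<'EOF'
NodeName=node[01-08] CPUs=4 State=UNKNOWN
PartitionName=batch Nodes=node[01-08] Default=YES MaxTime=INFINITE State=UP
EOF

# slurmd matches the machine's short hostname against the NodeName entries.
# If "hostname -s" prints something not covered by node[01-08] (e.g. "node9"
# or a fully different name), slurmd aborts with the fatal error above.
echo "this host's short name: $(hostname -s)"
grep '^NodeName=' /tmp/slurm.conf.sample
```

If the names really do differ, either fix the NodeName line (or the host's name) so they agree, or start the daemon with an explicit name via `slurmd -N <nodename>`.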