> check slurmctld.log and slurmd.log, you can find them under
> /var/log/slurm-llnl

Unfortunately, I don't have these files. Even 'locate' can't find them.

> Do you have a backup controller?
> Check your slurm.conf under:
> /etc/slurm-llnl

It seems not:

#BackupController=
#BackupAddr=

> /usr/sbin/slurmctld -d -vvv

File not found :(

I can start it with

/etc/init.d/slurmctld restart

but I get

[....] Restarting slurmctld (via systemctl): slurmctld.serviceJob for slurmctld.service failed. See 'systemctl status slurmctld.service' and 'journalctl -xn' for details.
failed!

journalctl -xn

-- Logs begin at Mon 2018-01-15 12:44:48 CET, end at Mon 2018-01-15 15:23:31 CET. --
Jan 15 15:23:00 anyone.phys.uniroma1.it slurmctld[5133]: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
dhcpd[1550]: DHCPREQUEST for 192.168.1.106 from 10:bf:48:1a:02:5e via eth1
dhcpd[1550]: DHCPACK on 192.168.1.106 to 10:bf:48:1a:02:5e via eth1
dhcpd[1550]: DHCPREQUEST for 192.168.1.104 from bc:ae:c5:12:97:3b via eth1
dhcpd[1550]: DHCPACK on 192.168.1.104 to bc:ae:c5:12:97:3b via eth1
systemd[1]: slurmctld.service start operation timed out. Terminating.
slurmctld[5133]: Terminate signal (SIGINT or SIGTERM) received
slurmctld[5133]: Saving all slurm state
systemd[1]: Failed to start Slurm controller daemon.
-- Subject: Unit slurmctld.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit slurmctld.service has failed.
--
-- The result is failed.

2018-01-15 14:16 GMT+01:00 Williams, Jenny Avis <jen...@email.unc.edu>:

> Elisabetta-
>
> Start by focusing on slurmctld. Slurmd not happy without it.
> Start it manually in the foreground as in
> /usr/sbin/slurmctld -d -vvv
>
> This assumes slurm.conf is in default location.
> Pardon brevity; on my phone
> Jenny Williams
>
>
> Sent from Nine <http://www.9folders.com/>
> ------------------------------
> *From:* Elisabetta Falivene <e.faliv...@ilabroma.com>
> *Sent:* Monday, January 15, 2018 7:14 AM
> *To:* Slurm User Community List
> *Subject:* [slurm-users] Slurm not starting
>
> I did an upgrade from wheezy to jessie (automatically with a normal
> dist-upgrade) on a cluster with 8 nodes (up, running and reachable) and
> from slurm 2.3.4 to 14.03.9. Overcame some problems booting kernel (thank
> you very much to Gennaro Oliva, btw), now the system is running correctly
> with kernel 3.16.0.4, but slurm isn't starting. I tried restarting
> services, but it seems it isn't able to do it.
>
> Error messages are not much helping me in guessing what is going on. What
> should I check to get what is failing?
>
> Thank you
> Elisabetta
>
> PS: Here are some tests I did
>
> Running
> *sinfo*
>
> returns
>
> *PARTITION AVAIL TIMELIMIT NODES STATE NODELIST*
> *batch* up infinite 8 unk* node[01-08]*
>
>
> Running
> *systemctl status slurmctld.service*
>
> returns
>
> *slurmctld.service - Slurm controller daemon*
> * Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled)*
> * Active: failed (Result: timeout) since Mon 2018-01-15 13:03:39 CET;
> 41s ago*
> * Process: 2098 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS
> (code=exited, status=0/SUCCESS)*
>
> * slurmctld[2100]: cons_res: select_p_reconfigure*
> * slurmctld[2100]: cons_res: select_p_node_init*
> * slurmctld[2100]: cons_res: preparing for 1 partitions*
> * slurmctld[2100]: Running as primary controller*
> * slurmctld[2100]:
> SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0*
> * slurmctld.service start operation timed out. Terminating.*
> *Terminate signal (SIGINT or SIGTERM) received*
> * slurmctld[2100]: Saving all slurm state*
> * Failed to start Slurm controller daemon.*
> * Unit slurmctld.service entered failed state.*
>
> and running
>
> */etc/init.d/slurmd status*
>
> returns
>
> *slurmd.service - Slurm node daemon*
> * Loaded: loaded (/lib/systemd/system/slurmd.service;
> enabled)*
> * Active: failed (Result: exit-code) since Mon 2018-01-15 12:44:52 CET;
> 21min ago*
> * Process: 729 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited,
> status=1/FAILURE)*
>
> *slurmd.service: control process exited, code=exited status=1*
> *systemd[1]: Failed to start Slurm node daemon.*
> *Unit slurmd.service entered failed state.*
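
On the missing log files: a minimal sketch of how one might check where the daemons are configured to log, assuming the configuration lives at /etc/slurm-llnl/slurm.conf as mentioned above. The helper name slurm_log_paths is hypothetical and not part of Slurm; it only greps the config for the SlurmctldLogFile and SlurmdLogFile settings. If neither is set, Slurm logs through syslog instead of a file, which would be consistent with nothing existing under /var/log/slurm-llnl.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): print the log-file settings
# found in a slurm.conf-style file.
# Usage: slurm_log_paths [path-to-slurm.conf]
slurm_log_paths() {
    # Default path is an assumption taken from the thread (Debian layout).
    conf="${1:-/etc/slurm-llnl/slurm.conf}"
    # Match only uncommented SlurmctldLogFile= / SlurmdLogFile= lines;
    # -i because slurm.conf parameter names are case-insensitive.
    grep -iE '^[[:space:]]*(SlurmctldLogFile|SlurmdLogFile)[[:space:]]*=' "$conf"
}
```

Calling `slurm_log_paths` with no argument reads the default path; empty output means no log file is configured. Note that the foreground flag documented for slurmctld is the uppercase `-D` (for example `slurmctld -D -vvv`), which prints the log to the terminal.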