There is a very strong likelihood that you have configured SlurmdUser=slurm and that one of the following is true:
1) there is no /var/spool/slurmd folder
2) the /var/spool/slurmd folder exists but is owned by root
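
To tell those two cases apart, something like the following might help. It is only a minimal sketch: `check_spool_dir` is a hypothetical helper (not a Slurm command), it assumes GNU `stat` as found on most Linux systems, and the directory and user name are simply the ones mentioned in this thread.

```shell
#!/bin/sh
# Hypothetical helper, not part of Slurm: report the state of a spool directory.
# Prints "missing", "wrong-owner:<actual owner>" or "ok".
check_spool_dir() {
    dir=$1
    expected_user=$2

    # Case 1: the directory does not exist at all.
    if [ ! -d "$dir" ]; then
        echo "missing"
        return 1
    fi

    # Case 2: the directory exists but belongs to someone else.
    # "stat -c %U" is the GNU coreutils form; BSD/macOS stat uses other flags.
    actual_user=$(stat -c %U "$dir")
    if [ "$actual_user" != "$expected_user" ]; then
        echo "wrong-owner:$actual_user"
        return 2
    fi

    echo "ok"
}

# Example call, using the path and user from this thread:
#   check_spool_dir /var/spool/slurmd slurm
#
# Possible fixes, run as root (assuming the daemon really runs as "slurm"):
#   mkdir -p /var/spool/slurmd        # case 1: create the directory
#   chown slurm /var/spool/slurmd     # case 2: hand it to the daemon's user
```

The function only reports; the `mkdir` and `chown` lines are left as comments so nothing is changed until you have confirmed which user slurmd actually runs as.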
Make sure it exists and is owned by whatever SlurmdUser is set to, or change your SlurmdUser to run as root. That may not be acceptable to you for security reasons, but if you were to change this it makes "doing cool stuff" in prologs and epilogs easier, as you can avoid complex passwordless sudo configs on all nodes.

Antony

On Wed, 13 Feb 2019 at 14:00, Nathalie Gocht <nathalie.go...@outlook.com> wrote:

> Hey,
>
> I am building up a one node cluster. Master and node are on the same
> machine. My slurm.conf:
>
> ControlMachine=bayes
> #
> MpiDefault=none
> ProctrackType=proctrack/pgid
> ReturnToService=1
> SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
> SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
> SlurmdPort=6818
> SlurmdSpoolDir=/var/spool/slurmd
> SlurmUser=slurm
> StateSaveLocation=/var/spool/slurmctld
> SwitchType=switch/none
> TaskPlugin=task/none
> #
> #
> # TIMERS
> InactiveLimit=0
> KillWait=30
> MinJobAge=300
> SlurmctldTimeout=120
> SlurmdTimeout=300
> Waittime=0
> #
> #
> # SCHEDULING
> FastSchedule=1
> SchedulerType=sched/builtin
> SelectType=select/linear
> #
> #
> # LOGGING AND ACCOUNTING
> AccountingStorageLoc=/var/log/slurm-llnl/job_accounting
> AccountingStorageType=accounting_storage/filetxt
> AccountingStoreJobComment=YES
> ClusterName=bayes
> JobCompLoc=/var/log/slurm-llnl/job_completion
> JobCompType=jobcomp/filetxt
> JobAcctGatherFrequency=60
> JobAcctGatherType=jobacct_gather/linux
> SlurmctldDebug=info
> SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
> SlurmdDebug=info
> SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
>
> # COMPUTE NODES
> GresTypes=gpu
>
> NodeName=bayes Gres=gpu:tesla:1 CPUs=48 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 State=UNKNOWN
> PartitionName=long Nodes=bayes Default=YES MaxTime=INFINITE State=UP
>
> I started the control daemon, but get this information:
>
> $ systemctl status slurmctld.service
> ● slurmctld.service - Slurm controller daemon
>    Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
>    Active: failed (Result: exit-code) since Wed 2019-02-13 14:43:02 CET; 7min ago
>      Docs: man:slurmctld(8)
>   Process: 40552 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCE
>  Main PID: 40560 (code=exited, status=1/FAILURE)
>
> $ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> long*        up   infinite      1   idle bayes
>
> I tried to start the slurm daemon, but it times out. slurmd -Dvvv
> gives:
>
> slurmd: error: chmod(/var/spool/slurmd, 0755): Operation not permitted
> slurmd: error: Unable to initialize slurmd spooldir
> slurmd: error: slurmd initialization failed
>
> Does someone know what's going on?
>
> Thanks,
> Nathalie