Hi all,

I am currently testing Slurm version 19.05.3-2 on CentOS 7 with one master and three compute nodes. I am using the same configuration that works on version 17.02.7, but for some reason it does not work on 19.05.3-2.
$ srun hostname
srun: error: Unable to create step for job 19: Error generating job credential
srun: Force Terminated job 19

If I run it as root, it works fine:

$ sudo srun hostname
piglet-18

Configuration:

$ cat /etc/slurm/slurm.conf
# Common
ControlMachine=slurm-master
ControlAddr=10.15.131.32
ClusterName=slurm-cluster
RebootProgram="/usr/sbin/reboot"
MailProg=/bin/mail
ProctrackType=proctrack/cgroup
ReturnToService=2
StateSaveLocation=/var/spool/slurmctld
TaskPlugin=task/cgroup

# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/filetxt
AccountingStorageLoc=/var/log/slurm_acct/slurm_jobacct.log
JobCompLoc=/var/log/slurm_acct/slurm_jobcomp.log
JobAcctGatherType=jobacct_gather/cgroup

# RESOURCES
MemLimitEnforce=no

## Rack 1
NodeName=piglet-19 NodeAddr=10.15.2.19 RealMemory=64000 TmpDisk=512000 Sockets=2 CoresPerSocket=28 ThreadsPerCore=1 CPUSpecList=0,1 Weight=2
NodeName=piglet-18 NodeAddr=10.15.2.18 RealMemory=128000 TmpDisk=512000 Sockets=2 CoresPerSocket=14 ThreadsPerCore=1 CPUSpecList=0,1 Weight=2
NodeName=piglet-17 NodeAddr=10.15.2.17 RealMemory=64000 TmpDisk=512000 Sockets=2 CoresPerSocket=28 ThreadsPerCore=1 CPUSpecList=0,1 Weight=3

# Preempt
PreemptMode=REQUEUE
PreemptType=preempt/qos
PartitionName=batch Nodes=ALL MaxTime=2880 OverSubscribe=YES State=UP PreemptMode=REQUEUE PriorityTier=10 Default=YES

# TIMERS
KillWait=30
MinJobAge=300
MessageTimeout=3

# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
#SelectTypeParameters=CR_Core_Memory
SelectTypeParameters=CR_CPU_Memory
DefMemPerCPU=128

# Limit
MaxArraySize=201

# slurmctld
SlurmctldDebug=5
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmctldPidFile=/var/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmctldTimeout=60
SlurmUser=slurm

# slurmd
SlurmdDebug=5
SlurmdLogFile=/var/log/slurmd.log
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmdTimeout=300

# REQUEUE
#RequeueExitHold=1-199,201-255
#RequeueExit=200
RequeueExitHold=201-255
RequeueExit=200
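(To rule out the daemons running with a stale copy of this file, I believe the usual checks are `scontrol show config` on the master and `slurmd -C` on a node; the grep pattern below is just an example of the fields I would compare:

$ scontrol show config | grep -Ei 'slurmuser|selecttype'
$ slurmd -C    # prints the hardware line slurmd detects for this node

Both are standard Slurm commands, nothing specific to my setup.)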
Slurmctld.log:

[2019-10-07T13:38:47.724] debug:  sched: Running job scheduler
[2019-10-07T13:38:49.254] error: slurm_auth_get_host: Lookup failed: Unknown host
[2019-10-07T13:38:49.255] sched: _slurm_rpc_allocate_resources JobId=19 NodeList=piglet-18 usec=959
[2019-10-07T13:38:49.259] debug:  laying out the 1 tasks on 1 hosts piglet-18 dist 2
[2019-10-07T13:38:49.260] error: slurm_cred_create: getpwuid failed for uid=1000
[2019-10-07T13:38:49.260] error: slurm_cred_create error
[2019-10-07T13:38:49.262] _job_complete: JobId=19 WTERMSIG 1
[2019-10-07T13:38:49.265] _job_complete: JobId=19 done
[2019-10-07T13:38:49.270] debug:  sched: Running job scheduler
[2019-10-07T13:38:56.823] debug:  sched: Running job scheduler
[2019-10-07T13:39:13.504] debug:  backfill: beginning
[2019-10-07T13:39:13.504] debug:  backfill: no jobs to backfill
[2019-10-07T13:39:40.871] debug:  Spawning ping agent for piglet-19
[2019-10-07T13:39:43.504] debug:  backfill: beginning
[2019-10-07T13:39:43.504] debug:  backfill: no jobs to backfill
[2019-10-07T13:39:46.999] error: slurm_auth_get_host: Lookup failed: Unknown host
[2019-10-07T13:39:47.001] sched: _slurm_rpc_allocate_resources JobId=20 NodeList=piglet-18 usec=979
[2019-10-07T13:39:47.005] debug:  laying out the 1 tasks on 1 hosts piglet-18 dist 2
[2019-10-07T13:39:47.144] _job_complete: JobId=20 WEXITSTATUS 0
[2019-10-07T13:39:47.147] _job_complete: JobId=20 done
[2019-10-07T13:39:47.158] debug:  sched: Running job scheduler
[2019-10-07T13:39:48.428] error: slurm_auth_get_host: Lookup failed: Unknown host
[2019-10-07T13:39:48.429] sched: _slurm_rpc_allocate_resources JobId=21 NodeList=piglet-18 usec=1114
[2019-10-07T13:39:48.434] debug:  laying out the 1 tasks on 1 hosts piglet-18 dist 2
[2019-10-07T13:39:48.559] _job_complete: JobId=21 WEXITSTATUS 0
[2019-10-07T13:39:48.560] _job_complete: JobId=21 done
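If I am reading this right, the interesting lines are the two errors: the controller fails a host lookup (slurm_auth_get_host) and then a passwd lookup for uid 1000 (getpwuid), which is my user. As a sanity check on the master, both lookups can be exercised with standard glibc tools (nothing Slurm-specific; the hostnames below are just mine):

$ getent hosts slurm-master piglet-18   # name resolution the controller relies on
$ getent passwd 1000                    # uid-to-user mapping used by getpwuid()

If `getent passwd 1000` returns nothing on the master, that would match the credential error above.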
slurmd.log on piglet-18:

[2019-10-07T13:38:42.746] debug:  _rpc_terminate_job, uid = 3001
[2019-10-07T13:38:42.747] debug:  credential for job 17 revoked
[2019-10-07T13:38:47.721] debug:  _rpc_terminate_job, uid = 3001
[2019-10-07T13:38:47.722] debug:  credential for job 18 revoked
[2019-10-07T13:38:49.267] debug:  _rpc_terminate_job, uid = 3001
[2019-10-07T13:38:49.268] debug:  credential for job 19 revoked
[2019-10-07T13:39:47.014] launch task 20.0 request from UID:0 GID:0 HOST:10.15.2.19 PORT:62137
[2019-10-07T13:39:47.014] debug:  Checking credential with 404 bytes of sig data
[2019-10-07T13:39:47.016] _run_prolog: run job script took usec=7
[2019-10-07T13:39:47.016] _run_prolog: prolog with lock for job 20 ran for 0 seconds
[2019-10-07T13:39:47.026] debug:  AcctGatherEnergy NONE plugin loaded
[2019-10-07T13:39:47.026] debug:  AcctGatherProfile NONE plugin loaded
[2019-10-07T13:39:47.026] debug:  AcctGatherInterconnect NONE plugin loaded
[2019-10-07T13:39:47.026] debug:  AcctGatherFilesystem NONE plugin loaded
[2019-10-07T13:39:47.026] debug:  switch NONE plugin loaded
[2019-10-07T13:39:47.028] [20.0] debug:  CPUs:28 Boards:1 Sockets:2 CoresPerSocket:14 ThreadsPerCore:1
[2019-10-07T13:39:47.028] [20.0] debug:  Job accounting gather cgroup plugin loaded
[2019-10-07T13:39:47.028] [20.0] debug:  cont_id hasn't been set yet not running poll
[2019-10-07T13:39:47.029] [20.0] debug:  Message thread started pid = 30331
[2019-10-07T13:39:47.030] [20.0] debug:  task/cgroup: now constraining jobs allocated cores
[2019-10-07T13:39:47.030] [20.0] debug:  task/cgroup: loaded
[2019-10-07T13:39:47.030] [20.0] debug:  Checkpoint plugin loaded: checkpoint/none
[2019-10-07T13:39:47.030] [20.0] Munge credential signature plugin loaded
[2019-10-07T13:39:47.031] [20.0] debug:  job_container none plugin loaded
[2019-10-07T13:39:47.031] [20.0] debug:  mpi type = none
[2019-10-07T13:39:47.031] [20.0] debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/freezer/slurm' already exists
[2019-10-07T13:39:47.031] [20.0] debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
[2019-10-07T13:39:47.031] [20.0] debug:  mpi type = (null)
[2019-10-07T13:39:47.031] [20.0] debug:  mpi/none: slurmstepd prefork
[2019-10-07T13:39:47.031] [20.0] debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuset/slurm' already exists
[2019-10-07T13:39:47.032] [20.0] debug:  task/cgroup: job abstract cores are '2'
[2019-10-07T13:39:47.032] [20.0] debug:  task/cgroup: step abstract cores are '2'
[2019-10-07T13:39:47.032] [20.0] debug:  task/cgroup: job physical cores are '4'
[2019-10-07T13:39:47.032] [20.0] debug:  task/cgroup: step physical cores are '4'
[2019-10-07T13:39:47.065] [20.0] debug level = 2
[2019-10-07T13:39:47.065] [20.0] starting 1 tasks
[2019-10-07T13:39:47.066] [20.0] task 0 (30336) started 2019-10-07T13:39:47
[2019-10-07T13:39:47.066] [20.0] debug:  jobacct_gather_cgroup_cpuacct_attach_task: jobid 20 stepid 0 taskid 0 max_task_id 0
[2019-10-07T13:39:47.066] [20.0] debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuacct/slurm' already exists
[2019-10-07T13:39:47.067] [20.0] debug:  jobacct_gather_cgroup_memory_attach_task: jobid 20 stepid 0 taskid 0 max_task_id 0
[2019-10-07T13:39:47.067] [20.0] debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm' already exists
[2019-10-07T13:39:47.068] [20.0] debug:  IO handler started pid=30331
[2019-10-07T13:39:47.099] [20.0] debug:  jag_common_poll_data: Task 0 pid 30336 ave_freq = 1597534 mem size/max 0/0 vmem size/max 210853888/210853888, disk read size/max (0/0), disk write size/max (0/0), time 0.000000(0+0) Energy tot/max 0/0 TotPower 0 MaxPower 0 MinPower 0
[2019-10-07T13:39:47.101] [20.0] debug:  mpi type = (null)
[2019-10-07T13:39:47.101] [20.0] debug:  Using mpi/none
[2019-10-07T13:39:47.102] [20.0] debug:  CPUs:28 Boards:1 Sockets:2 CoresPerSocket:14 ThreadsPerCore:1
[2019-10-07T13:39:47.104] [20.0] debug:  Sending launch resp rc=0
[2019-10-07T13:39:47.105] [20.0] task 0 (30336) exited with exit code 0.
[2019-10-07T13:39:47.139] [20.0] debug:  step_terminate_monitor_stop signaling condition
[2019-10-07T13:39:47.139] [20.0] debug:  Waiting for IO
[2019-10-07T13:39:47.140] [20.0] debug:  Closing debug channel
[2019-10-07T13:39:47.140] [20.0] debug:  IO handler exited, rc=0
[2019-10-07T13:39:47.148] [20.0] debug:  Message thread exited
[2019-10-07T13:39:47.149] [20.0] done with job

I am not sure what I am missing. I hope someone can point out what I am doing wrong here. Thank you.

Best regards,
Eddy Swan
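P.S. Since the nodes authenticate with MUNGE (the slurmd log above shows the Munge credential signature plugin), the round-trip test from the MUNGE documentation may also be relevant; assuming SSH access from the master to a node:

$ munge -n | unmunge                   # encode and decode a credential locally
$ munge -n | ssh piglet-18 unmunge     # decode on a compute node

A munge key or clock mismatch between hosts would show up as errors here.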