[slurm-users] jobs dropping

2024-10-25 Thread Mihai Ciubancan via slurm-users

Hello,

We are trying to run some PIConGPU codes on a machine with 8x H100 GPUs, 
using Slurm, but the jobs do not run and do not appear in the queue 
(a rough sketch of the kind of batch script we submit follows the log 
excerpt below). In the slurmd logs I have:


[2024-10-24T09:50:40.934] CPU_BIND: _set_batch_job_limits: Memory 
extracted from credential for StepId=1079.batch job_mem_limit= 648000

[2024-10-24T09:50:40.934] Launching batch job 1079 for UID 1009
[2024-10-24T09:50:40.938] debug:  acct_gather_energy/none: init: 
AcctGatherEnergy NONE plugin loaded
[2024-10-24T09:50:40.938] debug:  acct_gather_profile/none: init: 
AcctGatherProfile NONE plugin loaded
[2024-10-24T09:50:40.938] debug:  acct_gather_interconnect/none: init: 
AcctGatherInterconnect NONE plugin loaded
[2024-10-24T09:50:40.938] debug:  acct_gather_filesystem/none: init: 
AcctGatherFilesystem NONE plugin loaded

[2024-10-24T09:50:40.939] debug:  gres/gpu: init: loaded
[2024-10-24T09:50:41.022] [1079.batch] debug:  cgroup/v2: init: Cgroup 
v2 plugin loaded
[2024-10-24T09:50:41.026] [1079.batch] debug:  CPUs:192 Boards:1 
Sockets:2 CoresPerSocket:48 ThreadsPerCore:2
[2024-10-24T09:50:41.026] [1079.batch] debug:  jobacct_gather/cgroup: 
init: Job accounting gather cgroup plugin loaded
[2024-10-24T09:50:41.026] [1079.batch] CPU_BIND: Memory extracted from 
credential for StepId=1079.batch job_mem_limit=648000 
step_mem_limit=648000
[2024-10-24T09:50:41.027] [1079.batch] debug:  laying out the 8 tasks on 
1 hosts mihaigpu2 dist 2
[2024-10-24T09:50:41.027] [1079.batch] gres_job_state gres:gpu(7696487) 
type:(null)(0) job:1079 flags:

[2024-10-24T09:50:41.027] [1079.batch]   total_gres:8
[2024-10-24T09:50:41.027] [1079.batch]   node_cnt:1
[2024-10-24T09:50:41.027] [1079.batch]   gres_cnt_node_alloc[0]:8
[2024-10-24T09:50:41.027] [1079.batch]   gres_bit_alloc[0]:0-7 of 8
[2024-10-24T09:50:41.027] [1079.batch] debug:  Message thread started 
pid = 459054
[2024-10-24T09:50:41.027] [1079.batch] debug:  Setting 
slurmstepd(459054) oom_score_adj to -1000
[2024-10-24T09:50:41.027] [1079.batch] debug:  switch/none: init: switch 
NONE plugin loaded
[2024-10-24T09:50:41.027] [1079.batch] debug:  task/cgroup: init: core 
enforcement enabled
[2024-10-24T09:50:41.027] [1079.batch] debug:  task/cgroup: 
task_cgroup_memory_init: task/cgroup/memory: total:2063720M 
allowed:100%(enforced), swap:0%(permissive), max:100%(2063720M) 
max+swap:100%(4127440M) min:30M kmem:100%(2063720M permissive) min:30M
[2024-10-24T09:50:41.027] [1079.batch] debug:  task/cgroup: init: memory 
enforcement enabled
[2024-10-24T09:50:41.027] [1079.batch] debug:  task/cgroup: init: Tasks 
containment cgroup plugin loaded
[2024-10-24T09:50:41.027] [1079.batch] cred/munge: init: Munge 
credential signature plugin loaded
[2024-10-24T09:50:41.027] [1079.batch] debug:  job_container/none: init: 
job_container none plugin loaded
[2024-10-24T09:50:41.030] [1079.batch] debug:  spank: opening plugin 
stack /etc/slurm/plugstack.conf
[2024-10-24T09:50:41.030] [1079.batch] debug:  task/cgroup: 
task_cgroup_cpuset_create: job abstract cores are '0-63'
[2024-10-24T09:50:41.030] [1079.batch] debug:  task/cgroup: 
task_cgroup_cpuset_create: step abstract cores are '0-63'
[2024-10-24T09:50:41.030] [1079.batch] debug:  task/cgroup: 
task_cgroup_cpuset_create: job physical CPUs are 
'0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126,128,130,132,134,136,138,140,142,144,146,148,150,152,154,156,158'
[2024-10-24T09:50:41.030] [1079.batch] debug:  task/cgroup: 
task_cgroup_cpuset_create: step physical CPUs are 
'0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126,128,130,132,134,136,138,140,142,144,146,148,150,152,154,156,158'
[2024-10-24T09:50:41.031] [1079.batch] task/cgroup: _memcg_initialize: 
job: alloc=648000MB mem.limit=648000MB memsw.limit=unlimited
[2024-10-24T09:50:41.031] [1079.batch] task/cgroup: _memcg_initialize: 
step: alloc=648000MB mem.limit=648000MB memsw.limit=unlimited
[2024-10-24T09:50:41.064] [1079.batch] debug levels are stderr='error', 
logfile='debug', syslog='quiet'

[2024-10-24T09:50:41.064] [1079.batch] starting 1 tasks
[2024-10-24T09:50:41.064] [1079.batch] task 0 (459058) started 
2024-10-24T09:50:41
[2024-10-24T09:50:41.069] [1079.batch] _set_limit: RLIMIT_NOFILE : 
reducing req:1048576 to max:131072
[2024-10-24T09:51:23.066] debug:  _rpc_terminate_job: uid = 64030 
JobId=1079

[2024-10-24T09:51:23.067] debug:  credential for job 1079 revoked
[2024-10-24T09:51:23.067] [1079.batch] debug:  Handling 
REQUEST_SIGNAL_CONTAINER
[2024-10-24T09:51:23.067] [1079.batch] debug:  _handle_signal_container 
for StepId=1079.batch uid=64030 signal=18
[2024-10-24T09:51:23.068] [1079.batch] Sent signal 18 to 
StepId=1079.batch
[2024-10-24T09:51:23.068] [1079.batch] debug:  Handling 
REQUEST_SIGNAL_CONTAINER
[2024-10-24T09:51:23.068] [1079.batch] debug:  _handle_signal_container 
for StepI
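
As the timestamps show, the batch step starts at 09:50:41 and roughly 40 
seconds later slurmd receives _rpc_terminate_job and revokes the job 
credential, after which the job is gone from the queue.

For reference, a rough sketch of the kind of batch script we are 
submitting follows; module names, paths, and the picongpu options here 
are placeholders rather than our exact setup, but the resource requests 
roughly match what the log reports (8 tasks, 64 cores, 648000M):

    #!/bin/bash
    #SBATCH --job-name=picongpu-test     # placeholder name
    #SBATCH --nodes=1
    #SBATCH --ntasks=8                   # one task per GPU
    #SBATCH --cpus-per-task=8            # 8 tasks x 8 cores = 64 cores, as in the log
    #SBATCH --gres=gpu:8                 # all 8 H100s on the node
    #SBATCH --mem=648000M                # matches job_mem_limit=648000 in the log
    #SBATCH --output=picongpu_%j.out

    # placeholder environment setup
    module load cuda openmpi

    # placeholder run directory; picongpu options are illustrative only
    cd /path/to/picongpu/run
    srun ./bin/picongpu -d 2 2 2 -g 256 256 256 -s 1000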

[slurm-users] Scheduling oddity with multiple GPU types in same partition

2024-10-25 Thread Kevin M. Hildebrand via slurm-users
We have a 'gpu' partition with 30 or so nodes, some with A100s, some with
H100s, and a few others.
It appears that when, for example, all of the A100 GPUs are in use and
there are additional jobs requesting A100s pending with the highest
priority in the partition, jobs submitted for H100s won't run even though
there are idle H100s.  Below is a small subset of our current pending
queue: the bottom four jobs should be running, but aren't.  The top
pending job shows reason 'Resources' while the rest all show 'Priority'.
Any thoughts on why this might be happening?

JOBID    PRIORITY  TRES_ALLOC
8317749  501490    cpu=4,mem=8M,node=1,billing=48,gres/gpu=1,gres/gpu:a100=1
8317750  501490    cpu=4,mem=8M,node=1,billing=48,gres/gpu=1,gres/gpu:a100=1
8317745  501490    cpu=4,mem=8M,node=1,billing=48,gres/gpu=1,gres/gpu:a100=1
8317746  501490    cpu=4,mem=8M,node=1,billing=48,gres/gpu=1,gres/gpu:a100=1
8338679  500060    cpu=4,mem=64G,node=1,billing=144,gres/gpu=1,gres/gpu:h100=1
8338678  500060    cpu=4,mem=64G,node=1,billing=144,gres/gpu=1,gres/gpu:h100=1
8338677  500060    cpu=4,mem=64G,node=1,billing=144,gres/gpu=1,gres/gpu:h100=1
8338676  500060    cpu=4,mem=64G,node=1,billing=144,gres/gpu=1,gres/gpu:h100=1
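
For reference, the pending jobs above were submitted along these lines
(script names are placeholders and other options follow our site
defaults; the typed --gres, --cpus-per-task, and --mem values are taken
from the TRES shown):

    # hypothetical submissions matching the typed-GPU requests above
    sbatch -p gpu --gres=gpu:a100:1 --cpus-per-task=4 --mem=8M  a100_job.sh
    sbatch -p gpu --gres=gpu:h100:1 --cpus-per-task=4 --mem=64G h100_job.sh

Apart from memory and billing, the only difference between the two groups
is the typed GRES (gres/gpu:a100 vs. gres/gpu:h100), yet the H100 jobs
stay pending with reason 'Priority'.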


Thanks,
Kevin

--
Kevin Hildebrand
University of Maryland
Division of IT
