[slurm-users] jobs dropping

Mihai Ciubancan via slurm-users Fri, 25 Oct 2024 01:02:07 -0700

Hello,

We are trying to run some PIConGPU codes on a machine with 8x H100s using Slurm, but the jobs don't run and do not appear in the queue. In the slurmd logs I have:

[2024-10-24T09:50:40.934] CPU_BIND: _set_batch_job_limits: Memory extracted from credential for StepId=1079.batch job_mem_limit= 648000

[2024-10-24T09:50:40.934] Launching batch job 1079 for UID 1009

[2024-10-24T09:50:40.938] debug:  acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
[2024-10-24T09:50:40.938] debug:  acct_gather_profile/none: init: AcctGatherProfile NONE plugin loaded
[2024-10-24T09:50:40.938] debug:  acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
[2024-10-24T09:50:40.938] debug:  acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded

[2024-10-24T09:50:40.939] debug:  gres/gpu: init: loaded

[2024-10-24T09:50:41.022] [1079.batch] debug:  cgroup/v2: init: Cgroup v2 plugin loaded
[2024-10-24T09:50:41.026] [1079.batch] debug:  CPUs:192 Boards:1 Sockets:2 CoresPerSocket:48 ThreadsPerCore:2
[2024-10-24T09:50:41.026] [1079.batch] debug:  jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded
[2024-10-24T09:50:41.026] [1079.batch] CPU_BIND: Memory extracted from credential for StepId=1079.batch job_mem_limit=648000 step_mem_limit=648000
[2024-10-24T09:50:41.027] [1079.batch] debug:  laying out the 8 tasks on 1 hosts mihaigpu2 dist 2
[2024-10-24T09:50:41.027] [1079.batch] gres_job_state gres:gpu(7696487) type:(null)(0) job:1079 flags:

[2024-10-24T09:50:41.027] [1079.batch]   total_gres:8
[2024-10-24T09:50:41.027] [1079.batch]   node_cnt:1
[2024-10-24T09:50:41.027] [1079.batch]   gres_cnt_node_alloc[0]:8
[2024-10-24T09:50:41.027] [1079.batch]   gres_bit_alloc[0]:0-7 of 8

[2024-10-24T09:50:41.027] [1079.batch] debug:  Message thread started pid = 459054
[2024-10-24T09:50:41.027] [1079.batch] debug:  Setting slurmstepd(459054) oom_score_adj to -1000
[2024-10-24T09:50:41.027] [1079.batch] debug:  switch/none: init: switch NONE plugin loaded
[2024-10-24T09:50:41.027] [1079.batch] debug:  task/cgroup: init: core enforcement enabled
[2024-10-24T09:50:41.027] [1079.batch] debug:  task/cgroup: task_cgroup_memory_init: task/cgroup/memory: total:2063720M allowed:100%(enforced), swap:0%(permissive), max:100%(2063720M) max+swap:100%(4127440M) min:30M kmem:100%(2063720M permissive) min:30M
[2024-10-24T09:50:41.027] [1079.batch] debug:  task/cgroup: init: memory enforcement enabled
[2024-10-24T09:50:41.027] [1079.batch] debug:  task/cgroup: init: Tasks containment cgroup plugin loaded
[2024-10-24T09:50:41.027] [1079.batch] cred/munge: init: Munge credential signature plugin loaded
[2024-10-24T09:50:41.027] [1079.batch] debug:  job_container/none: init: job_container none plugin loaded
[2024-10-24T09:50:41.030] [1079.batch] debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
[2024-10-24T09:50:41.030] [1079.batch] debug:  task/cgroup: task_cgroup_cpuset_create: job abstract cores are '0-63'
[2024-10-24T09:50:41.030] [1079.batch] debug:  task/cgroup: task_cgroup_cpuset_create: step abstract cores are '0-63'
[2024-10-24T09:50:41.030] [1079.batch] debug:  task/cgroup: task_cgroup_cpuset_create: job physical CPUs are '0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126,128,130,132,134,136,138,140,142,144,146,148,150,152,154,156,158'
[2024-10-24T09:50:41.030] [1079.batch] debug:  task/cgroup: task_cgroup_cpuset_create: step physical CPUs are '0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126,128,130,132,134,136,138,140,142,144,146,148,150,152,154,156,158'
[2024-10-24T09:50:41.031] [1079.batch] task/cgroup: _memcg_initialize: job: alloc=648000MB mem.limit=648000MB memsw.limit=unlimited
[2024-10-24T09:50:41.031] [1079.batch] task/cgroup: _memcg_initialize: step: alloc=648000MB mem.limit=648000MB memsw.limit=unlimited
[2024-10-24T09:50:41.064] [1079.batch] debug levels are stderr='error', logfile='debug', syslog='quiet'

[2024-10-24T09:50:41.064] [1079.batch] starting 1 tasks

[2024-10-24T09:50:41.064] [1079.batch] task 0 (459058) started 2024-10-24T09:50:41
[2024-10-24T09:50:41.069] [1079.batch] _set_limit: RLIMIT_NOFILE : reducing req:1048576 to max:131072
[2024-10-24T09:51:23.066] debug:  _rpc_terminate_job: uid = 64030 JobId=1079

[2024-10-24T09:51:23.067] debug:  credential for job 1079 revoked

[2024-10-24T09:51:23.067] [1079.batch] debug:  Handling REQUEST_SIGNAL_CONTAINER
[2024-10-24T09:51:23.067] [1079.batch] debug:  _handle_signal_container for StepId=1079.batch uid=64030 signal=18
[2024-10-24T09:51:23.068] [1079.batch] Sent signal 18 to StepId=1079.batch
[2024-10-24T09:51:23.068] [1079.batch] debug:  Handling REQUEST_SIGNAL_CONTAINER
[2024-10-24T09:51:23.068] [1079.batch] debug:  _handle_signal_container for StepId=1079.batch uid=64030 signal=15
[2024-10-24T09:51:23.068] [1079.batch] error: *** JOB 1079 ON mihaigpu2 CANCELLED AT 2024-10-24T09:51:23 ***
[2024-10-24T09:51:23.069] [1079.batch] Sent signal 15 to StepId=1079.batch

[2024-10-24T09:51:23.069] [1079.batch] debug:  Handling REQUEST_STATE

[2024-10-24T09:51:23.071] [1079.batch] task 0 (459058) exited. Killed by signal 15.

[2024-10-24T09:51:23.090] [1079.batch] debug:  Handling REQUEST_STATE
[2024-10-24T09:51:23.141] [1079.batch] debug:  Handling REQUEST_STATE
[2024-10-24T09:51:23.241] [1079.batch] debug:  Handling REQUEST_STATE
[2024-10-24T09:51:23.741] [1079.batch] debug:  Handling REQUEST_STATE
[2024-10-24T09:51:24.073] [1079.batch] debug:  signaling condition

[2024-10-24T09:51:24.073] [1079.batch] debug:  jobacct_gather/cgroup: fini: Job accounting gather cgroup plugin unloaded
[2024-10-24T09:51:24.073] [1079.batch] debug:  task/cgroup: fini: Tasks containment cgroup plugin unloaded
[2024-10-24T09:51:24.073] [1079.batch] debug:  get_exit_code task 0 killed by cmd
[2024-10-24T09:51:24.073] [1079.batch] job 1079 completed with slurm_rc = 0, job_rc = 15

[2024-10-24T09:51:24.075] [1079.batch] debug:  Message thread exited
[2024-10-24T09:51:24.154] [1079.batch] done with job


Does anyone have any idea what the problem could be?

Thank you,
Mihai

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
