Hello,

Matthew Brown <brown...@vt.edu> writes:

> Minimum memory required per allocated CPU. ... Note that if the job's
> --mem-per-cpu value exceeds the configured MaxMemPerCPU, then the
> user's limit will be treated as a memory limit per task

Ah, thanks, I should've read the documentation more carefully.

From my limited tests today, everything now seems OK in the interactive
queue, but not in the 'batch' queue. For example, I just submitted three
jobs with different numbers of CPUs per job (4, 8 and 16 processes
respectively). MaxMemPerCPU is set to 2GB, and these jobs run the
'stress' command, consuming 3GB per process.

,----
| [user@xxx test]$ squeue
|  JOBID PARTITION NAME USER ST TIME TIME_LIMIT CPUS    QOS ACCOUNT NODELIST(REASON)
| 127564     batch test user  R 9:25      15:00   16 normal ddgroup xxx
| 127562     batch test user  R 9:25      15:00    4 normal ddgroup xxx
| 127563     batch test user  R 9:25      15:00    8 normal ddgroup xxx
`----

It looks like Slurm is trying to kill the jobs, but somehow not all the
processes die. As you can see below, after 9 minutes 2 of the 4
processes in job 127562 are still there, as are 3 of the 8 processes in
job 127563 and 6 of the 16 processes in job 127564:

,----
| [user@xxx test]$ ps -fea | grep stress
| user 1853317 1853314  0 22:35 ?     00:00:00 stress -m 16 -t 600 --vm-keep --vm-bytes 3G
| user 1853319 1853317 66 22:35 ?     00:06:17 stress -m 16 -t 600 --vm-keep --vm-bytes 3G
| user 1853320 1853317 65 22:35 ?     00:06:11 stress -m 16 -t 600 --vm-keep --vm-bytes 3G
| user 1853321 1853317 65 22:35 ?     00:06:11 stress -m 16 -t 600 --vm-keep --vm-bytes 3G
| user 1853328 1853317 65 22:35 ?     00:06:12 stress -m 16 -t 600 --vm-keep --vm-bytes 3G
| user 1853329 1853317 65 22:35 ?     00:06:12 stress -m 16 -t 600 --vm-keep --vm-bytes 3G
| user 1853338 1853337  0 22:35 ?     00:00:00 stress -m 8 -t 600 --vm-keep --vm-bytes 3G
| user 1853340 1853338 68 22:35 ?     00:06:32 stress -m 8 -t 600 --vm-keep --vm-bytes 3G
| user 1853341 1853338 69 22:35 ?     00:06:34 stress -m 8 -t 600 --vm-keep --vm-bytes 3G
| user 1853347 1853316  0 22:35 ?     00:00:00 stress -m 4 -t 600 --vm-keep --vm-bytes 3G
| user 1853350 1853347 68 22:35 ?     00:06:29 stress -m 4 -t 600 --vm-keep --vm-bytes 3G
| user 1854560 1511070  0 22:45 pts/2 00:00:00 grep stress
`----

And these processes are truly using 3GB:

,----
| [user@xxx test]$ ps -v 1853319
|     PID TTY STAT TIME MAJFL TRS     DRS     RSS %MEM COMMAND
| 1853319 ?   R    6:25  8642  11 3149428 3146040  1.1 stress -m 16 -t 600 --vm-keep --vm-bytes 3G
`----

Any idea how to solve/debug this?

Many thanks,
-- 
Ángel de Vicente
 Research Software Engineer (Supercomputing and BigData)
 Tel.: +34 922-605-747
 Web.: http://research.iac.es/proyecto/polmag/

 GPG: 0x8BDC390B69033F52