[slurm-users] Slurm Remote Task Launch
Hello,

I have a Slurm job which needs to launch multiple tasks across the hosts allocated to the job. My requirement is that most of the tasks need to be launched from within the main task launched by Slurm on the launch compute node. So, if the allocated hosts are h1, h2 & h3, with h1 being the main launcher node, then the initial task, say launchTask, launched on h1 will need to launch RemoteTask1 and RemoteTask2 on h2 & h3 at some point during execution.

Can I use srun from inside launchTask to do so? If yes, what would be the syntax / args? If not, what alternative do I have other than rsh/ssh, which may not be available in the cluster?

Thanks in advance!

Regards,
Bhaskar.
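A minimal sketch of the srun-inside-the-job approach the question asks about, assuming launchTask is (or is started by) the batch script itself and that h1, h2 and h3 were allocated as described; the --overlap flag is only relevant on Slurm 20.11 and newer, where job steps do not share allocated resources by default:

    #!/bin/bash
    #SBATCH --nodes=3
    #SBATCH --ntasks=3

    # This script runs as the batch step on the first allocated node (h1).
    # Each srun below starts a job step pinned to one specific host of the
    # existing allocation via -w/--nodelist, running in the background so
    # the main task can keep working while the remote tasks execute.
    srun --nodes=1 --ntasks=1 -w h2 --overlap ./RemoteTask1 &
    srun --nodes=1 --ntasks=1 -w h3 --overlap ./RemoteTask2 &

    # ... launchTask's own work on h1 ...

    wait    # block until both remote steps have completed

Because srun here runs inside an existing allocation, it launches job steps on the already-allocated nodes rather than submitting new jobs, so no rsh/ssh access is needed.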
[slurm-users] Bug? sbatch not respecting MaxMemPerNode setting
Hello,

we found an issue with Slurm 24.05.1 and the MaxMemPerNode setting. Slurm is installed in a single workstation, and thus the number of nodes is just 1. The relevant sections in slurm.conf read:

    EnforcePartLimits=ALL
    PartitionName=short Nodes=. State=UP Default=YES MaxTime=2-00:00:00 MaxCPUsPerNode=76 MaxMemPerNode=231000 OverSubscribe=FORCE:1

Now, if I submit a job requesting 76 CPUs and each one needing 4000M (for a total of 304000M), Slurm does indeed respect the MaxMemPerNode setting and the job is not submitted in the following cases ("-N 1" is not really necessary, as there is only one node):

    $ sbatch -N 1 -n 1 -c 76 -p short --mem-per-cpu=4000M test.batch
    sbatch: error: Batch job submission failed: Memory required by task is not available

    $ sbatch -N 1 -n 76 -c 1 -p short --mem-per-cpu=4000M test.batch
    sbatch: error: Batch job submission failed: Memory required by task is not available

    $ sbatch -n 1 -c 76 -p short --mem-per-cpu=4000M test.batch
    sbatch: error: Batch job submission failed: Memory required by task is not available

But with this submission Slurm is happy:

    $ sbatch -n 76 -c 1 -p short --mem-per-cpu=4000M test.batch
    Submitted batch job 133982

and the slurmjobcomp.log file does indeed tell me that the memory went above MaxMemPerNode:

    JobId=133982 UserId=..(10487) GroupId=domain users(2000) Name=test JobState=CANCELLED Partition=short TimeLimit=45 StartTime=2024-09-04T09:11:17 EndTime=2024-09-04T09:11:24 NodeList=.. NodeCnt=1 ProcCnt=76 WorkDir=/tmp/. ReservationName= Tres=cpu=76,mem=304000M,node=1,billing=76 Account=ddgroup QOS=domino WcKey= Cluster=.. SubmitTime=2024-09-04T09:11:17 EligibleTime=2024-09-04T09:11:17 DerivedExitCode=0:0 ExitCode=0:0

What is the best way to report issues like this to the Slurm developers? I thought of adding it to https://support.schedmd.com/ but it is not clear to me whether that page is only meant for Slurm users with a support contract?

Cheers,
--
Ángel de Vicente
Research Software Engineer (Supercomputing and BigData)
Instituto de Astrofísica de Canarias (https://www.iac.es/en)
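For reference, a couple of standard commands (not part of the original report) that can confirm which limit the controller is enforcing and what the job was actually granted; the job ID below is the one from the report above:

    # Show the limits configured for the partition
    $ scontrol show partition short

    # Show the TRES actually allocated to the job that slipped through
    $ sacct -j 133982 --format=JobID,AllocTRES%60,State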
[slurm-users] Configuration for nodes with different TmpFs locations and TmpDisk sizes
Hi,

We have a number of machines in our compute cluster that have larger disks available for local data. I would like to add them to the same partition as the rest of the nodes but assign them a larger TmpDisk value, which would allow users to request a larger tmp and land on those machines.

The main hurdle is that (for reasons beyond my control) the larger local disks are on a special mount point, /largertmp, whereas the rest of the compute cluster uses the vanilla /tmp. I can't see an obvious way to make this work, as the TmpFs value appears to be global only, and attempting to set TmpDisk to a value larger than the space actually available under TmpFs for those nodes will put the machine into an invalid state.

I couldn't see any similar support tickets or anything in the mail archive, but I wouldn't have thought it would be that unusual to want to do this.

Thanks in advance!

Jake
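A sketch of the slurm.conf pieces involved, illustrating why this is awkward: TmpFS is a single global path, while TmpDisk is per node, and slurmd validates the configured TmpDisk against the space it finds under the global TmpFS path. Node names and sizes below are purely illustrative, not taken from the message above:

    # slurm.conf (illustrative values)
    TmpFS=/tmp                       # global: every slurmd measures this one path

    # TmpDisk is given in MB and can differ per node, but a node whose
    # TmpFS path holds less space than its TmpDisk registers as invalid.
    NodeName=node[01-20]    CPUs=64 RealMemory=256000 TmpDisk=100000
    NodeName=bignode[01-04] CPUs=64 RealMemory=256000 TmpDisk=2000000

    PartitionName=compute Nodes=node[01-20],bignode[01-04] Default=YES State=UP

Users would then be steered to the large-disk nodes with a request such as sbatch --tmp=1500000 (the --tmp option asks for a minimum amount of temporary disk space per node, in MB by default), which appears to be the behaviour being aimed for here.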
[slurm-users] Re: Bug? sbatch not respecting MaxMemPerNode setting
Angel,

Unless you are using cgroups and constraints, there is no limit imposed. The numbers are used by Slurm to track what is available, not what you may or may not use. So you could tell Slurm the node only has 1GB and it will not let you request more than that, but if you do request only 1GB, without specific configuration there is nothing stopping you from using more than that.

So your request did not exceed what Slurm sees as available (1 CPU using 4GB), so it is happy to let your script run. I suspect if you look at the usage, you will see that 1 CPU spiked high while the others did nothing.

Brian Andrus
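For completeness, a sketch of the kind of configuration being referred to: the parameter names are standard Slurm options for enforcing (rather than merely tracking) requested memory, but the exact combination of values is an assumption, not something stated in this thread:

    # slurm.conf
    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core_Memory    # treat memory as a consumable resource
    TaskPlugin=task/affinity,task/cgroup   # place each task under cgroup control

    # cgroup.conf
    ConstrainCores=yes
    ConstrainRAMSpace=yes                  # enforce the requested memory via cgroup limits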