[slurm-users] Slurm Remote Task Launch

2024-09-04 Thread Bhaskar Chakraborty via slurm-users
Hello,
I have a Slurm job which needs to launch multiple tasks across the hosts
allocated to the job.
My requirement is that most of the tasks need to be launched from within
the main task launched by Slurm on the launch compute node.
So, if the allocated hosts are h1, h2 & h3, with h1 being the main launcher
node, then the initial task, say launchTask, launched on h1 will need to
launch RemoteTask1 and RemoteTask2 on h2 & h3 at some point during execution.
Can I use srun from inside launchTask to do so? If yes, what would be the
syntax / arguments? If no, what alternative do I have other than rsh/ssh,
which may not be available in the cluster?
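
Something like the following is what I have in mind (just a rough sketch,
assuming srun can be called from inside the allocation; node and task names
are from the example above, everything else is hypothetical):

    # inside launchTask, which itself runs on h1 within the job's allocation
    srun --nodes=1 --ntasks=1 --nodelist=h2 ./RemoteTask1 &
    srun --nodes=1 --ntasks=1 --nodelist=h3 ./RemoteTask2 &
    wait    # block until both remote tasks finish

The idea being that each srun starts a separate job step inside the existing
allocation, so the remote processes stay tracked by Slurm without rsh/ssh.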
Thanks in advance!
Regards,
Bhaskar.


[slurm-users] Bug? sbatch not respecting MaxMemPerNode setting

2024-09-04 Thread Angel de Vicente via slurm-users
Hello,

We found an issue with Slurm 24.05.1 and the MaxMemPerNode setting.
Slurm is installed on a single workstation, and thus the number of
nodes is just 1.

The relevant sections in slurm.conf read:

,
| EnforcePartLimits=ALL
| PartitionName=short Nodes=. State=UP Default=YES MaxTime=2-00:00:00 MaxCPUsPerNode=76 MaxMemPerNode=231000 OverSubscribe=FORCE:1
`

Now, if I submit a job requesting 76 CPUs, each needing 4000M (for a
total of 304000M), Slurm does indeed respect the MaxMemPerNode setting
and the job is not submitted in the following cases ("-N 1" is not
really necessary, as there is only one node):

,
| $ sbatch -N 1 -n 1 -c 76 -p short --mem-per-cpu=4000M test.batch
| sbatch: error: Batch job submission failed: Memory required by task is not available
| 
| $ sbatch -N 1 -n 76 -c 1 -p short --mem-per-cpu=4000M test.batch
| sbatch: error: Batch job submission failed: Memory required by task is not available
| 
| $ sbatch -n 1 -c 76 -p short --mem-per-cpu=4000M test.batch
| sbatch: error: Batch job submission failed: Memory required by task is not available
`


But with this submission Slurm is happy:

,
| $ sbatch -n 76 -c 1 -p short --mem-per-cpu=4000M test.batch
| Submitted batch job 133982
`

and the slurmjobcomp.log file does indeed tell me that the memory went
above MaxMemPerNode:

,
| JobId=133982 UserId=..(10487) GroupId=domain users(2000) Name=test
|   JobState=CANCELLED Partition=short TimeLimit=45 StartTime=2024-09-04T09:11:17
|   EndTime=2024-09-04T09:11:24 NodeList=.. NodeCnt=1 ProcCnt=76 WorkDir=/tmp/.
|   ReservationName= Tres=cpu=76,mem=304000M,node=1,billing=76 Account=ddgroup
|   QOS=domino WcKey= Cluster=.. SubmitTime=2024-09-04T09:11:17
|   EligibleTime=2024-09-04T09:11:17 DerivedExitCode=0:0 ExitCode=0:0
`


What is the best way to report issues like this to the Slurm developers?
I thought of adding it to https://support.schedmd.com/, but it is not
clear to me whether that page is only meant for Slurm users with a
Support Contract.

Cheers,
-- 
Ángel de Vicente  
 Research Software Engineer (Supercomputing and BigData)
 Instituto de Astrofísica de Canarias (https://www.iac.es/en)




[slurm-users] Configuration for nodes with different TmpFs locations and TmpDisk sizes

2024-09-04 Thread Jake Longo via slurm-users
Hi,

We have a number of machines in our compute cluster that have larger disks
available for local data. I would like to add them to the same partition as
the rest of the nodes but assign them a larger TmpDisk value which would
allow users to request a larger tmp and land on those machines.

The main hurdle is that (for reasons beyond my control) the larger local
disks are on a special mount point, /largertmp, whereas the rest of the
compute cluster uses the vanilla /tmp. I can't see an obvious way to make
this work, as the TmpFs value appears to be global only, and attempting to
set TmpDisk to a value larger than the filesystem at TmpFs will put those
machines into an invalid state.
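
One workaround I have been sketching (only a sketch, with made-up node names
and sizes, and it assumes a bind mount on those machines is acceptable) is to
make /tmp itself live on the larger disk on the big-disk nodes, so the global
TmpFs=/tmp stays valid while TmpDisk differs per node:

    # On the big-disk nodes only, back /tmp with the large disk,
    # e.g. via an /etc/fstab bind mount:
    #   /largertmp  /tmp  none  bind  0 0

    # slurm.conf (other NodeName parameters omitted): TmpFs stays global,
    # TmpDisk is given in MB per NodeName line
    TmpFs=/tmp
    NodeName=node[01-60]    TmpDisk=102400     # regular nodes, ~100 GB in /tmp
    NodeName=bignode[01-04] TmpDisk=1024000    # big-disk nodes, ~1 TB in /tmp

Jobs could then request the larger space with something like "sbatch --tmp=500G",
which should only be satisfiable on the big-disk nodes.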

I couldn't see any similar support tickets or anything in the mail archive
but I wouldn't have thought it would be that unusual to do this.

Thanks in advance!
Jake



[slurm-users] Re: Bug? sbatch not respecting MaxMemPerNode setting

2024-09-04 Thread Brian Andrus via slurm-users

Angel,

Unless you are using cgroups and constraints, there is no limit imposed.
The numbers are used by Slurm to track what is available, not what you
may or may not use. So you could tell Slurm the node only has 1GB and it
would not let you request more than that, but if you do request only 1GB
then, without specific configuration, there is nothing stopping you from
using more than that.
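
For reference, the usual way to actually enforce the requested memory is via
the cgroup plugins; a rough sketch of the relevant settings (not a drop-in
config, and the exact options depend on your Slurm version and cgroup setup):

    # slurm.conf
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup
    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core_Memory

    # cgroup.conf
    ConstrainCores=yes
    ConstrainRAMSpace=yes

With ConstrainRAMSpace=yes, a step that exceeds its requested memory is
constrained by the kernel cgroup rather than just being tracked on paper.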


Your request did not exceed what Slurm sees as available (1 CPU using
4GB), so it is happy to let your script run. I suspect that if you look
at the usage, you will see that 1 CPU spiked high while the others did
nothing.


Brian Andrus

On 9/4/2024 1:37 AM, Angel de Vicente via slurm-users wrote:

Hello,

We found an issue with Slurm 24.05.1 and the MaxMemPerNode setting.
Slurm is installed on a single workstation, and thus the number of
nodes is just 1.

[...]
