[slurm-users] Re: Best practices for tracking jobs started across multiple clusters for accounting purposes.

2024-08-30 Thread Di Bernardini, Fabio via slurm-users
For example, if a job has to use two different clusters with Slurm, I am forced to
launch it with two sbatch commands:

sbatch -M cluster1 job1
sbatch -M cluster2 job2

This way I get two different job IDs. Using sacct, I have not found a way to tell
that the two jobs were launched within the same workflow.
I was hoping not to have to add other components such as Nextflow.
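Even without an extra workflow manager, the limitation is visible directly from
sacct (the job IDs below are hypothetical): each cluster returns its record
independently, and nothing in the accounting data ties the two to a common workflow:

sacct -M cluster1 -j 101 --format=JobID,Cluster,JobName,Elapsed
sacct -M cluster2 -j 202 --format=JobID,Cluster,JobName,Elapsed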

From: David 
Sent: Thursday, August 29, 2024 2:53 PM
To: Di Bernardini, Fabio 
Cc: slurm-users@lists.schedmd.com
Subject: RE: [EXTERNAL] [slurm-users] Best practices for tracking jobs started 
across multiple clusters for accounting purposes.


Hello,

What is meant here by "tracking"? What information are you looking to gather 
and track?

I'd say the simplest answer is using sacct, but I am not sure how 
federated/non-federated setups come into play while using it.

David

On Tue, Aug 27, 2024 at 6:23 AM Di Bernardini, Fabio via slurm-users 
<slurm-users@lists.schedmd.com> wrote:
I need to account for jobs composed of multiple jobs launched on multiple 
federated (and non-federated) clusters, which therefore have different job IDs. 
What are the best practices to prevent users from bypassing this tracking?







--
David Rhey
---
Advanced Research Computing
University of Michigan






-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: playing with --nodes=

2024-08-30 Thread Brian Andrus via slurm-users


Looks like it is not doing what you think it should. It does state:

If the number of tasks is given and a number of requested nodes is also 
given, the number of nodes used from that request will be reduced to 
match that of the number of tasks if the number of nodes in the request 
is greater than the number of tasks.


Your number of nodes was reduced because you only requested 3 tasks.

You are correct that it does not specify in detail what more than one 
value would do beyond making it a min/max range. The way I read it, 
you can put in multiple values and the other value(s) would act as a 
maximum node count. The "size_string" parsing looks similar to the code 
used for job arrays. Nowhere does it say what effect that has for --nodes.


You can either figure out what you actually want to do and do that, or 
open a bug with SchedMD asking them to add some clarity to the documentation. 
They are more than happy to do that.
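
In the meantime, a sketch (not verified on your cluster) using the documented
min-max range form rather than the comma form: requesting at least as many tasks
as the maximum node count avoids the reduction quoted above:

#SBATCH --nodes=2-4
#SBATCH --ntasks=4   # not fewer than the maximum node count, so the range is not reduced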


Brian Andrus


On 8/29/2024 11:48 PM, Matteo Guglielmi via slurm-users wrote:

I'm sorry, but I still don't get it.


Isn't --nodes=2,4 telling Slurm to allocate either 2 or 4 nodes and nothing else?


So, if:


--nodes=2 allocates only two nodes

--nodes=4 allocates only four nodes

--nodes=1-2 allocates min one and max two nodes

--nodes=1-4 allocates min one and max four nodes


what is the allocation rule for --nodes=2,4, i.e. the so-called size_string
specification?


man sbatch says:


Node count can also be specified as size_string. The size_string specification
identifies what nodes values should be used. Multiple values may be specified
using a comma separated list or with a step function by suffix containing a
colon and number values with a "-" separator. For example, "--nodes=1-15:4"
is equivalent to "--nodes=1,5,9,13".

...

The job will be allocated as many nodes as possible within the range specified
and without delaying the initiation of the job.


From: Brian Andrus via slurm-users
Sent: Thursday, August 29, 2024 7:27:44 PM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Re: playing with --nodes=


It looks to me that you requested 3 tasks spread across 2 to 4 nodes. Realize that
--nodes is not targeting nodes named 2 and 4; it is a count of how many nodes to
use. You only needed 3 tasks/CPUs, so that is what you were allocated, and with 1
CPU per node you get 3 (of up to 4) nodes. Slurm does not give you 4 nodes because
you only want 3 tasks.

You see the result in your variables:

SLURM_NNODES=3
SLURM_JOB_CPUS_PER_NODE=1(x3)



If you only want 2 nodes, make --nodes=2.
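
For example, a minimal correction of the job header from the original post (a
sketch only, assuming the nodes have enough CPUs to pack 3 tasks onto 2 of them):

#SBATCH --ntasks=3
#SBATCH --nodes=2
#SBATCH --constraint="[intel|amd]"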

Brian Andrus

On 8/29/24 08:00, Matteo Guglielmi via slurm-users wrote:

Hi,


On sbatch's manpage there is this example for <size_string>:


--nodes=1,5,9,13


so either one specifies <minnodes>[-maxnodes] OR <size_string>.


I checked the logs, and there are no reported errors about wrong or ignored 
options.


MG


From: Brian Andrus via slurm-users
Sent: Thursday, August 29, 2024 4:11:25 PM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Re: playing with --nodes=


Your --nodes line is incorrect:

-N, --nodes=<minnodes>[-maxnodes]|<size_string>
Request that a minimum of minnodes nodes be allocated to this job. A maximum
node count may also be specified with maxnodes.

Looks like it ignored that and used ntasks with ntasks-per-node as 1, giving
you 3 nodes. Check your logs and check your conf to see what your defaults are.
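
For instance (a sketch, assuming shell access to a machine with scontrol):

# dump the running configuration and partition settings to see the defaults in effect
scontrol show config | grep -i def
scontrol show partition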

Brian Andrus


On 8/29/2024 5:04 AM, Matteo Guglielmi via slurm-users wrote:

Hello,

I have a cluster with four Intel nodes (node[01-04], Feature=intel) and four 
Amd nodes (node[05-08], Feature=amd).

# job file

#SBATCH --ntasks=3
#SBATCH --nodes=2,4
#SBATCH --constraint="[intel|amd]"


env | grep SLURM


# slurm.conf


PartitionName=DEFAULT  MinNodes=1 MaxNodes=UNLIMITED


# log


SLURM_JOB_USER=software
SLURM_TASKS_PER_NODE=1(x3)
SLURM_JOB_UID=1002
SLURM_TASK_PID=49987
SLURM_LOCALID=0
SLURM_SUBMIT_DIR=/home/software
SLURMD_NODENAME=node01
SLURM_JOB_START_TIME=1724932865
SLURM_CLUSTER_NAME=cluster
SLURM_JOB_END_TIME=1724933465
SLURM_CPUS_ON_NODE=1
SLURM_JOB_CPUS_PER_NODE=1(x3)
SLURM_GTIDS=0
SLURM_JOB_PARTITION=nodes
SLURM_JOB_NUM_NODES=3
SLURM_JOBID=26
SLURM_JOB_QOS=lprio
SLURM_PROCID=0
SLURM_NTASKS=3
SLURM_TOPOLOGY_ADDR=node01
SLURM_TOPOLOGY_ADDR_PATTERN=node
SLURM_MEM_PER_CPU=0
SLURM_NODELIST=node[01-03]
SLURM_JOB_ACCOUNT=dalco
SLURM_PRIO_PROCESS=0
SLURM_NPROCS=3
SLURM_NNODES=3
SLURM_SUBMIT_HOST=master
SLURM_JOB_ID=26
SLURM_NODEID=0
SLURM_CONF=/etc/slurm/slurm.conf
SLURM_JOB_NAME=mpijob
SLURM_JOB_GID=1002

SLURM_JOB_NODELIST=node[01-03] <<<=== why three nodes? Shouldn't this still be 
two nodes?

Thank you.







-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Best practices for tracking jobs started across multiple clusters for accounting purposes.

2024-08-30 Thread Laura Hild via slurm-users
Can whatever is running those sbatch commands add a --comment with a shared 
identifier that AccountingStoreFlags=job_comment would make available in sacct?
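
A minimal sketch of that approach (cluster names and the workflow tag are
hypothetical; assumes AccountingStoreFlags=job_comment is set in slurm.conf):

WORKFLOW_ID="wf-$(date +%s)"
sbatch -M cluster1 --comment="$WORKFLOW_ID" job1
sbatch -M cluster2 --comment="$WORKFLOW_ID" job2

# later, pull both jobs back out of accounting via the shared comment
sacct -M cluster1,cluster2 -X --format=JobID,Cluster,Comment,Elapsed | grep "$WORKFLOW_ID"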


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] QOS MaxTRESPU node=X interpretation

2024-08-30 Thread David Magda via slurm-users
Hello,

I have a question on how to interpret a node=X MaxTRESPU value for a QOS:

If (e.g.) X=4, and each node has (say) 64 CPUs or cores: if a particular job 
needs 32 cores, then would two jobs count as the equivalent of one node 
(2*32=64)? And if X=4 for the QOS, would that mean that eight of these jobs 
would be able to run ((4*64)/32=8)?

Or does this TRES limit only get activated when -N/--nodes is used?
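
For concreteness, a sketch of the kind of per-user QOS limit being asked about
(the QOS name "normal" is only an example):

# limit each user in the QOS to the equivalent of 4 nodes' worth of TRES
sacctmgr modify qos normal set MaxTRESPerUser=node=4

# confirm the setting
sacctmgr show qos normal format=Name,MaxTRESPU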

Don't see anything obvious:

   https://slurm.schedmd.com/resource_limits.html
   https://slurm.schedmd.com/sacctmgr.html
   https://slurm.schedmd.com/tres.html

Thanks for any info.

Regards,
David

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com