On 26/8/24 8:40 am, Di Bernardini, Fabio via slurm-users wrote:
> Hi everyone, for accounting reasons, I need to create only one job
> across two or more federated clusters with two or more srun steps.
The limitations for heterogeneous jobs say:
https://slurm.schedmd.com/heterogeneous_jobs.html#limitations
> In a federation of clusters, a heterogeneous job will execute
> entirely on the cluster from which the job is submitted. The
> heterogeneous job will not be eligible to migrate between clusters
> or to have different components of the job execute on different
> clusters in the federation.
However, from your script it's not clear to me that that's what you
mean, because you pass multiple --cluster options. I'm not sure whether
that works; as you mention, the docs don't cover that case. They do
say, though, that:
> If a heterogeneous job is submitted to run in multiple clusters not
> part of a federation (e.g. "sbatch --cluster=alpha,beta ...") then
> the entire job will be sent to the cluster expected to be able to
> start all components at the earliest time.
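For reference, here is a rough sketch of the kind of submission that
passage describes. The cluster names alpha and beta come from the docs
example; the script name, job name and task counts are made up for
illustration, and I haven't tried this against a federation.

```shell
# Write a two-component heterogeneous batch script (hypothetical names
# and sizes, for illustration only).
cat > het-demo.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=het-demo
#SBATCH --ntasks=1
#SBATCH hetjob
#SBATCH --ntasks=4

# One srun step per component of the heterogeneous job.
srun --het-group=0 hostname
srun --het-group=1 hostname
EOF

# Submission (not executed here):
#   sbatch --clusters=alpha,beta het-demo.sh
```

Going by the passage quoted above, a submission like that sends the
whole job, both components, to just one of the two clusters.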
My gut instinct is that this isn't going to work. Launching a
heterogeneous job like this would require the slurmctlds on each
cluster to coordinate, and I'm not aware of that being possible
currently.
All the best,
Chris