I have some users running Ray on Slurm.
I'll preface this by saying we are new Slurm users, so we may not be doing
everything exactly right.

The only issue we have come across so far was somewhat Ray-specific.
Specifically (and pardon the lack of specificity, the Ray user I worked on this
with is on vacation at the moment), there was an environment variable that
needed to be unset so that Ray wouldn't kneecap itself when it hit a cpuset
corner case in cgroup fencing.

In this workload, the user spawns a "ray head", and it's important to mention
that this head may not have the same resources allocated to it as the "ray
workers". TL;DR: the ray head would be given fewer CPUs than the worker(s), and
in some corner cases a spawned worker PID would inherit the head's smaller
cpuset via an environment variable passed along when the workers were launched
via srun (a rough sketch of that launch pattern is below).
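
For context, the launch pattern roughly follows the Ray Slurm template that
Kamil linked below. This is only a simplified sketch of that pattern, not our
exact script; the node count, port, and the placeholder variable name are made
up:

> #!/bin/bash
> #SBATCH --nodes=3
>
> # First allocated node hosts the Ray head, the rest host workers.
> nodes=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))
> head_node=${nodes[0]}
>
> # srun propagates the calling environment to the step by default,
> # which is how a variable set around the head can leak into the
> # workers' environment.
> srun --nodes=1 --ntasks=1 -w "$head_node" \
>     ray start --head --port=6379 --block &
> sleep 10
>
> # Our fix was to unset the offending variable before launching the
> # workers; the name here is a placeholder until I can confirm it.
> # unset SOME_RAY_ENV_VAR
> for node in "${nodes[@]:1}"; do
>     srun --nodes=1 --ntasks=1 -w "$node" \
>         ray start --address="$head_node:6379" --block &
> done
> wait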

The user noticed that some workers could reach 100% utilization of their
allocated CPU resources, while other workers running identical workloads would
end up at partial usage. We traced that back to the cpuset being inherited in a
way we didn't intend.
I'll have to follow up with the exact environment variable we had to unset once
that user is back.
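
As a quick spot check before running the full script below, you can compare a
suspect worker PID's affinity against its job's cgroup cpuset. The PID and job
ID here are taken from the example output further down; the uid is a
placeholder:

> # affinity actually applied to the worker process
> taskset -cp 7912
>
> # CPUs the Slurm job cgroup grants (cgroup v1 path)
> cat /sys/fs/cgroup/cpuset/slurm/uid_1000/job_409/cpuset.cpus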

In the meantime, here is my quick-and-dirty bash script that shows the CPUs
allocated to each job's cgroup and the CPU affinity of the PIDs inside that
cgroup. The two should match, but didn't always, which was our discovery.
Just pass it the UID of the user submitting the jobs.

> #!/bin/bash
> # Usage: $0 <uid of the user submitting the jobs>
> # (bash reserves $UID as read-only, so use a different variable name)
> JOB_UID=$1
>
> for JOB in $(ls /sys/fs/cgroup/cpuset/slurm/uid_$JOB_UID/ | grep '^job_' | awk -F'_' '{print $2}')
>     do
>         echo "Slurm JobID: $JOB"
>         echo -n "Cgroup CPU set: "
>         cat /sys/fs/cgroup/cpuset/slurm/uid_$JOB_UID/job_$JOB/cpuset.cpus
>
>         for PID in $(cat /sys/fs/cgroup/cpuset/slurm/uid_$JOB_UID/job_$JOB/step_0/cgroup.procs)
>             do
>                 echo -n "CPUs allocated for PID $PID: "
>                 awk '/Cpus_allowed_list/ {print $2}' /proc/$PID/status
>             done
>         echo ""
>     done
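
The per-node output below was gathered by running the script on each compute
node; something like pdsh makes that easy, though any per-node method works
(the script path and uid here are placeholders, the node names are from the
example output):

> pdsh -w slurmd[1-3] '/path/to/check-script.sh 1000'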


> slurmd3:
>     Slurm Job: 409
>     Cgroup CPU set: 0-7
>     CPUs allocated for PID 7907: 0-7
>     CPUs allocated for PID 7912: 0-3
>     CPUs allocated for PID 7931: 0-3
> slurmd1:
>     Slurm Job: 406
>     Cgroup CPU set: 0-3
>     CPUs allocated for PID 7409: 0-3
>     CPUs allocated for PID 7414: 0-3
>     CPUs allocated for PID 7425: 0-3
> slurmd2:
>     Slurm Job: 408
>     Cgroup CPU set: 0-7
>     CPUs allocated for PID 7491: 0-7
>     CPUs allocated for PID 7496: 0-3
>     CPUs allocated for PID 7515: 0-3

But otherwise I've not had issues with users spawning jobs from within jobs,
though I'm not a seasoned Slurm admin, so that may not hold up for others.

Reed

> On Jul 15, 2022, at 4:17 AM, Kamil Wilczek <km...@mimuw.edu.pl> wrote:
> 
> Dear Slurm Users,
> 
> one of my cluster users would like to run a Ray cluster on Slurm.
> I noticed that the batch script example requires running the "srun"
> command on a compute node, which already is allocated:
> https://docs.ray.io/en/latest/cluster/examples/slurm-template.html#slurm-template
> 
> This is the first time I see or hear about this type of usage
> and I have problems wrapping my head around this.
> Is there anything wrong or unusual about this? I understand that
> this would allocate some resources on other nodes. Would
> Slurm enforce limits properly ("qos" or "partition" limits)?
> 
> Kind Regards
> -- 
> Kamil Wilczek  [https://keys.openpgp.org/]
> [D415917E84B8DA5A60E853B6E676ED061316B69B]
