Hello,
I haven't played with Slurm in k8s, but I did attend this talk:
https://fosdem.org/2024/schedule/event/fosdem-2024-2590-kubernetes-and-hpc-bare-metal-bros/
It shows that at least someone has managed to do so, and it may be worth
talking to the speaker about it. I wanted to ask her for the code to
reproduce her experiment, but I haven't had the time to do so yet.
Regards,
Sylvain Maret
On 13/03/2024 11:04, Nicolas Greneche via slurm-users wrote:
Hi Alan,
Your topic is indeed the subject of my PhD thesis (defended in late
November). It is about building autoscaling HPC infrastructure in the
cloud (from a compute-node provisioning point of view). In this work I
show that the default Kubernetes controllers are not well designed for
autoscaling containerized HPC clusters [1], and I wrote a very basic K8s
controller for OAR [2] (another scheduler, developed at INRIA). This
controller deserves a rewrite; it's only a proof of concept ^^.
My guess is that you have nesting issues with cgroup/v2 inside your
containerized compute nodes (the ones that run slurmd)? If that's the
case, maybe you can use katacontainers [3] instead of CRI-O as a
container engine [4] (I did this in 2021). The main asset of
katacontainers is that it can use KVM.
The consequence is that your CPUs are "real" vCPUs and not a quota
enforced by cgroups. I noticed that with kata, my slurmd containers had
cpuinfo / nproc output that reflected the limits set in my K8s
manifest. I didn't go deep into kata because I focused on the
"controller" aspect of things. But using KVM may hide the nesting of
cgroups from slurmd?
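
To see which situation a given slurmd container is in, one can compare
the CPU count the container sees with the quota recorded in its cgroup.
The sketch below is only an illustration: it assumes cgroup v2 is
mounted at /sys/fs/cgroup inside the container, and the function names
(parse_cpu_max, describe_cpu_view) are made up for this example.

```python
import os
from pathlib import Path


def parse_cpu_max(text):
    """Return the CPU limit encoded in a cgroup v2 cpu.max string.

    cpu.max holds "<quota> <period>" in microseconds, or "max <period>"
    when no quota is set. Returns a float number of CPUs, or None.
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def describe_cpu_view(cgroup_root="/sys/fs/cgroup"):
    """Summarise visible CPUs versus the cgroup v2 quota, if any."""
    # Assumption: cgroup_root is where cgroup v2 is mounted in the container.
    visible = os.cpu_count()
    cpu_max = Path(cgroup_root) / "cpu.max"
    limit = parse_cpu_max(cpu_max.read_text()) if cpu_max.is_file() else None
    if limit is None:
        return f"visible CPUs: {visible}, no cgroup v2 quota found"
    return f"visible CPUs: {visible}, cgroup v2 quota: {limit:g} CPUs"


if __name__ == "__main__":
    print(describe_cpu_view())
```

If the visible CPU count is the host's and only the quota matches the
manifest, the limit is being enforced by cgroups. If the visible count
itself matches the manifest, as I saw with kata, the container is
probably running on dedicated vCPUs.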
I hope this helps!
Kind regards,
[1] Nicolas Greneche, Christophe Cérin and Tarek Menouer, "A methodology
to scale containerized HPC infrastructures in the Cloud", Euro-Par 2022
[2] Nicolas Greneche and Christophe Cérin, "Autoscaling of Containerized
HPC Clusters in the Cloud", SuperCompCloud: 6th Workshop on
Interoperability of Supercomputing and Cloud Technologies (held in
conjunction with SC'22)
[3] https://katacontainers.io
[4]
https://github.com/kata-containers/documentation/blob/master/how-to/run-kata-with-k8s.md
On 13/03/2024 09:06, LEAVY Alan via slurm-users wrote:
I’m a little late to this party but would love to establish contact with
others using slurm in Kubernetes.
I recently joined a research institute in Vienna (IIASA) and I’m getting
to grips with slurm and Kubernetes (my previous role was data
engineering / fintech). My current setup sounds like what Urban
described in this thread, back in Nov 22. It has some rough edges
though.
Right now, I’m trying to upgrade to slurm-23.11.4 in Ubuntu 23.10
containers. I’m having trouble with the cgroup/v2 plugin.
Are you still using slurm on K8s, Urban? How did your installation work
out, Hans?
Would either of you be willing to share your experiences?
Regards,
Alan.
--
Nicolas Greneche
USPN / DSI
Research support / Deputy CISO
https://www-magi.univ-paris13.fr
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com