Hi Alan,
Your topic is indeed my PhD thesis (defended last November). It consists
of building autoscaling HPC infrastructure in the cloud (from a compute
node provisioning point of view). In this work I show that Kubernetes'
default controllers are not well designed for autoscaling containerized
HPC clusters [1], and I wrote a very basic K8s controller for OAR [2]
(another scheduler developed at INRIA). This controller deserves a
rewrite; it's only a proof of concept ^^.
My guess is that you have nesting issues with cgroup/v2 inside your
containerized compute nodes (those that run slurmd)? If that's the case,
maybe you can use Kata Containers [3] instead of CRI-O as the container
runtime [4] (I did this in 2021). The main asset of Kata Containers is
that it can use KVM.
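For reference, pods are usually steered onto Kata through a RuntimeClass. A minimal sketch, assuming the kata runtime is installed on your nodes and registered under the handler name "kata" (the handler name and the image/pod names below are hypothetical and depend on your install):

```yaml
# RuntimeClass mapping the name "kata" to the kata handler (assumption:
# your CRI is configured with a runtime handler of that name).
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata
---
# A compute-node pod opting into the Kata runtime.
apiVersion: v1
kind: Pod
metadata:
  name: slurmd-node            # hypothetical name
spec:
  runtimeClassName: kata
  containers:
  - name: slurmd
    image: my-slurmd:latest    # hypothetical image
    resources:
      limits:
        cpu: "4"
        memory: 8Gi
```

With Kata, those limits size the lightweight VM instead of only setting a cgroup quota.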
The consequence is that your CPUs are "real" vCPUs and not a quota
enforced by cgroups. I noticed that with Kata, my slurmd containers had
a cpuinfo / nproc that reflected the limits set in my K8s manifest. I
didn't go deep with Kata because I focused on the "controller" aspect of
things. But using KVM may hide the nesting of cgroups from slurmd?
I hope this helps!
Kind regards,
[1] A Methodology to Scale Containerized HPC Infrastructures in the
Cloud. Nicolas Greneche, Christophe Cérin and Tarek Menouer. Euro-Par 2022.
[2] Autoscaling of Containerized HPC Clusters in the Cloud. Nicolas
Greneche, Christophe Cérin. SuperCompCloud: 6th Workshop on
Interoperability of Supercomputing and Cloud Technologies (held in
conjunction with SC'22).
[3] https://katacontainers.io
[4]
https://github.com/kata-containers/documentation/blob/master/how-to/run-kata-with-k8s.md
On 13/03/2024 at 09:06, LEAVY Alan via slurm-users wrote:
I’m a little late to this party but would love to establish contact with
others using slurm in Kubernetes.
I recently joined a research institute in Vienna (IIASA) and I’m getting
to grips with slurm and Kubernetes (my previous role was data
engineering / fintech). My current setup sounds like what Urban
described in this thread, back in Nov 22. It has some rough edges though.
Right now, I’m trying to upgrade to slurm-23.11.4 in Ubuntu 23.10
containers. I’m having trouble with the cgroup/v2 plugin.
Are you still using slurm on K8s Urban? How did your installation work
out Hans?
Would either of you be willing to share your experiences?
Regards,
Alan.
--
Nicolas Greneche
USPN / DSI
Support à la recherche / RSSI Suppléant
https://www-magi.univ-paris13.fr
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com