Hi Alan,

Your topic is indeed my PhD thesis (defended last November). It consists of building autoscaling HPC infrastructure in the cloud, from a compute node provisioning point of view. In this work I show that the default Kubernetes controllers are not well designed for autoscaling containerized HPC clusters [1], and I wrote a very basic K8s controller for OAR [2] (another scheduler, developed at INRIA). This controller deserves a rewrite, it's only a proof of concept ^^.

My guess is that you have nesting issues with cgroup/v2 inside your containerized compute nodes (the ones that run slurmd)? If that's the case, maybe you can use Kata Containers [3] with CRI-O as the container engine [4] (I did it in 2021). The main asset of Kata Containers is that it can use KVM.
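For reference, wiring Kata into Kubernetes is usually done through a RuntimeClass object; here is a minimal sketch, assuming the runtime handler configured in CRI-O is named `kata` (the handler name must match whatever your node's runtime config declares):

```yaml
# RuntimeClass mapping pods to the Kata runtime handler.
# The handler name "kata" is an assumption; it must match the
# runtime handler configured in CRI-O on your nodes.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata
```

A pod then opts in by setting `runtimeClassName: kata` in its spec.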

The consequence is that your CPUs are "real" vCPUs, not a quota enforced by cgroups. I noticed that with Kata, my slurmd containers had a /proc/cpuinfo and nproc that reflected the limits set in my K8s manifest. I didn't go deep with Kata because I focused on the "controller" aspect of things, but using KVM may hide the nesting of cgroups from slurmd?
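To illustrate the point above, a sketch of what such a slurmd pod spec could look like with CPU limits under Kata (the pod name, image, and memory limit are hypothetical, only for illustration):

```yaml
# Hypothetical slurmd pod: with runtimeClassName "kata", the CPU limit
# below is backed by KVM vCPUs, so nproc inside the container reports 4
# rather than exposing the host's cgroup hierarchy to slurmd.
apiVersion: v1
kind: Pod
metadata:
  name: slurmd-node-0
spec:
  runtimeClassName: kata
  containers:
  - name: slurmd
    image: my-registry/slurmd:23.11   # hypothetical image
    resources:
      limits:
        cpu: "4"
        memory: 8Gi
```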

I hope this helps!

Kind regards,

[1] Nicolas Greneche, Christophe Cérin and Tarek Menouer, "A Methodology to Scale Containerized HPC Infrastructures in the Cloud", Euro-Par 2022.

[2] Nicolas Greneche, Christophe Cérin, "Autoscaling of Containerized HPC Clusters in the Cloud", SuperCompCloud: 6th Workshop on Interoperability of Supercomputing and Cloud Technologies (held in conjunction with SC'22).

[3] https://katacontainers.io

[4] https://github.com/kata-containers/documentation/blob/master/how-to/run-kata-with-k8s.md

Le 13/03/2024 à 09:06, LEAVY Alan via slurm-users a écrit :
I’m a little late to this party but would love to establish contact with others using slurm in Kubernetes.

I recently joined a research institute in Vienna (IIASA) and I’m getting to grips with slurm and Kubernetes (my previous role was data engineering / fintech). My current setup sounds like what Urban described in this thread, back in Nov 22. It has some rough edges though.

Right now, I’m trying to upgrade to slurm-23.11.4 in Ubuntu 23.10 containers. I’m having trouble with the cgroup/v2 plugin.

Are you still using Slurm on K8s, Urban? How did your installation work out, Hans?
Would either of you be willing to share your experiences?

Regards,

     Alan.




--
Nicolas Greneche
USPN / DSI
Support à la recherche / RSSI Suppléant
https://www-magi.univ-paris13.fr

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
