Hello,
I have a test cluster consisting of two nodes, one acting as the controller and
the other as a compute node. I followed all the steps in the Slurm documentation
and I want to run jobs as containers, but I get the following error when running
"podman run hello-world" on the controller node:

time="2024-08-06T12:02:54+02:00" level=warning msg="freezer not supported: openat2 /sys/fs/cgroup/system.slice/slurmstepd.scope/job_332/step_0/user/arlvm6.ara.332.0.0/cgroup.freeze: no such file or directory"
srun: error: arlvm6: task 0: Exited with exit code 1
time="2024-08-06T12:02:54+02:00" level=warning msg="lstat /sys/fs/cgroup/system.slice/slurmstepd.scope/job_332/step_0/user/arlvm6.ara.332.0.0: no such file or directory"
time="2024-08-06T12:02:54+02:00" level=error msg="runc run failed: unable to start container process: unable to apply cgroup configuration: rootless needs no limits + no cgrouppath when no permission is granted for cgroups: mkdir /sys/fs/cgroup/system.slice/slurmstepd.scope/job_332/step_0/user/arlvm6.ara.332.0.0: permission denied"

On the compute node I checked that the path
/sys/fs/cgroup/system.slice/slurmstepd.scope/ exists, but it looks like
job_332/step_0/user/arlvm6.ara.332.0.0 could not be created beneath it.
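
In case it helps anyone repeat the check above, here is a minimal Python sketch
that walks up from the path in the error message to the deepest directory that
actually exists and reports whether the current user may create entries below
it. The function names (deepest_existing_ancestor, report) are my own and not
part of Slurm or runc, and the result only reflects ordinary file permissions
as seen by the user running the script on the compute node.

```python
import os
from pathlib import Path


def deepest_existing_ancestor(path: str) -> Path:
    """Return the deepest directory on `path` that actually exists."""
    current = Path(path)
    while not current.exists():
        if current.parent == current:  # reached the filesystem root
            break
        current = current.parent
    return current


def report(path: str) -> dict:
    """Summarise where the cgroup path stops and whether we may create below it."""
    ancestor = deepest_existing_ancestor(path)
    info = {
        "requested": path,
        "deepest_existing": str(ancestor),
        "fully_present": ancestor == Path(path),
        # mkdir needs write + search permission on the parent directory
        "can_create_children": os.access(ancestor, os.W_OK | os.X_OK),
        "subtree_control": None,
    }
    # On cgroup v2 this file lists the controllers enabled for child cgroups.
    subtree = ancestor / "cgroup.subtree_control"
    try:
        if subtree.is_file():
            info["subtree_control"] = subtree.read_text().split()
    except OSError:
        pass  # unreadable for this user; leave as None
    return info


if __name__ == "__main__":
    # Path copied from the runc error message in this post.
    target = ("/sys/fs/cgroup/system.slice/slurmstepd.scope/"
              "job_332/step_0/user/arlvm6.ara.332.0.0")
    for key, value in report(target).items():
        print(f"{key}: {value}")
```

Running it as the job user on the compute node shows at which level the
hierarchy ends and whether that user is allowed to create the missing cgroup.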

The cgroup.conf:

CgroupPlugin=cgroup/v2
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes

This is the oci.conf:

EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
RunTimeQuery="runc --rootless=true --root=/run/user/1223609544/ state %n.%u.%j.%s.%t"
RunTimeKill="runc --rootless=true --root=/run/user/1223609544/ kill -a %n.%u.%j.%s.%t SIGKILL"
RunTimeDelete="runc --rootless=true --root=/run/user/1223609544/ delete --force %n.%u.%j.%s.%t"
RunTimeRun="runc --rootless=true --root=/run/user/1223609544/ run %n.%u.%j.%s.%t -b %b"
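
To show how the container ID in the error path relates to these patterns, here
is a small Python sketch that expands the %-specifiers. It assumes the meanings
I understand from the oci.conf man page (%n node name, %u user name, %j job ID,
%s step ID, %t task ID, %b bundle path). The function expand_oci_pattern and the
EXAMPLE values are only illustrative: the user name "ara" is inferred from the
path in the log, and the bundle path is a placeholder.

```python
def expand_oci_pattern(template: str, values: dict[str, str]) -> str:
    """Replace each %<char> in `template` with values[<char>].

    Specifiers that are not in `values` are left untouched.
    """
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        if ch == "%" and i + 1 < len(template):
            key = template[i + 1]
            out.append(values.get(key, "%" + key))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# Values taken or inferred from the log output in this post.
EXAMPLE = {
    "n": "arlvm6",           # node name, from the srun error line
    "u": "ara",              # user name, inferred from the cgroup path
    "j": "332",              # job ID, from job_332
    "s": "0",                # step ID, from step_0
    "t": "0",                # task ID, from "task 0"
    "b": "/path/to/bundle",  # placeholder, not a real bundle path
}

if __name__ == "__main__":
    run = ("runc --rootless=true --root=/run/user/1223609544/ "
           "run %n.%u.%j.%s.%t -b %b")
    print(expand_oci_pattern("%n.%u.%j.%s.%t", EXAMPLE))
    print(expand_oci_pattern(run, EXAMPLE))
```

With these values the ID pattern expands to arlvm6.ara.332.0.0, which is the
last component of the cgroup path that runc fails to create.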

As you can see, I changed the kill command slightly, because without the
SIGKILL parameter it could not kill the containers. I tested the OCI runtime
again on both the controller and compute nodes, and I think it might be helpful
to mention two points:
- The delete command does not work, because once the container has been killed
  there is nothing left to delete, at least in my tests.
- There are no pause and resume commands in oci.conf, but I tested them and got
  the same errors about freezer support and cgroup permissions.

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com