Hello,

I am trying to run a Slurm cluster on Kubernetes, using Slurm version
17.02.6.

The issue I am facing is that when starting a job, the scheduler
allocates resources, but a job step is never created. If the job is
submitted with srun, it hangs; if it is submitted with sbatch, it runs,
but if slurmctld is restarted in the meantime, the job is killed.
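For illustration, even a minimal test like the following shows the
behaviour (the commands here are placeholders, not my actual workload):

    # hangs after allocation; no job step ever appears
    srun -N1 hostname

    # runs, but is killed if slurmctld restarts while it is running
    sbatch --wrap="sleep 120"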

I suspect this might be a networking issue, although the pods can all
reach each other. Bear in mind, though, that connections between the
Kubernetes pods go through a proxy, so incoming connections to the Slurm
controller appear to come from one of the physical nodes of the cluster
on a random port, rather than from the worker directly.
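For example, basic connectivity between the pods checks out in both
directions (the pod names here are placeholders for my actual
hostnames, and 6817/6818 are the default slurmctld/slurmd ports):

    # from the slurmctld pod: can I reach slurmd on a worker?
    nc -zv worker-0 6818

    # from a worker pod: can I reach slurmctld on the controller?
    nc -zv slurmctld-0 6817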

Could this have something to do with the issue?
