Hello, I am trying to set up a Slurm cluster running on Kubernetes, using Slurm version 17.02.6.
The issue I am facing is that when a job starts, the scheduler allocates resources, but a job step is never created. If the job is launched with srun, it hangs; if it is submitted with sbatch, it runs, but if slurmctld is restarted in the meantime, the job is killed.

I suspect this might be a networking issue, although the pods can all reach each other. Bear in mind, though, that connections between the Kubernetes pods go through a proxy, so incoming connections to the Slurm controller appear to come from one of the physical nodes of the cluster and a random port, rather than from the worker pod directly. Could this have something to do with the issue?
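From what I have read, srun opens ephemeral listening ports that slurmd on the compute nodes connects back to when launching the step, so a proxy that rewrites source addresses could plausibly break that reverse connection. To see what a pod actually observes about incoming connections, I ran a minimal listener; this is just a diagnostic sketch, and the port is an arbitrary example:

    # listener.py -- run in the pod where srun would run.
    # Prints the source address of each incoming connection so you can
    # see whether the proxy rewrites it.
    import socket

    PORT = 60001  # any free port a compute node could dial back to

    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", PORT))
    srv.listen(1)
    print(f"listening on {PORT}; connect from a worker pod, e.g. nc <this-pod-ip> {PORT}")
    conn, peer = srv.accept()
    print(f"connection arrived from {peer}")  # a physical-node IP and random
    conn.close()                              # port here would confirm the
    srv.close()                               # proxy is rewriting the source

When I connect from a worker pod, the peer address printed is indeed a physical node's IP with a random port, not the worker pod's own address, which is what made me suspect the proxy in the first place.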
