On Sat, 25 Jul 2020 2:00am, Chris Samuel wrote:
> On Friday, 24 July 2020 9:48:35 AM PDT Paul Raines wrote:
>> But when I run a job on the node it runs I can find no
>> evidence in cgroups of any limits being set
>>
>> Example job:
>>
>> mlscgpu1[0]:~$ salloc -n1 -c3 -p batch --gres=gpu:quadro_rtx_6000:1 --mem=1G
>> salloc: Granted job allocation 17
>> mlscgpu1[0]:~$ echo $$
>> 137112
>> mlscgpu1[0]:~$
> You're not actually running inside a job at that point unless you've defined
> "SallocDefaultCommand" in your slurm.conf, and I'm guessing that's not the
> case there. You can make salloc fire up an srun for you in the allocation
> using that option, see the docs here:
>
> https://slurm.schedmd.com/slurm.conf.html#OPT_SallocDefaultCommand

Thank you so much.  This also explains the missing CUDA_VISIBLE_DEVICES
problem from my previous post.
As a new SLURM admin, I am a bit surprised at this default behavior.
Seems like a way for users to game the system by never running srun.
I suppose the only limit really being enforced at that point is walltime?
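
If I follow correctly, I should be able to see the difference from inside
the salloc shell with something like this (just my rough understanding of
how the cgroup placement works, not yet tested here):

cat /proc/self/cgroup    # the salloc shell itself: no slurm job cgroup expected
srun --pty /bin/bash     # launch an interactive step inside the allocation
cat /proc/self/cgroup    # inside the step: should now show slurm/uid_*/job_* paths
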
I guess I need to research srun and SallocDefaultCommand more, but is
there some way to set a separate walltime limit covering just the time
between salloc granting the allocation and the user actually running
srun?  It is also not clear to me whether one can write a
SallocDefaultCommand that does "srun ..." in a way that really covers
all possibilities.
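
For concreteness, what I have in mind for SallocDefaultCommand is
something along these lines in slurm.conf (only a sketch on my part,
built from srun options I know of, not a form I have verified as the
recommended one):

# slurm.conf -- illustrative only; the exact srun options need local testing
SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --mpi=none $SHELL"
# on GPU nodes, how this default step interacts with --gres requests is
# something I would still need to work out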