Re: [slurm-users] [External] Re: What is an easy way to prevent users from running programs on the master/login node.

2021-05-19 Thread Marcus Wagner
Hi Prentice, you are right, and I looked into the wrapper script (not my part; I never did anything in that thing). In fact, the MPI processes are spawned on the backend nodes; the only process remaining on the login/frontend node is the spawner process. The wrapper checks the load on the nodes a…

[slurm-users] nodes going to down* and getting stuck in that state

2021-05-19 Thread Herc Silverstein
Hi, We have a cluster (in Google Cloud) with a few partitions set up to auto-scale, but one partition is set up not to autoscale. The desired state is for all of the nodes in this non-autoscaled partition (SuspendExcParts=gpu-t4-4x-ondemand) to continue running uninterrupted. However, we…
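For reference, SuspendExcParts tells Slurm's power-save logic never to suspend nodes in the listed partitions, which is how a non-autoscaled partition is kept running alongside autoscaled ones. A minimal slurm.conf sketch; only the partition name comes from the message, and the scripts and timing values are illustrative assumptions:

```
# slurm.conf power-save excerpt -- values below are illustrative assumptions
SuspendProgram=/usr/local/sbin/node_suspend.sh   # hypothetical site script
ResumeProgram=/usr/local/sbin/node_resume.sh     # hypothetical site script
SuspendTime=300            # power down a node after 5 idle minutes
ResumeTimeout=600
# Never power down nodes in this partition (name from the message):
SuspendExcParts=gpu-t4-4x-ondemand
```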

Re: [slurm-users] inconsistent CUDA_VISIBLE_DEVICES with srun vs sbatch

2021-05-19 Thread Tim Carlson
As a follow-up, we did figure out that if we set the partition not to be exclusive, we get something that seems more reasonable. That is to say, if I use a partition like this: PartitionName=dlt_shared Nodes=dlt[01-12] Default=NO Shared=YES MaxTime=4-00:00:00 State=UP DefaultTime=8:00:00 wit…
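The behavior fits together like this: with an exclusive partition a job is allocated whole nodes, so every GPU on the node appears in CUDA_VISIBLE_DEVICES; allowing the partition to be shared lets the cons_res scheduler hand each job only the GPUs it requested. The partition line from the message, reassembled for readability (note that in recent Slurm releases Shared= is the older spelling of OverSubscribe=):

```
# slurm.conf -- partition definition as posted in the message
PartitionName=dlt_shared Nodes=dlt[01-12] Default=NO Shared=YES MaxTime=4-00:00:00 DefaultTime=8:00:00 State=UP
```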

[slurm-users] inconsistent CUDA_VISIBLE_DEVICES with srun vs sbatch

2021-05-19 Thread Tim Carlson
Hey folks, Here is my setup: slurm-20.11.4 on x86_64 running CentOS 7.x with CUDA 11.1. The relevant parts of the slurm.conf and a particular gres.conf file are: SelectType=select/cons_res SelectTypeParameters=CR_Core PriorityType=priority/multifactor GresTypes=gpu NodeName=dlt[01-12] Gr…
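The message truncates before the node's Gres= definition. A hedged sketch of how the cons_res and GPU pieces typically fit together — the first four lines are from the message, while the GPU count and device paths are assumptions, not from the thread:

```
# slurm.conf excerpt (first four lines from the message)
SelectType=select/cons_res
SelectTypeParameters=CR_Core
PriorityType=priority/multifactor
GresTypes=gpu
# GPU count below is an assumption; the original line is truncated
NodeName=dlt[01-12] Gres=gpu:8

# gres.conf on each dlt node -- device paths are assumptions
Name=gpu File=/dev/nvidia[0-7]
```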

Re: [slurm-users] What is an easy way to prevent users from running programs on the master/login node.

2021-05-19 Thread Alan Orth
Regarding setting limits for users on the head node: we had this for years: # CPU time in minutes * - cpu 30 root - cpu unlimited. But we eventually found that this was causing even legitimate long-running jobs like rsync/scp to fail when users were copying data to the cluster. F…
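The flattened fragment above appears to be /etc/security/limits.conf entries; reconstructed with the standard column layout:

```
# /etc/security/limits.conf -- CPU time limit, in minutes
# <domain>  <type>  <item>  <value>
*           -       cpu     30          # all users: 30 min of CPU time per process
root        -       cpu     unlimited   # root is exempt
```

Because RLIMIT_CPU counts accumulated CPU time per process, even legitimate long-running transfers such as rsync/scp can eventually hit the limit, which is the failure mode described above.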