Hi Prentice,
you are right, and I looked into the wrapper script (not my part of the
code; I had never touched it before).
In fact, the MPI processes are spawned on the backend nodes; the only process
remaining on the login/frontend node is the spawner process.
The wrapper checks the load on the nodes a
Hi,
We have a cluster (in Google GCP) which has a few partitions set up to
auto-scale, but one partition is set up not to autoscale. The desired
state is for all of the nodes in this non-autoscaled partition
(SuspendExcParts=gpu-t4-4x-ondemand) to keep running uninterrupted.
However, we
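The message is cut off above, but for context, excluding a partition from Slurm's power-save/autoscale machinery is normally done with SuspendExcParts in slurm.conf. A minimal sketch; the timings, script paths, and values here are assumptions for illustration, not the poster's actual config:

```
# slurm.conf power-save excerpt -- hypothetical values
SuspendTime=300                          # seconds a node sits idle before suspension
SuspendProgram=/usr/local/bin/suspend.sh # site-specific script (assumed path)
ResumeProgram=/usr/local/bin/resume.sh   # site-specific script (assumed path)
# Partitions listed here are never suspended / scaled down:
SuspendExcParts=gpu-t4-4x-ondemand
```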
As a follow-up, we did figure out that if we set the partition to not be
exclusive, we get behavior that seems more reasonable.
That is to say, if I use a partition defined like this:
PartitionName=dlt_shared Nodes=dlt[01-12] Default=NO Shared=YES
MaxTime=4-00:00:00 State=UP DefaultTime=8:00:00
wit
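The message is truncated, but the contrast being drawn appears to be between an exclusive and a shared partition over the same nodes. A sketch of the two variants; only dlt_shared appears in the message, and the exclusive variant is my assumption:

```
# Exclusive: each job gets whole nodes to itself (hypothetical counterpart)
PartitionName=dlt_excl Nodes=dlt[01-12] Shared=EXCLUSIVE MaxTime=4-00:00:00 State=UP
# Shared: jobs can coexist on a node, as quoted above
PartitionName=dlt_shared Nodes=dlt[01-12] Default=NO Shared=YES MaxTime=4-00:00:00 DefaultTime=8:00:00 State=UP
```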
Hey folks,
Here is my setup:
slurm-20.11.4 on x86_64 running CentOS 7.x with CUDA 11.1
The relevant parts of the slurm.conf and a particular gres.conf file are:
SelectType=select/cons_res
SelectTypeParameters=CR_Core
PriorityType=priority/multifactor
GresTypes=gpu
NodeName=dlt[01-12] Gr
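The NodeName line is cut off at "Gr" (presumably a Gres= specification). A purely hypothetical completion, paired with a matching gres.conf, just to show the usual shape; the GPU count, type, CPU/memory figures, and device paths are all assumptions:

```
# slurm.conf -- hypothetical, not the poster's actual line
NodeName=dlt[01-12] Gres=gpu:4 CPUs=32 RealMemory=192000 State=UNKNOWN

# gres.conf on each dlt node -- type and device paths assumed
Name=gpu Type=v100 File=/dev/nvidia[0-3]
```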
Regarding setting limits for users on the head node. We had this for years:
# CPU time in minutes
* - cpu 30
root - cpu unlimited
But we eventually found that this caused even legitimate long-running
processes like rsync/scp to fail when users were copying data to the cluster. F
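One common alternative (a sketch of a generic approach, not necessarily what was done here) is to cap each user's CPU share on the head node with a systemd slice instead of a hard CPU-time limit, so long-lived but light processes such as rsync/scp get throttled rather than killed. This needs a reasonably recent systemd, and the drop-in path and values below are assumptions:

```
# /etc/systemd/system/user-.slice.d/50-limits.conf -- hypothetical drop-in
[Slice]
CPUQuota=100%      # at most one core's worth of CPU per user session
MemoryMax=8G       # optional: cap memory as well
```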