Thanks for the info.
The thing is, I don't want to mark the node as completely unhealthy. Assume the
following scenario:
compute-0-0 is running Slurm jobs and its system load is 15 (32 cores)
compute-0-1 is running non-Slurm jobs and its system load is 25 (32 cores)
Then a new Slurm job should be dispatched to compute-0-0, since it has the lower load.
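For what it's worth, Slurm does record the OS-reported load per node, even though
the default scheduler ignores it for placement. Assuming a reasonably recent sinfo,
something like

    sinfo -N -O nodelist,cpusload,statelong

or, for a single node,

    scontrol show node compute-0-0 | grep CPULoad

shows the CPULoad value Slurm currently sees for each node.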
On 23/4/19 3:02 pm, Jeffrey R. Lang wrote:
Looking at the NodeList and NumNodes, they are both incorrect. They
should show the first node and then the additional nodes assigned.
You're only looking at the second of the two pack jobs for your
submission; could they be assigned in the first one?
I'm testing heterogeneous jobs for a user on our cluster, but I'm seeing what I
think is incorrect output from "scontrol show job XXX" for the job. The cluster is
currently running Slurm 18.08.
So my job script looks like this:
#!/bin/sh
### This is a general SLURM script. You'll need to make modifications
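A minimal heterogeneous (pack) job script in 18.08 looks roughly like the sketch
below; the resource counts and program names are placeholders, and the
"#SBATCH packjob" line is what separates the components (it was renamed to hetjob
in later releases):

    #!/bin/sh
    ### First pack component
    #SBATCH --job-name=pack-test
    #SBATCH --nodes=1
    #SBATCH --ntasks=4
    #SBATCH packjob
    ### Second pack component
    #SBATCH --nodes=2
    #SBATCH --ntasks=16

    srun --pack-group=0 ./leader &
    srun --pack-group=1 ./worker &
    wait

With that structure, "scontrol show job <jobid>" should report one record per
component, each with its own NodeList and NumNodes, and squeue shows them as
<jobid>+0 and <jobid>+1.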
Hi Sven,
You'll probably be better served by switching your purge time units to
hours instead of months; this will purge much smaller amounts of data,
much more frequently (once per hour instead of once per month). Also,
depending on your job throughput and on how long your DB has been storing
records, the first purge pass may still take quite a while.
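For example (the values here are only illustrative; pick retention periods that
match your site's policy), slurmdbd.conf could contain:

    PurgeEventAfter=720hours
    PurgeJobAfter=2160hours
    PurgeStepAfter=2160hours
    PurgeSuspendAfter=720hours

After restarting slurmdbd, the purge work is then done in small hourly passes
instead of one large monthly one.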
On 4/23/19 2:47 AM, Mahmood Naderan wrote:
Hi,
How can I change the job distribution policy? Since some nodes are
running non-Slurm jobs, it seems that the dispatcher isn't aware of the
system load and therefore assumes those nodes are free.
I want to change the policy so that it takes the system load into account.
Hi Mahmood,
Try the LBNL Node Health Check tool. Nodes which are determined to be
"unhealthy" can be marked as down or offline so as to prevent jobs from being
scheduled or run on them.
https://github.com/mej/nhc/blob/master/README.md#lbnl-node-health-check-nhc
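A rough sketch of how that can fit together (the load threshold is made up, and
the exact check name and arguments should be verified against the NHC docs):
nhc.conf gets a load-average check, e.g.

    * || check_ps_loadavg 20   # flag the node when the load average exceeds 20

and slurm.conf points the periodic health check at NHC (install path may differ):

    HealthCheckProgram=/usr/sbin/nhc
    HealthCheckInterval=300
    HealthCheckNodeState=ANY

so that a node whose load (including non-Slurm work) climbs past the threshold
gets marked offline/drained automatically.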
Regards,
Richard
@cnscfr