Re: [slurm-users] Job dispatching policy

2019-04-23 Thread Mahmood Naderan
Thanks for the info. The thing is that I don't want to mark the node as unhealthy outright. Assume the following scenario: compute-0-0 is running Slurm jobs and its system load is 15 (32 cores); compute-0-1 is running non-Slurm jobs and its system load is 25 (32 cores). Then a new Slurm job should be dispatched to com
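To make the scenario above concrete, here is a minimal Python sketch that ranks the two nodes by load per core. It is only an illustration of the comparison being asked for: the Node type, the load_per_core and least_loaded helpers, and the "lowest load per core wins" rule are assumptions made for this example and are not part of Slurm or its scheduler.

```python
from typing import Iterable, NamedTuple


class Node(NamedTuple):
    """Hypothetical record for a compute node (not a Slurm data structure)."""
    name: str
    load: float   # system load average as reported by the OS
    cores: int    # number of cores on the node


def load_per_core(node: Node) -> float:
    """Normalise the load average by core count so nodes are comparable."""
    return node.load / node.cores


def least_loaded(nodes: Iterable[Node]) -> Node:
    """Assumed ranking rule: pick the node with the lowest load per core."""
    return min(nodes, key=load_per_core)


# Figures taken from the scenario in the message above.
nodes = [
    Node("compute-0-0", load=15.0, cores=32),
    Node("compute-0-1", load=25.0, cores=32),
]

for n in nodes:
    print(f"{n.name}: {load_per_core(n):.2f} load per core")
print("least loaded:", least_loaded(nodes).name)
```

Normalising by core count matters once nodes differ in size; with two 32-core nodes the raw load averages would rank the same way.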

Re: [slurm-users] scontrol for a heterogenous job appears incorrect

2019-04-23 Thread Chris Samuel
On 23/4/19 3:02 pm, Jeffrey R. Lang wrote: Looking at the nodelist and the NumNodes they are both incorrect. They should show the first node and then the additional nodes assigned. You're only looking at the second of the two pack jobs for your submission; could they be assigned in the firs

[slurm-users] scontrol for a heterogenous job appears incorrect

2019-04-23 Thread Jeffrey R. Lang
I'm testing heterogeneous jobs for a user on our cluster, but I'm seeing what I think is incorrect output from "scontrol show job XXX" for the job. The cluster is currently using Slurm 18.08. My job script looks like this: #!/bin/sh ### This is a general SLURM script. You'll need to make modificat

Re: [slurm-users] sacctmgr archive example

2019-04-23 Thread Lyn Gerner
Hi Sven, You'll probably be better served by switching your purge time units to hours instead of months; this makes each purge handle a much smaller amount of data, much more frequently (once per hour instead of once per month). Also, depending on your job throughput, and how long your DB has been sto

Re: [slurm-users] Job dispatching policy

2019-04-23 Thread Prentice Bisbal
On 4/23/19 2:47 AM, Mahmood Naderan wrote: Hi, How can I change the job distribution policy? Since some nodes are running non-slurm jobs, it seems that the dispatcher isn't aware of system load. Therefore, it assumes that the node is free. I want to change the policy based on the system load

Re: [slurm-users] Job dispatching policy

2019-04-23 Thread Richard Randriatoamanana
Hi Mahmood, Try the LBNL Node Health Check tool. Nodes which are determined to be "unhealthy" can be marked as down or offline so as to prevent jobs from being scheduled or run on them. https://github.com/mej/nhc/blob/master/README.md#lbnl-node-health-check-nhc Regards, Richard @cnscfr