Hi Chris,

It is not my intention to run such a job myself; I'm just trying to reproduce a problematic behaviour. My users are submitting jobs like this.

The output of job 2 was a bad example; as I saw later, the job was not yet running. That output changes for a running job and looks more like:
   NumNodes=1 NumCPUs=48 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=48,mem=240000M,node=1,billing=58

Let me first give you a little more background:
complete node definition from our slurm.conf:
NodeName=nrm[001-208] Sockets=4 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=187200 Feature=skylake,skx8160,hostok,hpcwork Weight=100622 State=UNKNOWN
complete partition definition from our slurm.conf:
PartitionName=c18m PriorityTier=100 Nodes=ncm[0001-1032],nrm[001-208] State=UP DefMemPerCPU=3900 TRESBillingWeights="CPU=1.0,Mem=0.25G"

We intend to bill fairly according to the resources the user asked for. That is why we use TRESBillingWeights together with PriorityFlags=MAX_TRES. If the user requests half of the node's memory but fewer cores, he should be billed for half of the node. The same applies to nodes with GPUs (not in the definition above): if a node has two GPUs and the user asks for one, he should be billed for at least half of the node.
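
To make the intention concrete, here is a minimal sketch of the billing formula I assume Slurm applies with PriorityFlags=MAX_TRES and the weights above (my reading of the documentation, not something I verified in the Slurm source):

   # assumed: billing = max( CPUs * 1.0, Mem[GB] * 0.25 )
   # e.g. a job using half the node's memory (93600 MB) on only a few cores
   # should be billed roughly half the node:
   echo "93600 / 1024 * 0.25" | bc -l   # ~22.9, close to 24 = half of the 48 cores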

So much for what we intended to do. But you can already see the problem in the output of job 2: the billing is 58, which is more than the 48 cores of the node. This is because the user asked for 5000 MB per CPU and two tasks, but since the job is exclusive, it gets all 48 CPUs. Slurm then multiplies the number of CPUs by the requested memory per CPU, so the job's TRES contains more memory than the node has altogether.
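
Just to spell out where the 58 comes from (same assumed MAX_TRES formula, numbers taken from the running job above):

   echo "48 * 5000"               | bc     # 240000 MB, as shown in TRES
   echo "48 * 5000 / 1024 * 0.25" | bc -l  # ~58.6  memory term
   echo "48 * 1.0"                | bc -l  # 48     CPU term
   # max(48, 58.6) is then reported as billing=58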
To make it even worse, I increased the memory:
#SBATCH --mem-per-cpu=90000

Output of the job (to be clear: scontrol show <jobid>):
   NumNodes=1 NumCPUs=48 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=48,mem=4320000M,node=1,billing=1054
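
The same arithmetic again, to show where the 4 TB and the billing of 1054 come from (still under the assumed MAX_TRES formula):

   echo "48 * 90000"               | bc     # 4320000 MB, roughly 4.1 TB
   echo "48 * 90000 / 1024 * 0.25" | bc -l  # ~1054.7, reported as billing=1054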

I know for sure that we have no node with 4 TB of memory. It might be that I have misunderstood TRES altogether, but I thought it reflects the (theoretical) usage of a job, i.e. what is blocked for other jobs. The user has no chance to actually use the memory he is being accounted for.


I then looked into what kind of cgroups are generated for these kinds of jobs.

Excerpt from the slurm.conf manpage:

       RealMemory
              Size of real memory on the node in megabytes (e.g. "2048").  The default value is 1.
              Lowering RealMemory with the goal of setting aside some amount for the OS and not
              available for job allocations will not work as intended if Memory is not set as a
              consumable resource in SelectTypeParameters.  So one of the *_Memory options need to
              be enabled for that goal to be accomplished.  Also see MemSpecLimit.

So at most 187200 MB should be available to a job on this partition, right? What happens if we omit the --mem-per-cpu option? We get exactly the number of CPUs times DefMemPerCPU, and the billing therefore is 48, as intended.
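
For the record, the numbers behind that (weights from our slurm.conf, same assumed formula as above):

   echo "48 * 3900"               | bc     # 187200 MB = exactly RealMemory
   echo "48 * 3900 / 1024 * 0.25" | bc -l  # ~45.7 memory term < 48 CPU term, so billing=48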
But what happens if we set, for example, --mem-per-cpu=10000?

   /sys/fs/cgroup/memory/slurm/uid_40574/job_7195054/memory.limit_in_bytes: 201226977280

That is 191905 MB, i.e. the full physical memory as reported by free -m on the node, and more than the defined RealMemory of 187200 MB. Since we have no swap enabled on the nodes, this means the job could crash the node.
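
In case someone wants to reproduce this, here is roughly how I compare the values on the compute node (the uid/job id in the cgroup path are of course specific to the example above, and this assumes cgroup v1 memory limits as on our systems):

   limit=$(cat /sys/fs/cgroup/memory/slurm/uid_40574/job_7195054/memory.limit_in_bytes)
   echo "cgroup limit: $(( limit / 1024 / 1024 )) MB"            # 191905 MB
   scontrol show node $(hostname) | grep -o 'RealMemory=[0-9]*'  # RealMemory=187200
   free -m | awk '/^Mem:/ {print "physical:", $2, "MB"}'         # what the OS sees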


This at least does not "feel" right.


Best
Marcus


On 8/20/19 4:58 PM, Christopher Benjamin Coffey wrote:
Hi Marcus,

What is the reason to add "--mem-per-cpu" when the job already has exclusive 
access to the node? Your job has access to all of the memory, and all of the cores on the 
system already. Also note, for non-mpi code like single core job, or shared memory 
threaded job, you want to ask for number of cpus with --cpus-per-task, or -c. Unless you 
are running mpi code, where you will want to use -n, and --ntasks instead to launch n 
copies of the code on n cores. In this case, because you asked for -n2, and also 
specified a mem-per-cpu request, the scheduler is doling out the memory as requested (2 x 
tasks), likely due to having SelectTypeParameters=CR_Core_Memory in slurm.conf.

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
On 8/20/19, 1:37 AM, "slurm-users on behalf of Marcus Wagner" 
<slurm-users-boun...@lists.schedmd.com on behalf of wag...@itc.rwth-aachen.de> wrote:

     Just made another test.
     Thank god, the exclusivity is not "destroyed" completely; only one job
     can run on the node when the job is exclusive. Nonetheless, this is
     somewhat unintuitive.
     I wonder if that also has an influence on the cgroups and the process
     affinity/binding.
     I will do some more tests.
     Best
     Marcus
On 8/20/19 9:47 AM, Marcus Wagner wrote:
     > Hi Folks,
     >
     >
     > I think I've stumbled over a BUG in Slurm regarding
     > exclusiveness. It might also be that I've misinterpreted something;
     > in the latter case, I would be happy if someone could explain it to me.
     >
     > To the background. I have set PriorityFlags=MAX_TRES
     > The TRESBillingWeights are "CPU=1.0,Mem=0.1875G" for a partition with
     > 48 core nodes and RealMemory 187200.
     >
     > ---
     >
     > I have two jobs:
     >
     > job 1:
     > #SBATCH --exclusive
     > #SBATCH --ntasks=2
     > #SBATCH --nodes=1
     >
     > scontrol show <jobid> =>
     >    NumNodes=1 NumCPUs=48 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
     >    TRES=cpu=48,mem=187200M,node=1,billing=48
     >
     > Exactly what I expected: I got 48 CPUs and therefore the billing is 48.
     >
     > ---
     >
     > job 2 (just added mem-per-cpu):
     > #SBATCH --exclusive
     > #SBATCH --ntasks=2
     > #SBATCH --nodes=1
     > #SBATCH --mem-per-cpu=5000
     >
     > scontrol show <jobid> =>
     >    NumNodes=1-1 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
     >    TRES=cpu=2,mem=10000M,node=1,billing=2
     >
     > Why "destroys" '--mem-per-cpu' exclusivity?
     >
     >
     >
     > Best
     > Marcus
     >
--
     Marcus Wagner, Dipl.-Inf.
     IT Center
     Abteilung: Systeme und Betrieb
     RWTH Aachen University
     Seffenter Weg 23
     52074 Aachen
     Tel: +49 241 80-24383
     Fax: +49 241 80-624383
     wag...@itc.rwth-aachen.de
     
     www.itc.rwth-aachen.de

--
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de

