On 1/19/21 11:25 PM, Sean Crosby wrote:
Hi Adrian,
Hi!

From this output:

AVAIL NODES(A/I/O/T)  CPUS(A/I/O/T)    DEFAULTTIME    TIMELIMIT
up      23/0/0/23     837/587/0/1424   1-00:00:00   2-00:00:00

It shows that all 23 nodes have at least one job running on them.
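
To make the A/I/O/T (allocated/idle/other/total) arithmetic explicit, here is a minimal sketch that splits such a field and reports the allocated share. The function name cpu_usage is made up for illustration and is not part of Slurm.

```shell
#!/bin/sh
# cpu_usage is a hypothetical helper, not a Slurm command.
# It takes one A/I/O/T field (allocated/idle/other/total) as its argument.
cpu_usage() {
    printf '%s\n' "$1" | awk -F/ '{
        # $1 = allocated, $2 = idle, $3 = other, $4 = total
        printf "allocated=%d idle=%d other=%d total=%d allocated_pct=%d\n",
               $1, $2, $3, $4, int($1 * 100 / $4)
    }'
}
```

Calling cpu_usage 837/587/0/1424 on the CPU column above gives an allocated share of 58 percent. On a live cluster the field could probably be fed in with something like cpu_usage "$(sinfo -h -o %C)", assuming a single partition.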

So what happens if you run scontrol show node on a few of the nodes? I'm particularly interested in the AllocTRES section.

e.g. for one of my nodes,

# scontrol show node spartan-bm055 | grep 'NodeName\|CfgTRES\|AllocTRES'
NodeName=spartan-bm055 Arch=x86_64 CoresPerSocket=18
    CfgTRES=cpu=72,mem=1519000M,billing=6005
    AllocTRES=cpu=72,mem=441840M

This shows that the node has 72 cores and about 1.5 TB of RAM (the CfgTRES part), and that jobs are currently using all 72 cores and about 442 GB of RAM (the AllocTRES part).

I would run the same command on 4 or 5 of the nodes on your cluster, and we'll have a better idea about what's going on.
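
If checking nodes one at a time gets tedious, the comparison can be scripted. This is only a sketch: tres_summary is a made-up helper name, and it assumes the scontrol show node output keeps the NodeName=, CfgTRES= and AllocTRES= layout shown above.

```shell
#!/bin/sh
# tres_summary is a hypothetical helper, not a Slurm command.
# It reads `scontrol show node` output on stdin and prints one line per node
# with the configured and allocated CPU counts.
tres_summary() {
    awk '
        # Pull the number after "cpu=" out of a TRES line; 0 if absent.
        function tres_cpu(line) {
            if (match(line, /cpu=[0-9]+/))
                return substr(line, RSTART + 4, RLENGTH - 4)
            return 0
        }
        # The NodeName= field starts a new node record.
        /NodeName=/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^NodeName=/) name = substr($i, 10)
        }
        /CfgTRES=/   { cfg = tres_cpu($0) }
        /AllocTRES=/ { printf "%s cfg_cpu=%s alloc_cpu=%s\n", name, cfg, tres_cpu($0) }
    '
}
```

Usage would be along the lines of scontrol show node | tres_summary. Nodes where alloc_cpu stays well below cfg_cpu while jobs are pending are the ones worth a closer look.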
First of all, thanks for answering and for the tip (I did not think to look there).

But it turned out that my fs.file-max setting of 65500, which had been fine so far on the nodes with 48 slots, was no longer enough for the new single-socket nodes with 128 slots :) This killed the service that serves the actual files/software for the jobs, so those nodes were a bit like zombies: the jobs already present were in memory and still running, but no new jobs could be started on those nodes.
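
For anyone hitting the same thing, a rough way to see how close a node is to fs.file-max is to read /proc/sys/fs/file-nr, which on Linux holds three numbers: allocated file handles, allocated but unused handles, and the maximum. The helper name file_handle_usage below is made up for illustration.

```shell
#!/bin/sh
# file_handle_usage is a hypothetical helper.
# It reads one line in /proc/sys/fs/file-nr format on stdin:
#   <allocated> <allocated-but-unused> <max>
file_handle_usage() {
    awk '{
        printf "allocated=%d max=%d used_pct=%d\n", $1, $3, int($1 * 100 / $3)
    }'
}
```

On a node this would be run as file_handle_usage < /proc/sys/fs/file-nr. If the percentage sits near 100, the limit can be raised with sysctl -w fs.file-max=<value>; the right value depends on the node and its workload.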

Thanks and again sorry for the noise!
Adrian



Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia



On Wed, 20 Jan 2021 at 06:50, Adrian Sevcenco <adrian.sevce...@spacescience.ro> wrote:

    Hi! So, I have a very strange situation that I do not even know how to
    troubleshoot...
    I'm running with
    SelectType=select/cons_res
    SelectTypeParameters=CR_CPU_Memory,CR_LLN
    TaskPlugin=task/affinity,task/cgroup
    TaskPluginParam=autobind=threads

    and a partition defined with:
    LLN=yes DefMemPerCPU=4000 MaxMemPerCPU=4040

    PriorityType=priority/basic
    SchedulerType=sched/builtin

    This is a HEP cluster, so only serial single thread jobs.

    (physically all nodes have 4 GB/thread)
    the nodes are defined (now, only after a lot of experimentation and
    the realization that node properties can be incompatible with
    CR_CPU) with just CPUs and RealMemory (obtained from slurmd -C
    on each node)

    and with FastSchedule=0

    the problem is that the partition is stuck at a low number of
    allocated CPUs (around 834 out of 1424)

    AVAIL NODES(A/I/O/T)  CPUS(A/I/O/T)    DEFAULTTIME    TIMELIMIT
    up      23/0/0/23     837/587/0/1424   1-00:00:00   2-00:00:00


    I set SlurmctldDebug=debug and
    DebugFlags=Priority,SelectType,NodeFeatures,CPU_Bind,NO_CONF_HASH

    but I am not able to recognize anything as a problem.

    Does anyone have any idea why not all my slots are being used?

    Thank you!!
    Adrian




--
----------------------------------------------
Adrian Sevcenco, Ph.D.                       |
Institute of Space Science - ISS, Romania    |
adrian.sevcenco at {cern.ch,spacescience.ro} |
----------------------------------------------

