Hi Brian and Hermann,
True, this makes a lot of sense. I will consider setting up Hermann's
configuration, defaulting to "--hint=nomultithread".
Thanks!
Sebastian
On 13.02.23 15:29, Brian Andrus wrote:
Hermann makes a good point.
The concept of hyper-threading is not doubling cores. It is a single
core that can 'instantly' switch work from one process to another.
Only one is being worked on at any given time.
So if I request a single core on a hyper-threaded system, I would not
be pleased to find you are giving it to someone else half the time. I
would need to have the actual core assigned. If I request multiple
cores and my app is only contending with itself, then I _may_ benefit
from hyper-threading.
In general, enabling hyper-threading is not the best practice for
efficient HPC jobs. The goal is that every process is utilizing the
CPU as close to 100% as possible, which would render hyper-threading
moot.
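If you want to be certain of getting whole physical cores on a per-job
basis, something like this should do it (an untested sketch; "./my_app"
is just a placeholder):

srun --cpus-per-task=4 --hint=nomultithread ./my_app

With that hint Slurm should bind the task to 4 physical cores rather
than 4 hardware threads.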
Brian Andrus
On 2/13/2023 12:15 AM, Hermann Schwärzler wrote:
Hi Sebastian,
I am glad I could help (although not exactly as expected :-).
With your node-configuration you are "circumventing" how Slurm
behaves, when using "CR_Core": if you read the respective part in
https://slurm.schedmd.com/slurm.conf.html
it says:
"CR_Core
[...] On nodes with hyper-threads, each thread is counted as a CPU
to satisfy a job's resource requirement, but multiple jobs are not
allocated threads on the same core."
That's why you got a full core (both threads) when allocating a single
CPU, or e.g. four threads when allocating three CPUs, and so forth.
"Lying" to Slurm about the actual hardware setup helps to avoid this
behaviour, but are you really comfortable with potentially running two
different jobs on the hyper-threads of the same core?
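If you want to double-check what a job really gets, you could look at
the affinity mask from inside an allocation (a sketch; this assumes
Linux and that Slurm's task-affinity or cgroup binding is in effect):

srun --cpus-per-task=1 grep Cpus_allowed_list /proc/self/status

With your ThreadsPerCore=1 definition this should list a single
hardware thread; with the real topology and CR_Core it would list both
threads of one core.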
Regards,
Hermann
On 2/12/23 22:04, Sebastian Schmutzhard-Höfler wrote:
Hi Hermann,
Using your suggested settings did not work for us.
When trying to allocate a single thread with --cpus-per-task=1, it
still reserved a whole core (two threads). On the other hand, when
requesting an even number of threads, it does what it should.
However, I could make it work by using
SelectTypeParameters=CR_Core
NodeName=nodename Sockets=2 CoresPerSocket=128 ThreadsPerCore=1
instead of
SelectTypeParameters=CR_Core
NodeName=nodename Sockets=2 CoresPerSocket=64 ThreadsPerCore=2
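(As a quick sanity check, assuming nproc honours the affinity mask,

srun --cpus-per-task=1 nproc

should now print 1, where it previously would have printed 2.)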
So your suggestion brought me in the right direction. Thanks!
If anyone thinks this is complete nonsense, please let me know!
Best wishes,
Sebastian
On 11.02.23 11:13, Hermann Schwärzler wrote:
Hi Sebastian,
we did a similar thing just recently.
We changed our node settings from
NodeName=DEFAULT CPUs=64 Boards=1 SocketsPerBoard=2
CoresPerSocket=32 ThreadsPerCore=2
to
NodeName=DEFAULT Boards=1 SocketsPerBoard=2 CoresPerSocket=32
ThreadsPerCore=2
in order to make the use of individual hyper-threads possible (we use
this in combination with
SelectTypeParameters=CR_Core_Memory).
This works as expected: after this, when e.g. asking for
--cpus-per-task=4 you will get 4 hyper-threads (2 cores) per task
(unless you also specify e.g. "--hint=nomultithread").
So you might try to remove the "CPUs=256" part of your
node specification and let Slurm calculate the number of CPUs itself.
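(If CPUs is omitted, Slurm derives it from the topology as
Boards x SocketsPerBoard x CoresPerSocket x ThreadsPerCore,
i.e. 1 x 2 x 32 x 2 = 128 for our nodes and 1 x 2 x 64 x 2 = 256 for
yours. You can also compare against what slurmd detects on the node
itself with "slurmd -C".)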
BTW, on a side note: as most of our users do not bother to use
hyper-threads, or actively do not want to because their programs might
suffer from doing so, we made "--hint=nomultithread" the default in our
installation by adding
CliFilterPlugins=cli_filter/lua
to our slurm.conf and creating a cli_filter.lua file in the same
directory as slurm.conf, which contains this:
function slurm_cli_setup_defaults(options, early_pass)
    -- make "no hyper-threading" the default for every job;
    -- an explicit --hint on the command line still overrides this
    options['hint'] = 'nomultithread'
    return slurm.SUCCESS
end
(see also
https://github.com/SchedMD/slurm/blob/master/etc/cli_filter.lua.example).
So if users really want to use hyper-threads, they have to add
"--hint=multithread" to their job/allocation options.
Regards,
Hermann
On 2/10/23 00:31, Sebastian Schmutzhard-Höfler wrote:
Dear all,
we have a node with 2 x 64 CPUs (with two threads each) and 8
GPUs, running slurm 22.05.5
In order to make use of individual threads, we changed

SelectTypeParameters=CR_Core
NodeName=nodename CPUs=256 Sockets=2 CoresPerSocket=64 ThreadsPerCore=2

to

SelectTypeParameters=CR_CPU
NodeName=nodename CPUs=256
We are now able to allocate individual threads to jobs, despite
the following error in slurmd.log:
error: Node configuration differs from hardware: CPUs=256:256(hw)
Boards=1:1(hw) SocketsPerBoard=256:2(hw) CoresPerSocket=1:64(hw)
ThreadsPerCore=1:2(hw)
However, it appears that since this change, we can only make use
of 4 out of the 8 GPUs.
The output of "sinfo -o %G" might be relevant.
In the first situation it was
$ sinfo -o %G
GRES
gpu:A100:8(S:0,1)
Now it is:
$ sinfo -o %G
GRES
gpu:A100:8(S:0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126)
Has anyone faced this or a similar issue and can give me some
directions?
Best wishes
Sebastian