[slurm-users] PMIx + openMPI with heterogeneous jobs

2023-05-24 Thread Bertini, Denis Dr.
I am facing the same problem that was reported long ago (2019) in this mailing 
list thread:


https://lists.schedmd.com/pipermail/slurm-users/2019-July/003785.html


but with a more recent software stack, i.e.:


slurm 21.08.8-2
PMIx 2.2.5 (pmix-2.2.5-1.el8.src.rpm)
openMPI 4.1.5

Similarly to my predecessor, running MPI heterogeneous jobs (OSU benchmarks) 
with this Slurm+PMIx version installed on the host sporadically gives this 
type of error:

>>>
slurmstepd: error:  mpi/pmix_v2: _tcp_connect: lxbk1177 [0]: 
pmixp_dconn_tcp.c:139: Cannot establish the connection
slurmstepd: error:  mpi/pmix_v2: pmixp_dconn_connect: lxbk1177 [0]: 
pmixp_dconn.h:246: Cannot establish direct connection to lxbk1177 (0)
slurmstepd: error:  mpi/pmix_v2: _process_extended_hdr: lxbk1177 [0]: 
pmixp_server.c:738: Unable to connect to 0
slurmstepd: error:  mpi/pmix_v2: pmixp_coll_ring_check: lxbk1177 [0]: 
pmixp_coll_ring.c:618: 0x14cd84047ab0: unexpected contrib from lxbk1177:0, 
expected is 1
slurmstepd: error:  mpi/pmix_v2: _process_server_request: lxbk1177 [0]: 
pmixp_server.c:942: 0x14cd84047ab0: unexpected contrib from lxbk1177:0, 
coll->seq=0, seq=0
>>>

So it is a very similar problem indeed.
Additionally, when the job completes, from time to time it cannot finish 
properly and stays in the RUNNING state, so one needs to cancel the job
manually.

Does the hetjob functionality really support this case?
If yes, any idea what could be wrong here?



Job submission details:
=======================


- submit script:

sbatch --ntasks 1 --ntasks-per-core 1 --cpus-per-task 2 -p main -D ./data \
       -o %j.out.log -e %j.err.log \
       : --ntasks 1 --ntasks-per-core 1 --cpus-per-task 1 -p main -D ./data \
       -o %j.out.log -e %j.err.log \
       ./run-file.sh



- run-file.sh:



export CONT=.sif

srun -vv --mpi=pmix --export=ALL : $CONT collective/osu_allreduce -f -i 100 -x 10
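
For reference, the same two-component allocation can also be written as a
single batch script using the "#SBATCH hetjob" separator (just a sketch, not
what I actually submit; the options simply mirror the command line above):

#!/bin/bash
#SBATCH --ntasks=1 --ntasks-per-core=1 --cpus-per-task=2
#SBATCH --partition=main --chdir=./data
#SBATCH --output=%j.out.log --error=%j.err.log
#SBATCH hetjob
#SBATCH --ntasks=1 --ntasks-per-core=1 --cpus-per-task=1
#SBATCH --partition=main --chdir=./data
#SBATCH --output=%j.out.log --error=%j.err.log

./run-file.sh

On the compute nodes, "srun --mpi=list" shows which MPI/PMIx plugin types this
installation actually offers.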




-
Denis Bertini
Abteilung: CIT
Ort: SB3 2.265a

Tel: +49 6159 71 2240
Fax: +49 6159 71 2986
E-Mail: d.bert...@gsi.de

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de



[slurm-users] Restrictions for new/inefficient users?

2023-05-24 Thread Loris Bennett
Hi,

We have the problem that increasing numbers of new users have little to
no idea about the amount of resources their programs can use
efficiently.  Thus, they will often just request 32 cores, because
that's what most of our nodes have, and 128 or 256 GB, for reasons which
are unclear to me, even though the majority of our nodes don't have that
much RAM.

I know that Ole's slurmaccounts has a NEWUSER flag which allows
resources to be restricted for users for a certain period of time.  I
worry that in our case some users work quite sporadically and don't
acquire experience very quickly, so that lifting the restrictions
after a fixed time might not be that useful.

Another approach might be to trigger an EXPERT flag if some appropriate
combination of percentage efficiency and absolute resource usage is
reached.
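
To make that concrete, the check I have in mind would look roughly like the
sketch below; the "newuser" QOS name and the thresholds are just placeholders
of mine, not something Slurm or slurmaccounts provides out of the box:

#!/bin/bash
# Sketch: look at a user's jobs over the last 30 days and, if both the CPU
# efficiency and the absolute usage pass site-defined thresholds, swap a
# hypothetical restrictive "newuser" QOS for the normal one.
USER_NAME="$1"
START=$(date -d '30 days ago' +%Y-%m-%d)

read -r pct alloc_sec <<< "$(
  sacct -u "$USER_NAME" -S "$START" -X -n -P -o TotalCPU,CPUTimeRAW |
  awk -F'|' '
    function toseconds(t,  a, p, n, d) {   # TotalCPU is [DD-][HH:]MM:SS
      d = 0
      if (t ~ /-/) { split(t, a, "-"); d = a[1]; t = a[2] }
      n = split(t, p, ":")
      return d*86400 + p[n-2]*3600 + p[n-1]*60 + p[n]
    }
    { used += toseconds($1); alloc += $2 }   # CPUTimeRAW = allocated core-seconds
    END { if (alloc > 0) printf "%.0f %.0f\n", 100*used/alloc, alloc }')"

# Thresholds are illustrative: at least 70% efficiency over 1000+ core-hours.
if [ -n "$pct" ] && [ "$pct" -ge 70 ] && [ "$alloc_sec" -ge $((1000*3600)) ]; then
  sacctmgr -i modify user where name="$USER_NAME" set qos-=newuser qos+=normal
fi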

Has anyone tried anything like this, or some other approach?

Cheers,

Loris

-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin



[slurm-users] hi-priority partition and preemption

2023-05-24 Thread Fabrizio Roccato
Hi all,
I'm trying to have two overlapping partitions, say normal and hi-pri,
so that when jobs are launched in the second one they can preempt the jobs
already running in the first one, automatically putting them into the suspended
state. After completion, the jobs in the normal partition must be
automatically resumed.

Here are my (relevant) slurm.conf settings:

> PreemptMode=suspend,gang
> PreemptType=preempt/partition_prio
> 
> PartitionName=normal Nodes=node0[01-08] MaxTime=1800 PriorityTier=100 
> AllowAccounts=group1,group2 OverSubscribe=FORCE:20 PreemptMode=suspend
> PartitionName=hi-pri Nodes=node0[01-08] MaxTime=360 PriorityTier=500 
> AllowAccounts=group2 OverSubscribe=FORCE:20 PreemptMode=off

But with this, jobs in the hi-pri partition were put into the PD state and the ones
already running in the normal partition continued in their R status.
What am I doing wrong? What am I missing?

Since I have jobs that must run at a specific time and must have priority over
all others, is this the correct way to do it?


Thanks 

FR 



Re: [slurm-users] hi-priority partition and preemption

2023-05-24 Thread Loris Bennett
Hi Fabrizio,

Fabrizio Roccato  writes:

> Hi all,
>   I'm trying to have two overlapping partitions, say normal and hi-pri,
> so that when jobs are launched in the second one they can preempt the jobs
> already running in the first one, automatically putting them into the suspended
> state. After completion, the jobs in the normal partition must be
> automatically resumed.
>
> here are my (relevant) slurm.conf settings:
>
>> PreemptMode=suspend,gang
>> PreemptType=preempt/partition_prio
>> 
>> PartitionName=normal Nodes=node0[01-08] MaxTime=1800
>> PriorityTier=100 AllowAccounts=group1,group2 OverSubscribe=FORCE:20
>> PreemptMode=suspend
>> PartitionName=hi-pri Nodes=node0[01-08] MaxTime=360 PriorityTier=500
>> AllowAccounts=group2 OverSubscribe=FORCE:20 PreemptMode=off
>
> But with this, jobs in the hi-pri partition were put into the PD state and the ones
> already running in the normal partition continued in their R status.
> What am I doing wrong? What am I missing?

We don't do anything like this, so the following may be incorrect.
However, my understanding is that even if two partitions include the
same node, the node can only be in one partition at one time.  So if a
job requests the partition 'normal' and the job starts, that node is then
in the partition 'normal' only, so no job requesting 'hi-pri' can start
on the node, because it is not a member of that partition.

We use QOS to set different priorities, but we don't use preemption.

> Since I have jobs that must run at a specific time and must have priority over
> all others, is this the correct way to do it?

For this I would probably use a recurring reservation.
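
Something along these lines, where all the names, times, and counts are
placeholders (so a sketch, not a recipe):

scontrol create reservation reservationname=nightly_run \
    starttime=2023-06-01T03:00:00 duration=02:00:00 flags=DAILY \
    users=someuser partitionname=hi-pri nodecnt=2

With flags=DAILY the reservation repeats every day, and only jobs from the
listed users which request it (--reservation=nightly_run) run inside that
window.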

Cheers,

Loris

> Thanks 
>
> FR 
-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin



[slurm-users] slurmstepd error after upgrade to 23.02

2023-05-24 Thread Hagdorn, Magnus Karl Moritz
Hi all,
we have recently upgraded Slurm to 23.02. Since then we have been getting the
following error in our logs:

May 21 03:23:27 s-sc-gpu001 slurmstepd[2723991]: error:
slurm_send_node_msg: hash_g_compute: REQUEST_STEP_COMPLETE has error
May 21 03:24:27 s-sc-gpu001 slurmstepd[2723991]: error: hash_g_compute:
hash plugin with id:0 not exist or is not loaded

once every minute. We are using Rocky 8; Slurm is installed from RPMs
which we build using mock. I have noticed that there might be some
package discrepancies between the mock chroot in which we build the RPMs
and the versions on the nodes. At the moment this is my best guess as
to what is going on. Has anybody seen this issue before and, even better,
do you know how to solve it?
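
In case it helps, this is roughly how I am comparing the build chroot with
the nodes (just my checklist, paths may differ elsewhere):

# Installed Slurm packages, both in the mock chroot and on a node:
rpm -qa 'slurm*'

# Versions of the daemons/tools actually in use on a node:
slurmd -V
scontrol version

# Hash plugins that were built and installed (the error above complains
# about hash plugin id:0):
ls /usr/lib64/slurm/hash_*.so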

Regards
magnus


-- 
Magnus Hagdorn
Charité – Universitätsmedizin Berlin
Geschäftsbereich IT | Scientific Computing
 
Campus Charité Virchow Klinikum
Forum 4 | Ebene 02 | Raum 2.020
Augustenburger Platz 1
13353 Berlin
 
magnus.hagd...@charite.de
https://www.charite.de
HPC Helpdesk: sc-hpc-helpd...@charite.de




Re: [slurm-users] hi-priority partition and preemption

2023-05-24 Thread Groner, Rob
What you are describing is definitely doable.  We have our system set up 
similarly.  All nodes are in both the "open" partition and the "prio" partition, 
but a job submitted to the "prio" partition will preempt the open jobs.

I don't see anything clearly wrong with your slurm.conf settings.  Ours are 
very similar, though we use only FORCE:1 for oversubscribe.  You might try that 
just to see if there's a difference.

What are the sbatch settings you are using when you submit the jobs?

Do you have PreemptExemptTime set to anything in slurm.conf?

What reason does squeue give for the high-priority jobs being pending?
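
Something like the following shows state and pending reason in one go (the
format string is only an example):

squeue -p hi-pri -o "%.18i %.9P %.8u %.8T %.20r"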

For your "run regularly" goal, you might consider scrontab.  Once the priority 
and preemption issue is figured out, that will start the job at a regular time.
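
As a sketch (partition, walltime and script path are placeholders, and if I
remember correctly scrontab has to be enabled with ScronParameters=enable in
slurm.conf), an entry to start something every day at 03:00 would be:

#SCRON --partition=hi-pri --time=01:00:00
0 3 * * * /path/to/run-critical-job.sh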

Rob





[slurm-users] Usage gathering for GPUs

2023-05-24 Thread Fulton, Ben
Hi,

The release notes for 23.02 say "Added usage gathering for gpu/nvml (Nvidia) 
and gpu/rsmi (AMD) plugins".

How would I go about enabling this?

Thanks!
--
Ben Fulton
Research Applications and Deep Learning
Research Technologies
Indiana University



Re: [slurm-users] Usage gathering for GPUs

2023-05-24 Thread Christopher Samuel

On 5/24/23 11:39 am, Fulton, Ben wrote:

> Hi,

Hi Ben,

> The release notes for 23.02 say “Added usage gathering for gpu/nvml
> (Nvidia) and gpu/rsmi (AMD) plugins”.
>
> How would I go about enabling this?


I can only comment on the NVIDIA side (as those are the GPUs we have), 
but for that you need Slurm built with NVML support and running with 
"AutoDetect=NVML" in gres.conf; that information is then stored in 
slurmdbd as part of the TRES usage data.
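
Roughly, the relevant bits of config look like this (a sketch from memory,
adjust to your site; I'm not certain the AccountingStorageTRES line is
strictly required, the usage TRES may be picked up automatically):

# gres.conf on the GPU nodes
AutoDetect=nvml

# slurm.conf
GresTypes=gpu
AccountingStorageTRES=gres/gpu,gres/gpuutil,gres/gpumem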


For example, to grab the GPU usage of a job step from a test code I ran the other day:

csamuel@perlmutter:login01:~> sacct -j 9285567.0 -Pno TRESUsageInAve | tr , \\n | fgrep gpu

gres/gpumem=493120K
gres/gpuutil=76

Hope that helps!

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA