[slurm-users] PMIx + openMPI with heterogeneous jobs
I am facing the same problem that was reported long ago (2019) in this mailing list thread: https://lists.schedmd.com/pipermail/slurm-users/2019-July/003785.html but with more recent versions, i.e.:

  slurm 21.08.8-2
  PMIx 2.2.5 (pmix-2.2.5-1.el8.src.rpm)
  openMPI 4.1.5

As with my predecessor, running MPI heterogeneous jobs (OSU benchmarks) using this slurm+PMIx version installed on the host sporadically gives this type of error:

>>>
slurmstepd: error: mpi/pmix_v2: _tcp_connect: lxbk1177 [0]: pmixp_dconn_tcp.c:139: Cannot establish the connection
slurmstepd: error: mpi/pmix_v2: pmixp_dconn_connect: lxbk1177 [0]: pmixp_dconn.h:246: Cannot establish direct connection to lxbk1177 (0)
slurmstepd: error: mpi/pmix_v2: _process_extended_hdr: lxbk1177 [0]: pmixp_server.c:738: Unable to connect to 0
slurmstepd: error: mpi/pmix_v2: pmixp_coll_ring_check: lxbk1177 [0]: pmixp_coll_ring.c:618: 0x14cd84047ab0: unexpected contrib from lxbk1177:0, expected is 1
slurmstepd: error: mpi/pmix_v2: _process_server_request: lxbk1177 [0]: pmixp_server.c:942: 0x14cd84047ab0: unexpected contrib from lxbk1177:0, coll->seq=0, seq=0
>>>

So a very similar problem indeed. Additionally, when the job completes, from time to time it cannot finish properly and stays in the RUNNING state, so one needs to cancel the job manually.

Does the hetjob functionality really support this case? If yes, any ideas what could be wrong here?

Job submission details:
==

- submit script:

sbatch --ntasks 1 --ntasks-per-core 1 --cpus-per-task 2 -p main -D ./data -o %j.out.log -e %j.err.log : --ntasks 1 --ntasks-per-core 1 --cpus-per-task 1 -p main -D ./data -o %j.out.log -e %j.err.log ./run-file.sh

- run-file.sh:

export CONT=.sif
srun -vv --mpi=pmix --export=ALL : $CONT collective/osu_allreduce -f -i 100 -x 10

-
Denis Bertini
Abteilung: CIT
Ort: SB3 2.265a
Tel: +49 6159 71 2240
Fax: +49 6159 71 2986
E-Mail: d.bert...@gsi.de

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de
Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung: Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
Chairman of the GSI Supervisory Board / Vorsitzender des GSI-Aufsichtsrats: Ministerialdirigent Dr. Volkmar Dietz
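[Editorial note: for readers less familiar with the het-job syntax used above, a minimal sketch of the same submission pattern. This is not the poster's actual script; the container image name, binary path and resource counts are placeholders.]

# submit two het-job components, separated by ":" on the sbatch command line
sbatch --ntasks 1 --cpus-per-task 2 -p main : --ntasks 1 --cpus-per-task 1 -p main ./run-file.sh

# run-file.sh (sketch): one srun launches a het step spanning both components,
# here wrapping a hypothetical OSU binary in a Singularity/Apptainer image
export CONT=osu.sif    # hypothetical image name
srun --mpi=pmix singularity exec $CONT ./osu_allreduce : singularity exec $CONT ./osu_allreduce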
[slurm-users] Restrictions for new/inefficient users?
Hi,

We have the problem that increasing numbers of new users have little to no idea about the amount of resources their programs can use efficiently. Thus, they will often just request 32 cores, because that's what most of our nodes have, and 128 or 256 GB, for reasons which are unclear to me, even though the majority of our nodes don't have that much RAM.

I know that Ole's slurmaccounts has a NEWUSER flag which allows resources to be restricted for users for a certain period of time. I worry that in our case some users work quite sporadically and don't acquire experience very quickly, so that lifting the restrictions after a fixed time might not be that useful.

Another approach might be to trigger an EXPERT flag if some appropriate combination of percentage efficiency and absolute resource usage is reached.

Has anyone tried anything like this or some other approach?

Cheers,

Loris

--
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin
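[Editorial note: a hedged illustration of the "efficiency trigger" idea, not part of Loris's question. The sketch assumes the seff contrib script is installed; the user name and 7-day window are made up.]

#!/bin/bash
# Report CPU efficiency of one user's completed jobs from the last 7 days,
# as raw input for deciding whether to set or clear an EXPERT-style flag.
USER_TO_CHECK=someuser    # hypothetical user name
for jobid in $(sacct -u "$USER_TO_CHECK" -X -n -S now-7days -E now -s CD -o JobID); do
    # seff prints a "CPU Efficiency: ..." line per job
    seff "$jobid" | awk -v j="$jobid" '/CPU Efficiency/ {print j": "$0}'
done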
[slurm-users] hi-priority partition and preemption
Hi all,

i'm trying to have two overlapping partitions, say normal and hi-pri, so that when jobs are launched in the second one they can preempt the jobs already running in the first one, automatically putting them into the suspended state. After completion, the jobs in the normal partition must be automatically resumed.

Here are my (relevant) slurm.conf settings:

> PreemptMode=suspend,gang
> PreemptType=preempt/partition_prio
>
> PartitionName=normal Nodes=node0[01-08] MaxTime=1800 PriorityTier=100 AllowAccounts=group1,group2 OverSubscribe=FORCE:20 PreemptMode=suspend
> PartitionName=hi-pri Nodes=node0[01-08] MaxTime=360 PriorityTier=500 AllowAccounts=group2 OverSubscribe=FORCE:20 PreemptMode=off

But with this, jobs in the hi-pri partition are put into the PD state and the ones already running in the normal partition continue in their R status. What am I doing wrong? What am I missing?

Since I have jobs that must run at specific times and must have priority over all others, is this the correct way to do it?

Thanks

FR
Re: [slurm-users] hi-priority partition and preemption
Hi Fabrizio,

Fabrizio Roccato writes:

> Hi all,
> i'm trying to have two overlapping partitions, say normal and hi-pri,
> so that when jobs are launched in the second one they can preempt the jobs
> already running in the first one, automatically putting them in suspend
> state. After completion, the jobs in the normal partition must be
> automatically resumed.
>
> here are my (relevant) slurm.conf settings:
>
>> PreemptMode=suspend,gang
>> PreemptType=preempt/partition_prio
>>
>> PartitionName=normal Nodes=node0[01-08] MaxTime=1800 PriorityTier=100 AllowAccounts=group1,group2 OverSubscribe=FORCE:20 PreemptMode=suspend
>> PartitionName=hi-pri Nodes=node0[01-08] MaxTime=360 PriorityTier=500 AllowAccounts=group2 OverSubscribe=FORCE:20 PreemptMode=off
>
> But with this, jobs in the hi-pri partition are put into PD state and the ones
> already running in the normal partition continue in their R status.
> What am I doing wrong? What am I missing?

We don't do anything like this, so the following may be incorrect. However, my understanding is that even if two partitions include the same node, the node can only be in one partition at a time. So if a job requests the partition 'normal' and the job starts, that node is in the partition 'normal' only, and no job requesting 'hi-pri' can start on the node, because it is not a member of that partition.

We use QOS to set different priorities, but we don't use preemption.

> Since I have jobs that must run at specific times and must have priority over
> all others, is this the correct way to do it?

For this I would probably use a recurring reservation.

Cheers,

Loris

> Thanks
>
> FR

--
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin
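[Editorial note: a hedged illustration of the recurring-reservation suggestion; the reservation name, start time, duration, node count and user below are invented, not taken from this thread.]

# create a reservation that repeats daily, so the time-critical work always
# has nodes available at its scheduled time
scontrol create reservation ReservationName=daily_prio \
    StartTime=2023-06-01T02:00:00 Duration=01:00:00 \
    Flags=DAILY NodeCnt=2 Users=prio_user

# jobs then request it with: sbatch --reservation=daily_prio ...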
[slurm-users] slurmstepd error after upgrade to 23.02
Hi all,

we have recently upgraded Slurm to 23.02. Since then we have been getting the following errors in our logs, once every minute:

May 21 03:23:27 s-sc-gpu001 slurmstepd[2723991]: error: slurm_send_node_msg: hash_g_compute: REQUEST_STEP_COMPLETE has error
May 21 03:24:27 s-sc-gpu001 slurmstepd[2723991]: error: hash_g_compute: hash plugin with id:0 not exist or is not loaded

We are using Rocky 8; Slurm is installed from RPMs which we build using mock. I have noticed that there might be some package discrepancies between the mock chroot in which we build the RPMs and the versions on the nodes. At the moment this is my best guess as to what is going on.

Has anybody seen this issue before, and even better, do you know how to solve it?

Regards
magnus

--
Magnus Hagdorn
Charité – Universitätsmedizin Berlin
Geschäftsbereich IT | Scientific Computing
Campus Charité Virchow Klinikum
Forum 4 | Ebene 02 | Raum 2.020
Augustenburger Platz 1
13353 Berlin
magnus.hagd...@charite.de
https://www.charite.de
HPC Helpdesk: sc-hpc-helpd...@charite.de
Re: [slurm-users] hi-priority partition and preemption
What you are describing is definitely doable. We have our system set up similarly. All nodes are in the "open" partition and the "prio" partition, but a job submitted to the "prio" partition will preempt the open jobs.

I don't see anything clearly wrong with your slurm.conf settings. Ours are very similar, though we use only FORCE:1 for OverSubscribe. You might try that just to see if there's a difference.

What are the sbatch settings you are using when you submit the jobs? Do you have PreemptExemptTime set to anything in slurm.conf? What is the reason squeue gives for the high-priority jobs being pending?

For your "run regularly" goal, you might consider scrontab. Once priority and preemption are sorted out, that will start the job at a regular time.

Rob

From: slurm-users on behalf of Fabrizio Roccato
Sent: Wednesday, May 24, 2023 7:17 AM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] hi-priority partition and preemption

[...]
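[Editorial note: for reference, a hedged sketch of the kind of slurm.conf fragment this thread is about. Partition names, node names and limits are placeholders, not either site's actual settings.]

# slurm.conf (sketch)
PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG

# Low-priority partition: its jobs may be suspended by partitions with a higher PriorityTier
PartitionName=open Nodes=node[01-08] PriorityTier=100 OverSubscribe=FORCE:1 PreemptMode=SUSPEND Default=YES
# High-priority partition: not preempted itself
PartitionName=prio Nodes=node[01-08] PriorityTier=500 OverSubscribe=FORCE:1 PreemptMode=OFF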
[slurm-users] Usage gathering for GPUs
Hi,

The release notes for 23.02 say "Added usage gathering for gpu/nvml (Nvidia) and gpu/rsmi (AMD) plugins". How would I go about enabling this?

Thanks!

--
Ben Fulton
Research Applications and Deep Learning
Research Technologies
Indiana University
Re: [slurm-users] Usage gathering for GPUs
On 5/24/23 11:39 am, Fulton, Ben wrote:

> Hi,

Hi Ben,

> The release notes for 23.02 say "Added usage gathering for gpu/nvml
> (Nvidia) and gpu/rsmi (AMD) plugins". How would I go about enabling this?

I can only comment on the NVIDIA side (as those are the GPUs we have), but for that you need Slurm built with NVML support and running with "Autodetect=NVML" in gres.conf; that information is then stored in slurmdbd as part of the TRES usage data.

For example, to grab a job step for a test code I ran the other day:

csamuel@perlmutter:login01:~> sacct -j 9285567.0 -Pno TRESUsageInAve | tr , \\n | fgrep gpu
gres/gpumem=493120K
gres/gpuutil=76

Hope that helps!

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
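[Editorial note: a hedged sketch of the configuration pieces this usually involves, to complement Chris's notes. Node names and GPU counts are placeholders, and Slurm must have been built against the NVML library as he says; your site may need additional settings.]

# gres.conf (sketch)
AutoDetect=nvml

# slurm.conf (sketch)
GresTypes=gpu
NodeName=gpu[01-04] Gres=gpu:4 State=UNKNOWN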