Re: [slurm-users] Question about PMIX ERROR messages being emitted by some child of srun process

2023-05-22 Thread Tommi Tervo
> So I’m testing the use of Open MPI 5.0.0 pre-release with the Slurm/PMIx setup
> currently on NERSC Perlmutter system.
> 
> The SLURM version on Perlmutter is currently 23.02.2
> 
> The PMIx version that the admins used to build slurm against is pmix-4.2.3.
> I’ve attached the output of  pmix_info.
> 
> My test application “works” but if I use srun, I get these types of messages:
> 
> srun -n 2 -N 2 --mpi=pmix ./ring_c
> 
> [cn316:2770176] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750

Hi,

23.02.2 contains a PMIx permissions regression; it may be worth checking
whether that is the case here.

https://bugs.schedmd.com/show_bug.cgi?id=16687

commit 1f9386909230cd73506d88f02f75126924d3f41e
Author: Danny Auble 
Date:   Mon May 15 18:35:25 2023 +0200

mpi/pmix - fix PMIx shmem backed files permissions regression.

Introduced in 23.02.2 commit d23cad68df.

Bug 16687


BR,
Tommi

Re: [slurm-users] Question about PMIX ERROR messages being emitted by some child of srun process

2023-05-22 Thread Christopher Samuel

Hi Tommi, Howard,

On 5/22/23 12:16 am, Tommi Tervo wrote:

> 23.02.2 contains a PMIx permissions regression; it may be worth checking
> whether that is the case here.

I confirmed that, on a test system, I could replicate the
UNPACK-INADEQUATE-SPACE messages Howard is seeing, so I tried that patch
on the same system, but it made no difference. :-(


Looking at the PMIx code base, the messages appear to come from PMIx
itself (the triggers are in src/mca/bfrops/). I saw that I could set
PMIX_DEBUG=verbose to get more information on the problem, but when I
set it the messages go away entirely. :-/


Very odd.
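
One way to compare runs with and without the debug variable is to tally
the PMIX ERROR codes in the captured output. The following is only a
rough sketch: count_pmix_errors is a hypothetical helper (not part of
Slurm or PMIx), and it assumes the messages keep the
"PMIX ERROR: <CODE> in file ..." shape shown earlier in the thread.

```shell
#!/bin/sh
# Hypothetical helper, not part of Slurm or PMIx: read job output on stdin
# and print one "<count> <error-code>" line per distinct PMIX ERROR code.
count_pmix_errors() {
    # Extract "PMIX ERROR: <CODE>", drop the prefix, then count duplicates.
    grep -o 'PMIX ERROR: [A-Z-]*' \
        | sed 's/^PMIX ERROR: //' \
        | sort \
        | uniq -c \
        | awk '{print $1, $2}'
}

# Possible usage (commented out, since it needs a Slurm allocation).
# The srun line and PMIX_DEBUG=verbose are taken from this thread:
#   srun -n 2 -N 2 --mpi=pmix ./ring_c 2>&1 | count_pmix_errors
#   PMIX_DEBUG=verbose srun -n 2 -N 2 --mpi=pmix ./ring_c 2>&1 | count_pmix_errors
```

An empty result for the second run would match the observation that the
messages vanish once the variable is set.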

--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA