Re: [OMPI users] [OMPI devel] [LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma, basesmuma, ucx_p2p:basesmsocket, basesmuma, p2p

2022-11-07 Thread Ben Menadue via users
Hi, We see this on our cluster as well — we traced it to Python loading shared library extensions using RTLD_LOCAL. The Python module (mpi4py?) has a dependency on libmpi.so, which in turn has a dependency on libhcoll.so. So the Python module is being loaded with RTLD_LOCAL, and anything tha…
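A minimal sketch of the usual workaround on the Python side (assuming CPython on Linux and that mpi4py is the extension involved): switch the interpreter's dlopen flags to RTLD_GLOBAL before the extension is imported, so the symbols in libmpi.so are visible to libhcoll.so and its plugins.

    import os, sys
    # Merge RTLD_GLOBAL into the flags CPython uses when dlopen()ing extension
    # modules, so mpi4py (and the libmpi.so it pulls in) exports its symbols
    # globally instead of being loaded RTLD_LOCAL.
    sys.setdlopenflags(sys.getdlopenflags() | os.RTLD_GLOBAL)
    from mpi4py import MPI  # import only after changing the flags

Preloading libmpi.so into the process (e.g. via LD_PRELOAD) achieves the same effect without touching the Python code.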

[OMPI users] Mapping and Ranking in 3.1.3

2018-11-06 Thread Ben Menadue
Hi, Consider a hybrid MPI + OpenMP code on a system with 2 x 8-core processors per node, running with OMP_NUM_THREADS=4. A common placement policy we see is to have rank 0 on the first 4 cores of the first socket, rank 1 on the second 4 cores, rank 2 on the first 4 cores of the second socket, an…
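For reference, one way to request that kind of placement with mpirun (a sketch; the exact set of --map-by modifiers varies a little between releases) is a ppr mapping whose PE width matches OMP_NUM_THREADS:

    # 2 sockets x 8 cores, OMP_NUM_THREADS=4: 2 ranks per socket, 4 cores each
    mpirun --map-by ppr:2:socket:PE=4 --report-bindings ./a.out

The numbering of the ranks across those slots is then controlled separately with --rank-by (e.g. --rank-by core versus --rank-by socket).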

[OMPI users] OpenMPI 3.1.2: Run-time failure in UCX PML

2018-09-20 Thread Ben Menadue
Hi, A couple of our users have reported issues using UCX in OpenMPI 3.1.2. It’s failing with this message: [r1071:27563:0:27563] rc_verbs_iface.c:63 FATAL: send completion with error: local protection error The actual MPI calls provoking this are different between the two applications — one
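As an isolation step (a sketch, not a fix; component names as of the 3.1 series), the PML can be pinned explicitly so that runs with and without UCX can be compared:

    # force the UCX PML (fails early if UCX cannot be used at all)
    mpirun --mca pml ucx ./app
    # or take UCX out of the picture and fall back to ob1 over the openib BTL
    mpirun --mca pml ob1 --mca btl openib,vader,self ./app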

Re: [OMPI users] Seg fault in opal_progress

2018-07-11 Thread Ben Menadue
0x014f46e7 onetep() /short/z00/aab900/onetep/src/onetep.F90:277 23 0x0041465e main() ???:0 24 0x0001ed1d __libc_start_main() ???:0 25 0x00414569 _start() ???:0 === > On 12 Jul 2018, at 1:36 pm, Ben Menadue wrote: > > Hi, > > Perha

Re: [OMPI users] Seg fault in opal_progress

2018-07-11 Thread Ben Menadue
Hi, Perhaps related — we’re seeing this one with 3.1.1. I’ll see if I can get the application run against our --enable-debug build. Cheers, Ben [raijin7:1943 :0:1943] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x45) /short/z00/bjm900/build/openmpi-mofed4.2/o

Re: [OMPI users] A hang in Rmpi at PMIx_Disconnect

2018-06-04 Thread Ben Menadue
Hi All, This looks very much like what I reported a couple of weeks ago with Rmpi and doMPI — the trace looks the same. But as far as I could see, doMPI does exactly what simple_spawn.c does — use MPI_Comm_spawn to create the workers and then MPI_Comm_disconnect them when you call closeCluster
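For anyone wanting a reproducer outside R, a rough mpi4py analogue of that spawn/disconnect pattern (hypothetical file names, same shape as simple_spawn.c) looks like:

    # parent.py: spawn two workers over an intercommunicator, then disconnect
    import sys
    from mpi4py import MPI

    workers = MPI.COMM_SELF.Spawn(sys.executable, args=['child.py'], maxprocs=2)
    workers.Barrier()      # any traffic across the intercommunicator
    workers.Disconnect()   # the step where the Rmpi/doMPI runs hang

    # child.py: connect back to the parent, then disconnect
    from mpi4py import MPI

    parent = MPI.Comm.Get_parent()
    parent.Barrier()
    parent.Disconnect()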

Re: [OMPI users] Python code inconsistency on complex multiplication in MPI (MPI4py)

2018-05-22 Thread Ben Menadue
Hi Jeff, Konstantinos, I think you might want MPI.C_DOUBLE_COMPLEX for your datatype, since np.complex128 is a double-precision complex type. But I think it’s either ignoring this and using the datatype of the object you’re sending, or MPI4py is handling the conversion in the backend somewhere. You could act…
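A short sketch of both forms, assuming a NumPy complex128 send buffer (mpi4py infers the MPI datatype from the array in the first call; the second spells it out):

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    buf = np.full(4, 1 + 2j, dtype=np.complex128)

    if comm.Get_rank() == 0:
        comm.Send(buf, dest=1)                          # datatype inferred from the buffer
        comm.Send([buf, MPI.C_DOUBLE_COMPLEX], dest=1)  # datatype given explicitly
    elif comm.Get_rank() == 1:
        out = np.empty(4, dtype=np.complex128)
        comm.Recv(out, source=0)
        comm.Recv([out, MPI.C_DOUBLE_COMPLEX], source=0)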

[OMPI users] 3.x - hang in MPI_Comm_disconnect

2018-05-16 Thread Ben Menadue
Hi, I’m trying to debug a user’s program that uses dynamic process management through Rmpi + doMPI. We’re seeing a hang in MPI_Comm_disconnect. Each of the processes is in #0 0x7ff72513168c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x7ff7130760d3 in PMIx_Disconnect…

Re: [OMPI users] Eager RDMA causing slow osu_bibw with 3.0.0

2018-04-05 Thread Ben Menadue
…parameters in your > /etc/openmpi-mca-params.conf and run like that. > > -Nathan > > On Apr 05, 2018, at 01:18 AM, Ben Menadue wrote: > >> Hi, >> >> Another interesting point. I noticed that the last two message sizes tested >> (2MB and 4MB) are lower th…

Re: [OMPI users] Eager RDMA causing slow osu_bibw with 3.0.0

2018-04-05 Thread Ben Menadue
…(MB/s): 2097152 bytes: 11397.85, 4194304 bytes: 11389.64. This makes me think something odd is going on in the RDMA pipeline. Cheers, Ben > On 5 Apr 2018, at 5:03 pm, Ben Menadue wrote: > > Hi, > > We’ve just been running some OSU benchmarks with OpenMPI 3.0.0 and…

[OMPI users] Eager RDMA causing slow osu_bibw with 3.0.0

2018-04-05 Thread Ben Menadue
Hi, We’ve just been running some OSU benchmarks with OpenMPI 3.0.0 and noticed that osu_bibw gives nowhere near the bandwidth I’d expect (this is on FDR IB). However, osu_bw is fine. If I disable eager RDMA, then osu_bibw gives the expected numbers. Similarly, if I increase the number of eager
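For reference, the two workarounds mentioned correspond to openib BTL MCA parameters (names as in the 3.0.x openib BTL; the buffer count below is just an illustrative value):

    # disable eager RDMA entirely
    mpirun --mca btl_openib_use_eager_rdma 0 ./osu_bibw
    # or give each peer more eager RDMA buffers than the default
    mpirun --mca btl_openib_eager_rdma_num 32 ./osu_bibw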

[OMPI users] Exhausting QPs?

2018-03-13 Thread Ben Menadue
Hi, One of our users is having trouble scaling his code up to 3584 cores (i.e. 128 28-core nodes). It runs fine on 1792 cores (64 nodes), but fails with this at 3584: -- A process failed to create a queue pair. This usually
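If the HCA itself is running out of QPs, one commonly checked knob (assuming ConnectX-3 / mlx4 hardware; the value is illustrative) is the mlx4_core module's QP limit, which is set at driver load time:

    # on each compute node: log2 of the number of QPs the HCA may allocate
    echo "options mlx4_core log_num_qp=20" > /etc/modprobe.d/mlx4_core.conf
    # takes effect after the driver is reloaded (or the node rebooted)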

Re: [OMPI users] Suppressing Nvidia warnings

2017-05-04 Thread Ben Menadue
Hi, Sorry to reply to an old thread, but we’re seeing this message with 2.1.0 built against CUDA 8.0. We're using libcuda.so.375.39. Has anyone had any luck suppressing these messages? Thanks, Ben > On 27 Mar 2017, at 7:13 pm, Roland Fehrenbacher wrote: > >> "SJ" == Sylvain Jeaugey wri

Re: [OMPI users] Performance degradation of OpenMPI 1.10.2 when oversubscribed?

2017-03-27 Thread Ben Menadue
Hi, > On 28 Mar 2017, at 2:00 am, r...@open-mpi.org wrote: > I’m confused - mpi_yield_when_idle=1 is precisely the “oversubscribed” > setting. So why would you expect different results? Ahh — I didn’t realise it auto-detected this. I recall working on a system in the past where I needed to expl
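For completeness, the explicit form is just an MCA parameter on the command line (redundant on releases that detect oversubscription automatically, as noted above):

    mpirun --mca mpi_yield_when_idle 1 -np 32 ./app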

Re: [OMPI users] Performance degradation of OpenMPI 1.10.2 when oversubscribed?

2017-03-26 Thread Ben Menadue
Hi, > On 26 Mar 2017, at 1:13 am, Jeff Squyres (jsquyres) > wrote: > Here's an old post on this list where I cited a paper from the Intel > Technology Journal. Thanks for that link! I need to go through it in detail, but this paragraph did jump out at me: On a processor with Hyper-Threading

Re: [OMPI users] Performance degradation of OpenMPI 1.10.2 when oversubscribed?

2017-03-25 Thread Ben Menadue
Hi Jeff, > On 25 Mar 2017, at 10:31 am, Jeff Squyres (jsquyres) > wrote: > > When you enable HT, a) there's 2 hardware threads active, and b) most of the > resources in the core are effectively split in half and assigned to each > hardware thread. When you disable HT, a) there's only 1 hardw

Re: [OMPI users] mpi_f08 Question: set comm on declaration error, and other questions

2016-08-21 Thread Ben Menadue
… set comm on declaration error, and other questions On Sunday, August 21, 2016, Ben Menadue <ben.mena...@nci.org.au> wrote: Hi, In Fortran, using uninitialised variables is undefined behaviour. In this case, it’s being initialised to zero (either by the compiler or by vir…

Re: [OMPI users] mpi_f08 Question: set comm on declaration error, and other questions

2016-08-21 Thread Ben Menadue
Hi, In Fortran, using uninitialised variables is undefined behaviour. In this case, it’s being initialised to zero (either by the compiler or by virtue of being in untouched memory), and so equivalent to MPI_COMM_WORLD in OpenMPI. Other MPI libraries don’t have MPI_COMM_WORLD .eq. 0 and so t

Re: [OMPI users] Mapping by hwthreads without fully populating sockets

2016-08-16 Thread Ben Menadue
] MCW rank 3 bound to socket 1[core 6[hwt 1]]: [../../../../../..][.B/../../../../..] Cheers, Gilles On 8/16/2016 12:40 PM, Ben Menadue wrote: > Hi, > > I'm trying to map by hwthread but only partially populating sockets. For > example, I'm looking to create arrangements

[OMPI users] Mapping by hwthreads without fully populating sockets

2016-08-15 Thread Ben Menadue
Hi, I'm trying to map by hwthread but only partially populating sockets. For example, I'm looking to create arrangements like this: Rank 0: [B./../../../../../../..][../../../../../../../..] Rank 1: [.B/../../../../../../..][../../../../../../../..] Rank 2: [../../../../../../../..][B./../../../.

[OMPI users] mpirun hanging after MPI_Abort

2016-02-18 Thread Ben Menadue
Hi, I'm investigating an issue with mpirun *sometimes* hanging after programs call MPI_Abort... all of the MPI processes have terminated, however the mpirun is still there. This happens with 1.8.8 and 1.10.2. There look to be two threads, one in this path: #0 0x7fa09c3143b3 in select () from
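A minimal reproducer sketch for this kind of hang (any MPI hello-world that calls MPI_Abort will do; shown with mpi4py purely for brevity):

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    if comm.Get_rank() == 0:
        comm.Abort(1)    # rank 0 asks the runtime to tear the whole job down
    else:
        comm.Barrier()   # the other ranks just wait to be killed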

Re: [OMPI users] Any changes to rmaps in 1.10.2?

2016-01-28 Thread Ben Menadue
* if SMT is enabled, do count cores with at least one allowed hwthread + */ return; } data->npus = 1; On 1/29/2016 11:43 AM, Ben Menadue wrote: > Yes, I'm able to reproduce it on a single node as well. > > Act

Re: [OMPI users] Any changes to rmaps in 1.10.2?

2016-01-28 Thread Ben Menadue
; 13:04 bjm900@r60 ~ > /apps/openmpi/1.10.2/bin/mpirun hostname > <...hostnames...> > > > Cheers, > Ben > > > > -Original Message- > From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ben > Menadue > Sent: Friday, 29 January 2016 1:01

Re: [OMPI users] Any changes to rmaps in 1.10.2?

2016-01-28 Thread Ben Menadue
echo 0-31 > cpuset.cpus 13:03 bjm900@r60 ~ > cat /cgroup/cpuset/pbspro/4363542.r-man2/cpuset.cpus 0-31 13:04 bjm900@r60 ~ > /apps/openmpi/1.10.2/bin/mpirun hostname <...hostnames...> Cheers, Ben -Original Message- From: users [mailto:users-boun...@open-mpi.org] On Be

Re: [OMPI users] Any changes to rmaps in 1.10.2?

2016-01-28 Thread Ben Menadue
…Any changes to rmaps in 1.10.2? Ben, that is not needed if you submit with qsub -l nodes=1:ppn=2. Do you observe the same behavior without -np 2? Cheers, Gilles On 1/28/2016 7:57 AM, Ben Menadue wrote: > Hi, > > Were there any changes to rmaps in going to 1.10.2? An > othe…

Re: [OMPI users] Any changes to rmaps in 1.10.2?

2016-01-28 Thread Ben Menadue
…rt of the problem. Is there an MCA parameter in your environment or default param file, perhaps? On Wed, Jan 27, 2016 at 2:57 PM, Ben Menadue <ben.mena...@nci.org.au> wrote: Hi, Were there any changes to rmaps in going to 1.10.2? An otherwise-identical setup that worked in 1…

[OMPI users] Any changes to rmaps in 1.10.2?

2016-01-27 Thread Ben Menadue
Hi, Were there any changes to rmaps in going to 1.10.2? An otherwise-identical setup that worked in 1.10.0 fails to launch in 1.10.2, complaining that there's no CPUs available in a socket... With 1.10.0: $ /apps/openmpi/1.10.0/bin/mpirun -np 2 -mca rmaps_base_verbose 1000 hostname [r47:18709] m

Re: [OMPI users] hcoll API in 1.10.1

2015-12-23 Thread Ben Menadue
…On Behalf Of Mike Dubman Sent: Thursday, 24 December 2015 7:14 AM To: Open MPI Users Subject: Re: [OMPI users] hcoll API in 1.10.1 Hi, hcoll is part of MOFED or comes from HPCx. What version of hcoll do you have on your system? Thx On Wed, Dec 23, 2015 at 4:58 AM, Ben Menadue ma…

[OMPI users] hcoll API in 1.10.1

2015-12-22 Thread Ben Menadue
Hi, It's probably in plain sight somewhere and I missed it, but is there a minimum version of hcoll needed to build 1.10.1? We have 2.0.0, which allows us to build 1.10.0, but 1.10.1 fails with missing entries in the hcoll_collectives_t structure: CC coll_hcoll_module.lo ../../../../../.

[OMPI users] Deadlock in OpenMPI 1.8.3 and PETSc 3.4.5

2014-12-17 Thread Ben Menadue
Hi PETSc and OpenMPI teams, I'm running into a deadlock in PETSc 3.4.5 with OpenMPI 1.8.3:
1. PetscCommDestroy calls MPI_Attr_delete
2. MPI_Attr_delete acquires a lock
3. MPI_Attr_delete calls Petsc_DelComm_Outer (through a callback)
4. Petsc_DelComm_Outer calls MPI_Attr_get
5. MPI_Attr_get…