Hi,
We see this on our cluster as well — we traced it to the fact that Python loads
shared library extensions using RTLD_LOCAL.
The Python module (mpi4py?) has a dependency on libmpi.so, which in turn has a
dependency on libhcoll.so. So the Python module is being loaded with
RTLD_LOCAL, anything tha
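A minimal sketch of the usual workaround, assuming the extension in question
really is mpi4py (the thread doesn't confirm this): promote the interpreter's
dlopen flags to RTLD_GLOBAL before the extension is imported, so that
libmpi.so's symbols are visible to libraries loaded later, such as libhcoll.so.

import os
import sys

# Sketch only: make subsequent extension loads use RTLD_GLOBAL so that
# symbols from libmpi.so can be resolved by libraries dlopen'd afterwards.
sys.setdlopenflags(os.RTLD_NOW | os.RTLD_GLOBAL)

from mpi4py import MPI  # must come after setdlopenflags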
Hi,
Consider a hybrid MPI + OpenMP code on a system with 2 x 8-core processors per
node, running with OMP_NUM_THREADS=4. A common placement policy we see is to
have rank 0 on the first 4 cores of the first socket, rank 1 on the second 4
cores, rank 2 on the first 4 cores of the second socket, an
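As a sanity check on whatever policy is chosen, here is a small sketch
(assuming Linux and mpi4py, neither of which is stated in the thread) that has
each rank report the cores it is actually bound to:

import os
from mpi4py import MPI

# Each rank prints its core binding and OpenMP thread count so the
# placement can be compared against the intended policy.
rank = MPI.COMM_WORLD.Get_rank()
cores = sorted(os.sched_getaffinity(0))  # Linux-only call
threads = os.environ.get("OMP_NUM_THREADS", "unset")
print("rank %3d: cores %s, OMP_NUM_THREADS=%s" % (rank, cores, threads))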
Hi,
A couple of our users have reported issues using UCX in OpenMPI 3.1.2. It’s
failing with this message:
[r1071:27563:0:27563] rc_verbs_iface.c:63 FATAL: send completion with error:
local protection error
The actual MPI calls provoking this are different between the two applications
— one
0x014f46e7 onetep() /short/z00/aab900/onetep/src/onetep.F90:277
23 0x0041465e main() ???:0
24 0x0001ed1d __libc_start_main() ???:0
25 0x00414569 _start() ???:0
===
> On 12 Jul 2018, at 1:36 pm, Ben Menadue wrote:
>
> Hi,
>
> Perha
Hi,
Perhaps related — we’re seeing this one with 3.1.1. I’ll see if I can get the
application run against our --enable-debug build.
Cheers,
Ben
[raijin7:1943 :0:1943] Caught signal 11 (Segmentation fault: address not mapped
to object at address 0x45)
/short/z00/bjm900/build/openmpi-mofed4.2/o
Hi All,
This looks very much like what I reported a couple of weeks ago with Rmpi and
doMPI — the trace looks the same. But as far as I could see, doMPI does
exactly what simple_spawn.c does — use MPI_Comm_spawn to create the workers and
then MPI_Comm_disconnect them when you call closeCluster
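For anyone who wants to reproduce the pattern outside R, here is a minimal
sketch of the same spawn-then-disconnect sequence using mpi4py rather than
Rmpi/doMPI (a stand-in for illustration, not the actual doMPI code):

import sys
from mpi4py import MPI

if len(sys.argv) > 1 and sys.argv[1] == "worker":
    # Spawned side: get the intercommunicator to the parent and disconnect.
    parent = MPI.Comm.Get_parent()
    parent.Disconnect()
else:
    # Parent side: spawn workers running this same script, then disconnect,
    # mirroring what closeCluster does via MPI_Comm_spawn/MPI_Comm_disconnect.
    inter = MPI.COMM_SELF.Spawn(sys.executable,
                                args=[__file__, "worker"],
                                maxprocs=4)
    inter.Disconnect()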
Hi Jeff, Konstantinos,
I think you might want MPI.C_DOUBLE_COMPLEX for your datatype, since
np.complex128 is double-precision complex. But I think it’s either ignoring
this and using the datatype of the object you’re sending, or mpi4py is handling the
conversion in the backend somewhere. You could act
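Something like this sketch is what I had in mind (the buffer size and tag are
made up): the explicit datatype on one side, and mpi4py inferring it from the
NumPy buffer on the other.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
buf = np.zeros(8, dtype=np.complex128)  # double-precision complex

if comm.Get_rank() == 0:
    buf[:] = np.arange(8) * (1.0 + 1.0j)
    # Explicit datatype matching np.complex128.
    comm.Send([buf, MPI.C_DOUBLE_COMPLEX], dest=1, tag=7)
elif comm.Get_rank() == 1:
    # mpi4py can also infer the MPI datatype from the array's dtype.
    comm.Recv(buf, source=0, tag=7)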
Hi,
I’m trying to debug a user’s program that uses dynamic process management
through Rmpi + doMPI. We’re seeing a hang in MPI_Comm_disconnect. Each of the
processes is in
#0 0x7ff72513168c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x7ff7130760d3 in PMIx_Disconnect
ameters in your
> /etc/openmpi-mca-params.conf and run like that.
>
> -Nathan
>
> On Apr 05, 2018, at 01:18 AM, Ben Menadue wrote:
>
>> Hi,
>>
>> Another interesting point. I noticed that the last two message sizes tested
>> (2MB and 4MB) are lower th
# Size    Bandwidth (MB/s)
2097152   11397.85
4194304   11389.64
This makes me think something odd is going on in the RDMA pipeline.
Cheers,
Ben
> On 5 Apr 2018, at 5:03 pm, Ben Menadue wrote:
>
> Hi,
>
> We’ve just been running some OSU benchmarks with OpenMPI 3.0.0 and
Hi,
We’ve just been running some OSU benchmarks with OpenMPI 3.0.0 and noticed that
osu_bibw gives nowhere near the bandwidth I’d expect (this is on FDR IB).
However, osu_bw is fine.
If I disable eager RDMA, then osu_bibw gives the expected numbers. Similarly,
if I increase the number of eager
Hi,
One of our users is having trouble scaling his code up to 3584 cores (i.e. 128
28-core nodes). It runs fine on 1792 cores (64 nodes), but fails with this at
3584:
--
A process failed to create a queue pair. This usually
Hi,
Sorry to reply to an old thread, but we’re seeing this message with 2.1.0 built
against CUDA 8.0. We're using libcuda.so.375.39. Has anyone had any luck
suppressing these messages?
Thanks,
Ben
> On 27 Mar 2017, at 7:13 pm, Roland Fehrenbacher wrote:
>
>> "SJ" == Sylvain Jeaugey wri
Hi,
> On 28 Mar 2017, at 2:00 am, r...@open-mpi.org wrote:
> I’m confused - mpi_yield_when_idle=1 is precisely the “oversubscribed”
> setting. So why would you expect different results?
Ahh — I didn’t realise it auto-detected this. I recall working on a system in
the past where I needed to expl
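For reference, the explicit way of forcing it that I had in mind is via Open
MPI's OMPI_MCA_* environment variables; a sketch below (using mpi4py purely
for illustration, the thread itself isn't about Python):

import os

# Open MPI reads MCA parameters from OMPI_MCA_<name> environment variables,
# so this forces yield-when-idle instead of relying on the auto-detection.
os.environ["OMPI_MCA_mpi_yield_when_idle"] = "1"

from mpi4py import MPI  # MPI_Init picks the parameter up here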
Hi,
> On 26 Mar 2017, at 1:13 am, Jeff Squyres (jsquyres)
> wrote:
> Here's an old post on this list where I cited a paper from the Intel
> Technology Journal.
Thanks for that link! I need to go through it in detail, but this paragraph did
jump out at me:
On a processor with Hyper-Threading
Hi Jeff,
> On 25 Mar 2017, at 10:31 am, Jeff Squyres (jsquyres)
> wrote:
>
> When you enable HT, a) there's 2 hardware threads active, and b) most of the
> resources in the core are effectively split in half and assigned to each
> hardware thread. When you disable HT, a) there's only 1 hardw
: set comm on declaration error, and
other questions
On Sunday, August 21, 2016, Ben Menadue <ben.mena...@nci.org.au> wrote:
Hi,
In Fortran, using uninitialised variables is undefined behaviour. In this case,
it’s being initialised to zero (either by the compiler or by vir
Hi,
In Fortran, using uninitialised variables is undefined behaviour. In this case,
it’s being initialised to zero (either by the compiler or by virtue of being in
untouched memory), and so equivalent to MPI_COMM_WORLD in OpenMPI. Other MPI
libraries don’t have MPI_COMM_WORLD .eq. 0 and so t
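A quick way to check this on any given MPI library is a sketch like the
following, using mpi4py's py2f() to show the Fortran handle:

from mpi4py import MPI

# Open MPI's Fortran handle for MPI_COMM_WORLD is 0, which is why the
# zero-initialised INTEGER communicator happens to work there; other MPI
# libraries use a different, non-zero handle.
print("MPI_COMM_WORLD Fortran handle:", MPI.COMM_WORLD.py2f())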
] MCW rank 3 bound to socket 1[core 6[hwt 1]]:
[../../../../../..][.B/../../../../..]
Cheers,
Gilles
On 8/16/2016 12:40 PM, Ben Menadue wrote:
> Hi,
>
> I'm trying to map by hwthread but only partially populating sockets. For
> example, I'm looking to create arrangements
Hi,
I'm trying to map by hwthread but only partially populating sockets. For
example, I'm looking to create arrangements like this:
Rank 0: [B./../../../../../../..][../../../../../../../..]
Rank 1: [.B/../../../../../../..][../../../../../../../..]
Rank 2: [../../../../../../../..][B./../../../.
Hi,
I'm investigating an issue with mpirun *sometimes* hanging after programs
call MPI_Abort... all of the MPI processes have terminated, however the
mpirun is still there. This happens with 1.8.8 and 1.10.2. There look to be
two threads, one in this path:
#0 0x7fa09c3143b3 in select () from
* if SMT is enabled, do count cores with at least one allowed hwthread
+ */
return;
}
data->npus = 1;
On 1/29/2016 11:43 AM, Ben Menadue wrote:
> Yes, I'm able to reproduce it on a single node as well.
>
> Act
> 13:04 bjm900@r60 ~ > /apps/openmpi/1.10.2/bin/mpirun hostname
> <...hostnames...>
>
>
> Cheers,
> Ben
>
>
>
> -----Original Message-----
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ben
> Menadue
> Sent: Friday, 29 January 2016 1:01
echo 0-31 > cpuset.cpus
13:03 bjm900@r60 ~ > cat /cgroup/cpuset/pbspro/4363542.r-man2/cpuset.cpus
0-31
13:04 bjm900@r60 ~ > /apps/openmpi/1.10.2/bin/mpirun hostname
<...hostnames...>
Cheers,
Ben
-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Be
Any changes to rmaps in 1.10.2?
Ben,
that is not needed if you submit with qsub -l nodes=1:ppn=2 do you observe
the same behavior without -np 2 ?
Cheers,
Gilles
On 1/28/2016 7:57 AM, Ben Menadue wrote:
> Hi,
>
> Were there any changes to rmaps in going to 1.10.2? An
> othe
rt of the problem. Is there an MCA parameter in your environment or default
param file, perhaps?
On Wed, Jan 27, 2016 at 2:57 PM, Ben Menadue <ben.mena...@nci.org.au> wrote:
Hi,
Were there any changes to rmaps in going to 1.10.2? An otherwise-identical
setup that worked in 1
Hi,
Were there any changes to rmaps in going to 1.10.2? An otherwise-identical
setup that worked in 1.10.0 fails to launch in 1.10.2, complaining that
there's no CPUs available in a socket...
With 1.10.0:
$ /apps/openmpi/1.10.0/bin/mpirun -np 2 -mca rmaps_base_verbose 1000
hostname
[r47:18709] m
n-mpi.org] On Behalf Of Mike Dubman
Sent: Thursday, 24 December 2015 7:14 AM
To: Open MPI Users
Subject: Re: [OMPI users] hcoll API in 1.10.1
Hi,
hcoll is part of MOFED or comes from HPCx.
what version of hcoll do you have on your system?
Thx
On Wed, Dec 23, 2015 at 4:58 AM, Ben Menadue
Hi,
It's probably in plain sight somewhere and I missed it, but is there a
minimum version of hcoll needed to build 1.10.1?
We have 2.0.0, which allows us to build 1.10.0, but 1.10.1 fails with
missing entries in the hcoll_collectives_t structure:
CC coll_hcoll_module.lo
../../../../../.
Hi PETSc and OpenMPI teams,
I'm running into a deadlock in PETSc 3.4.5 with OpenMPI 1.8.3:
1. PetscCommDestroy calls MPI_Attr_delete
2. MPI_Attr_delete acquires a lock
3. MPI_Attr_delete calls Petsc_DelComm_Outer (through a callback)
4. Petsc_DelComm_Outer calls MPI_Attr_get
5. MPI_Attr_get
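To make the locking pattern concrete, here is a minimal stand-in for the steps
above in plain Python (not the PETSc or Open MPI code, just a non-recursive
lock being re-acquired from inside a callback):

import threading

attr_lock = threading.Lock()  # stands in for Open MPI's attribute lock

def petsc_delcomm_outer():
    # Steps 3-5: the delete callback calls back into the attribute
    # interface (MPI_Attr_get), which needs the same lock again.
    with attr_lock:  # blocks forever: the lock is already held
        pass

def attr_delete():
    # Steps 1-2: MPI_Attr_delete takes the lock, then runs the callback.
    with attr_lock:
        petsc_delcomm_outer()

attr_delete()  # hangs here, mirroring the deadlock described above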