Hi,
Looking at the *ompi/mca/coll/sm/coll_sm_module.c* it seems this module
will be used only if the calling communicator solely groups processes
within a node. I've got two questions here.
1. So is my understanding correct that for something like MPI_COMM_WORLD, where the world spans processes on multiple nodes, this module will not be used?
1) is correct. coll/sm is disqualified if the communicator is an inter-communicator or the communicator spans several nodes.
You can have a look at the source code, and you will note that bcast does not use send/recv. Instead, it uses shared memory, so hopefully it is faster than other modules.
Thank you, Gilles.
Which bcast should I look for? In general, how do I know which module was used for which communication - can I print this info?
On Jun 30, 2016 3:19 AM, "Gilles Gouaillardet" wrote:
> 1) is correct. coll/sm is disqualified if the communicator is an inter-communicator
> or the communicator spans several nodes.
The Bcast in coll/sm.
coll modules have priorities (see ompi_info --all).
For a given function (e.g. bcast), the module which implements it and has the highest priority is used.
Note that a module can disqualify itself on a given communicator (e.g. coll/sm on an inter-node communicator).
By default, coll/tuned is used for most collectives.
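As a concrete illustration (the parameter names below are what I'd expect on a stock build - double-check them with ompi_info on your install), you can list the priorities and override them on the command line:

    ompi_info --all | grep 'coll_.*_priority'
    mpirun --mca coll_sm_priority 100 -np 4 ./a.out

The second line is only a sketch: bumping coll_sm_priority above coll/tuned's priority means coll/sm gets selected whenever it does not disqualify itself (i.e. on single-node intra-communicators).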
OK, I am beginning to see how it works now. One question I still have is: in the case of a multi-node communicator, it seems coll/tuned (or something other than coll/sm) will be the one used, so does it do any optimizations to reduce communication within a node?
Also, where can I find the p2p send/recv module?
Currently, coll/tuned is not topology aware.
This is something interesting, and everyone is invited to contribute.
coll/ml is topology aware, but it is kind of unmaintained now.
send/recv involves two abstraction layers:
pml, and then the interconnect transport.
Typically, pml/ob1 is used, and it uses the btl components for the actual transport.
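For reference, here is a minimal sketch (not the Open MPI internals, and assuming the root is global rank 0) of what a topology-aware broadcast looks like when done by hand with MPI_Comm_split_type: one inter-node step among per-node leaders, then one shared-memory step inside each node.

    /* Hypothetical sketch only -- not taken from Open MPI. */
    #include <mpi.h>

    void node_aware_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
    {
        MPI_Comm node_comm, leader_comm;
        int rank, node_rank;

        MPI_Comm_rank(comm, &rank);

        /* Group the processes that share a node (MPI-3). */
        MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank,
                            MPI_INFO_NULL, &node_comm);
        MPI_Comm_rank(node_comm, &node_rank);

        /* The first rank on each node becomes a "leader"; global rank 0
         * is always a leader and ends up as rank 0 of leader_comm. */
        MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, rank,
                       &leader_comm);

        /* Step 1: broadcast across nodes, leaders only (network traffic). */
        if (leader_comm != MPI_COMM_NULL) {
            MPI_Bcast(buf, count, type, 0, leader_comm);
            MPI_Comm_free(&leader_comm);
        }

        /* Step 2: broadcast inside each node (shared-memory traffic). */
        MPI_Bcast(buf, count, type, 0, node_comm);
        MPI_Comm_free(&node_comm);
    }

A topology-aware coll component would do essentially this internally, which is why it can cut the number of messages that cross the network when many procs share a node.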
Thank you, Gilles. The reason for digging into intra-node optimizations is that we've implemented several machine learning applications in Open MPI (Java bindings), but found collective communication to be a bottleneck, especially when the number of procs per node is high. I've implemented a shared m
You might want to give coll/ml a try:
mpirun --mca coll_ml_priority 100 ...
Cheers,
Gilles
On Thursday, June 30, 2016, Saliya Ekanayake wrote:
> Thank you, Gilles. The reason for digging into intra-node optimizations is
> that we've implemented several machine learning applications in OpenMPI
>
OK, that's good. I'll try that.
So, is *ml* no longer being actively developed? Is there any documentation on this component?
Thank you,
Saliya
On Thu, Jun 30, 2016 at 11:01 AM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:
> you might want to give coll/ml a try
> mpirun --mca coll_ml_priority 100 ...
I'm seeing hangs when MPI_Abort is called. This is with Open MPI 1.10.3, e.g.:
program output:
Testing -- big dataset test (bigdset)
Proc 3: *** Parallel ERROR ***
VRFY (sizeof(MPI_Offset)>4) failed at line 479 in ../../testpar/t_mdset.c
aborting MPI processes
Testing -- big dataset test (
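For what it's worth, the pattern that hangs can be boiled down to one rank aborting while the others wait; a trivial stand-in (hypothetical, not the actual t_mdset.c test) would be:

    /* Minimal reproducer sketch -- not the HDF5 test itself. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 3) {
            fprintf(stderr, "Proc %d: *** Parallel ERROR ***\n", rank);
            MPI_Abort(MPI_COMM_WORLD, 1);   /* should bring every rank down */
        }

        /* The other ranks block here; if the abort is not propagated,
         * this is where the job appears to hang. */
        MPI_Barrier(MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }

With a working abort, mpiexec should kill all ranks and exit non-zero instead of sitting there.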
On 06/30/2016 09:49 AM, Orion Poplawski wrote:
> I'm seeing hangs when MPI_Abort is called. This is with openmpi 1.10.3. e.g:
I'll also note that I'm seeing this on 32-bit arm, but not i686 or x86_64.
--
Orion Poplawski
Technical Manager 303-415-9701 x222
NWRA, Boulder/CoR
Are the procs still alive? Is this on a single node?
> On Jun 30, 2016, at 8:49 AM, Orion Poplawski wrote:
>
> I'm seeing hangs when MPI_Abort is called. This is with openmpi 1.10.3. e.g:
>
> program output:
>
> Testing -- big dataset test (bigdset)
> Proc 3: *** Parallel ERROR ***
> VRFY (sizeof(MPI_Offset)>4) failed at line 479 in ../../testpar/t_mdset.c
I actually wouldn't advise ml. It *was* being developed as a joint project
between ORNL and Mellanox. I think that code eventually grew into what the
"hcoll" Mellanox library currently is.
As such, ml reflects kind of a middle point before hcoll became hardened into a
real product. It has so
No, just mpiexec is running. Single node. I only see it when the test is
executed with "make check"; I don't see it if I just run mpiexec -n 6
./testphdf5 by hand.
On 06/30/2016 09:58 AM, Ralph Castain wrote:
> Are the procs still alive? Is this on a single node?
>
>> On Jun 30, 2016, at 8:49 AM,
On 06/30/2016 10:33 AM, Orion Poplawski wrote:
> No, just mpiexec is running. single node. Only see it when the test is
> executed with "make check", not seeing it if I just run mpiexec -n 6
> ./testphdf5 by hand.
Hmm, now I'm seeing it running mpiexec by hand. Trying to check it via gdb
indic
So the application procs are all gone, but mpiexec isn’t exiting? I’d suggest
running valgrind, given the corruption.
> On Jun 30, 2016, at 10:21 AM, Orion Poplawski wrote:
>
> On 06/30/2016 10:33 AM, Orion Poplawski wrote:
>> No, just mpiexec is running. single node. Only see it when the tes
valgrind output:
$ valgrind mpiexec -n 6 ./testphdf5
==8518== Memcheck, a memory error detector
==8518== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==8518== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==8518== Command: mpiexec -n 6 ./testphdf5
==8518==
=
On 06/30/2016 02:55 PM, Orion Poplawski wrote:
> valgrind output:
>
> $ valgrind mpiexec -n 6 ./testphdf5
> ==8518== Memcheck, a memory error detector
> ==8518== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
> ==8518== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright
Rats - and this only happens on arm32?
> On Jun 30, 2016, at 1:56 PM, Orion Poplawski wrote:
>
> On 06/30/2016 02:55 PM, Orion Poplawski wrote:
>> valgrind output:
>>
>> $ valgrind mpiexec -n 6 ./testphdf5
>> ==8518== Memcheck, a memory error detector
>> ==8518== Copyright (C) 2002-2015, and GN