Re: [OMPI users] Segmentation Fault (Core Dumped) on mpif90 -v

2016-05-06 Thread Dave Love
Gus Correa  writes:

> Hi Giacomo
>
> Some programs fail with segmentation fault
> because the stack size is too small.

Yes, the default for Intel Fortran is to allocate large-ish amounts on
the stack, which may matter when the compiled program runs.

However, look at the backtrace.  It's apparently coming from the loader,
so something is pretty screwed up, though I can't guess what.  It would
help to have debugging symbols; always use at least -g and have
GNU/Linux distribution debuginfo packages to hand.

[Probably not relevant in this case, but I try to solve problems with
the Intel compiler and MPI (sorry Jeff et al) by persuading users to
avoid them.  GCC is more reliable in my experience, and the story about
its supposedly poor code generation isn't supported by experiment (if
that counts for anything these days).]
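
For reference, a minimal C sketch of the equivalent of "ulimit -s unlimited"
done from inside a program with setrlimit (an illustration only, not code
from this thread; in practice the limit is normally raised in the shell or
job script before the MPI job starts, and raising the hard limit needs
privileges):

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl;

        /* Query the stack limit currently in effect for this process. */
        if (getrlimit(RLIMIT_STACK, &rl) != 0) {
            perror("getrlimit");
            return 1;
        }
        printf("stack soft limit: %llu, hard limit: %llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);

        /* Raise the soft limit as far as the hard limit allows. */
        rl.rlim_cur = rl.rlim_max;
        if (setrlimit(RLIMIT_STACK, &rl) != 0)
            perror("setrlimit");

        return 0;
    }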

> [But others because of bugs in memory allocation/management, etc.]
>
> Have you tried
>
> ulimit -s unlimited
>
> before you run the program?
>
> Are you using a single machine or a cluster?
> If you're using infiniband you may need also to make the locked memory
> unlimited:
>
> ulimit -l unlimited
>
> I hope this helps,
> Gus Correa
>
> On 05/05/2016 05:15 AM, Giacomo Rossi wrote:
>>   gdb /opt/openmpi/1.10.2/intel/16.0.3/bin/mpif90
>> GNU gdb (GDB) 7.11
>> Copyright (C) 2016 Free Software Foundation, Inc.
>> License GPLv3+: GNU GPL version 3 or later
>> 
>> This is free software: you are free to change and redistribute it.
>> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
>> and "show warranty" for details.
>> This GDB was configured as "x86_64-pc-linux-gnu".
>> Type "show configuration" for configuration details.
>> For bug reporting instructions, please see:
>> .
>> Find the GDB manual and other documentation resources online at:
>> .
>> For help, type "help".
>> Type "apropos word" to search for commands related to "word"...
>> Reading symbols from /opt/openmpi/1.10.2/intel/16.0.3/bin/mpif90...(no
>> debugging symbols found)...done.
>> (gdb) r -v
>> Starting program: /opt/openmpi/1.10.2/intel/16.0.3/bin/mpif90 -v
>>
>> Program received signal SIGSEGV, Segmentation fault.
>> 0x76858f38 in ?? ()
>> (gdb) bt
>> #0  0x76858f38 in ?? ()
>> #1  0x77de5828 in _dl_relocate_object () from
>> /lib64/ld-linux-x86-64.so.2
>> #2  0x77ddcfa3 in dl_main () from /lib64/ld-linux-x86-64.so.2
>> #3  0x77df029c in _dl_sysdep_start () from
>> /lib64/ld-linux-x86-64.so.2
>> #4  0x774a in _dl_start () from /lib64/ld-linux-x86-64.so.2
>> #5  0x77dd9d98 in _start () from /lib64/ld-linux-x86-64.so.2
>> #6  0x0002 in ?? ()
>> #7  0x7fffaa8a in ?? ()
>> #8  0x7fffaab6 in ?? ()
>> #9  0x in ?? ()
>>
>> Giacomo Rossi Ph.D., Space Engineer
>>
>> Research Fellow at Dept. of Mechanical and Aerospace Engineering,
>> "Sapienza" University of Rome
>> p: (+39) 0692927207 | m: (+39) 3408816643 | e:
>> giacom...@gmail.com
>> 
>> Member of Fortran-FOSS-programmers
>> 


Re: [OMPI users] Problems using 1.10.2 with MOFED 3.1-1.1.0.1

2016-05-06 Thread Joshua Ladd
They had a port configured for Ethernet and did not exclude it. OpenIB
emits a warning about not finding a suitable CPC.

Josh
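
For the archives: the Ethernet-configured port can also be excluded from the
openib BTL explicitly, along these lines (device and port names are taken
from the warning quoted below; check ompi_info for the parameters available
in your version):

    mpirun --mca btl_openib_if_exclude mlx4_0:2 ...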

On Thu, May 5, 2016 at 9:06 PM, Andy Riebs  wrote:

> Sorry, my output listing was incomplete -- the program did run after the
> "No OpenFabrics" message, but (I presume) ran over Ethernet rather than
> InfiniBand. So I can't really say what was causing it to fail.
>
> Andy
>
>
> On 05/05/2016 06:09 PM, Nathan Hjelm wrote:
>
> It should work fine with ob1 (the default). Did you determine what was
> causing it to fail?
>
> -Nathan
>
> On Thu, May 05, 2016 at 06:04:55PM -0400, Andy Riebs wrote:
>
>For anyone like me who happens to google this in the future, the solution
>was to set OMPI_MCA_pml=yalla
>
>Many thanks Josh!
>
>On 05/05/2016 12:52 PM, Joshua Ladd wrote:
>
>  We are working with Andy offline.
>
>  Josh
>  On Thu, May 5, 2016 at 7:32 AM, Andy Riebs  
>  wrote:
>
>I've built 1.10.2 with all my favorite configuration options, but I
>get messages such as this (one for each rank with
>orte_base_help_aggregate=0) when I try to run on a MOFED system:
>
>$ shmemrun -H hades02,hades03 $PWD/shmem.out
>
> --
>No OpenFabrics connection schemes reported that they were able to be
>used on a specific port.  As such, the openib BTL (OpenFabrics
>support) will be disabled for this port.
>
>  Local host:   hades03
>  Local device: mlx4_0
>  Local port:   2
>  CPCs attempted:   rdmacm, udcm
>
> --
>
>My configure options:
>config_opts="--prefix=${INSTALL_DIR} \
>--without-mpi-param-check \
>--with-knem=/opt/mellanox/hpcx/knem \
>--with-mxm=/opt/mellanox/mxm  \
>--with-mxm-libdir=/opt/mellanox/mxm/lib \
>--with-fca=/opt/mellanox/fca \
>--with-pmi=${INSTALL_ROOT}/slurm \
>--without-psm --disable-dlopen \
>--disable-vt \
>--enable-orterun-prefix-by-default \
>--enable-debug-symbols"
>
>There aren't any obvious error messages in the build log -- what am I
>missing?
>
>Andy
>
>--
>Andy Riebs
>andy.ri...@hpe.com
>Hewlett-Packard Enterprise
>High Performance Computing Software Engineering
>+1 404 648 9024
>My opinions are not necessarily those of HPE
>
>___
>users mailing list
>us...@open-mpi.org
>Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post:
>http://www.open-mpi.org/community/lists/users/2016/05/29094.php
>
>  ___
>  users mailing list
>  us...@open-mpi.org
>  Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>  Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/05/29100.php
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/05/29101.php
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/05/29102.php
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/05/29104.php
>


Re: [OMPI users] SLOAVx alltoallv

2016-05-06 Thread Joshua Ladd
Dave,

Ping me offlist about this at: joshual 'at' mellanox.com


Best,

Josh


On Fri, May 6, 2016 at 8:18 AM, Dave Love  wrote:

> At the risk of banging on too much about collectives:
>
> I came across a writeup of the "SLOAVx" algorithm for alltoallv
> .  It was implemented
> in OMPI with apparently good results, but I can't find any code.
>
> I wonder if anyone knows the story on that.  Was it not contributed, or
> is it actually not worthwhile?  Otherwise, might it be worth investigating?
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/05/29113.php
>


Re: [OMPI users] Isend, Recv and Test

2016-05-06 Thread Gilles Gouaillardet
Per the error message, you likely misspelled vader (e.g. missed the "r").

Jeff,
the behavior was initially reported on a single node, so the tcp btl is
unlikely to be in use

Cheers,

Gilles
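
For context, a minimal sketch in C of the Isend/usleep/MPI_Test polling
pattern being discussed in this thread (an assumed reconstruction, not the
poster's actual code; run with at least two ranks):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank, flag = 0;
        const int n = 4 * 1024 * 1024;      /* a "large" message: 4M ints */
        int *buf;
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        buf = malloc(n * sizeof(int));

        if (rank == 0) {
            /* With no progress thread, the data only moves while MPI is
             * given a chance to progress, i.e. inside MPI_Test here. */
            MPI_Isend(buf, n, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
            while (!flag) {
                MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
                usleep(100000);             /* 0.1 s between progress calls */
            }
        } else if (rank == 1) {
            MPI_Recv(buf, n, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }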

On Friday, May 6, 2016, Zhen Wang  wrote:

>
>
> 2016-05-05 9:27 GMT-05:00 Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com
> >:
>
>> Out of curiosity, can you try
>> mpirun --mca btl self,sm ...
>>
> Same as before. Many MPI_Test calls.
>
>> and
>> mpirun --mca btl self,vader ...
>>
> A requested component was not found, or was unable to be opened.  This
> means that this component is either not installed or is unable to be
> used on your system (e.g., sometimes this means that shared libraries
> that the component requires are unable to be found/loaded).  Note that
> Open MPI stopped checking at the first component that it did not find.
>
> Host:  VirtualBox
> Framework: btl
> Component: vade
> --
> *** An error occurred in MPI_Init
> --
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   mca_bml_base_open() failed
>   --> Returned "Not found" (-13) instead of "Success" (0)
> --
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***and potentially your MPI job)
> [VirtualBox:2188] Local abort before MPI_INIT completed successfully; not
> able to aggregate error messages, and not able to guarantee that all other
> processes were killed!
> ---
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> ---
> --
> mpirun detected that one or more processes exited with non-zero status,
> thus causing
> the job to be terminated. The first process to do so was:
>
>   Process name: [[9235,1],0]
>   Exit code:1
> --
> [VirtualBox:02186] 1 more process has sent help message help-mca-base.txt
> / find-available:not-valid
> [VirtualBox:02186] Set MCA parameter "orte_base_help_aggregate" to 0 to
> see all help / error messages
>
>
>>
>> and see if one performs better than the other ?
>>
>> Cheers,
>>
>> Gilles
>>
>> On Thursday, May 5, 2016, Zhen Wang > > wrote:
>>
>>> Gilles,
>>>
>>> Thanks for your reply.
>>>
>>> Best regards,
>>> Zhen
>>>
>>> On Wed, May 4, 2016 at 8:43 PM, Gilles Gouaillardet <
>>> gilles.gouaillar...@gmail.com> wrote:
>>>
 Note there is no progress thread in Open MPI 1.10.
 From a pragmatic point of view, that means that for "large" messages,
 no data is sent in MPI_Isend, and the data is sent when MPI "progresses",
 e.g. when you call MPI_Test, MPI_Probe, MPI_Recv or some similar subroutine.
 In your example, the data is transferred after the first usleep
 completes.

>>> I agree.
>>>

 that being said, it takes quite a while, and there could be an issue.
 what if you use MPI_Send() instead?

>>> Works as expected.
>>>
>>> MPI 1: Recv of 0 started at 08:37:10.
>>> MPI 1: Recv of 0 finished at 08:37:10.
>>> MPI 0: Send of 0 started at 08:37:10.
>>> MPI 0: Send of 0 finished at 08:37:10.
>>>
>>>
 what if you send/Recv a large message first (to "warm up" connections),
 call MPI_Barrier, and then start your MPI_Isend?

>>> Not working. For what I want to accomplish, is my code the right way to
>>> go? Is there an alternative method? Thanks.
>>>
>>> MPI 1: Recv of 0 started at 08:38:46.
>>> MPI 0: Isend of 0 started at 08:38:46.
>>> MPI 0: Isend of 1 started at 08:38:46.
>>> MPI 0: Isend of 2 started at 08:38:46.
>>> MPI 0: Isend of 3 started at 08:38:46.
>>> MPI 0: Isend of 4 started at 08:38:46.
>>> MPI 0: MPI_Test of 0 at 08:38:46.
>>> MPI 0: MPI_Test of 0 at 08:38:46.
>>> MPI 0: MPI_Test of 0 at 08:38:46.
>>> MPI 0: MPI_Test of 0 at 08:38:46.
>>> MPI 0: MPI_Test of 0 at 08:38:46.
>>> MPI 0: MPI_Test of 0 at 08:38:46.
>>> MPI 0: MPI_Test of 0 at 08:38:46.
>>> MPI 0: MPI_Test of 0 at 08:38:47.
>>> MPI 0: MPI_Test of 0 at 08:38:47.
>>> MPI 0: MPI_Test of 0 at 08:38:47.
>>> MPI 0: MPI_Test of 0 at 08:38:47.
>>> MPI 0: MPI_Test of 0 at 08:38:47.
>>> MPI 0: MPI_Test of 0 at 08:38:47.
>>> MPI 0: MPI_Test of 0 at 08:38:47.
>>> MPI 0: MPI_Test of 0 at 08:38:47.
>>> MPI 0: MPI_Test of 0 at 08:38:47.
>>> MPI 0: MPI_Test of 0 at 08:38:47.
>>> MPI 0: MPI_Test of 0 

Re: [OMPI users] SLOAVx alltoallv

2016-05-06 Thread Gilles Gouaillardet
Dave,

I briefly read the paper, and it suggests the SLOAVx algorithm is
implemented by the ml collective module.
This module had some issues and was judged not good for production:
it is disabled by default in the v1.10 series, and has been simply removed
from the v2.x branch.

You can use (at your own risk ...) either v1.10 or master with
mpirun --mca coll_ml_priority 100 ...

Cheers,

Gilles

On Friday, May 6, 2016, Dave Love  wrote:

> At the risk of banging on too much about collectives:
>
> I came across a writeup of the "SLOAVx" algorithm for alltoallv
> .  It was implemented
> in OMPI with apparently good results, but I can't find any code.
>
> I wonder if anyone knows the story on that.  Was it not contributed, or
> is it actually not worthwhile?  Otherwise, might it be worth investigating?
> ___
> users mailing list
> us...@open-mpi.org 
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/05/29113.php
>


Re: [OMPI users] SLOAVx alltoallv

2016-05-06 Thread Joshua Ladd
It did not make it upstream.


Josh

On Fri, May 6, 2016 at 9:28 AM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Dave,
>
> I briefly read the paper, and it suggests the SLOAVx algorithm is
> implemented by the ml collective module.
> This module had some issues and was judged not good for production:
> it is disabled by default in the v1.10 series, and has been simply removed
> from the v2.x branch.
>
> You can use (at your own risk ...) either v1.10 or master with
> mpirun --mca coll_ml_priority 100 ...
>
> Cheers,
>
> Gilles
>
> On Friday, May 6, 2016, Dave Love  wrote:
>
>> At the risk of banging on too much about collectives:
>>
>> I came across a writeup of the "SLOAVx" algorithm for alltoallv
>> .  It was implemented
>> in OMPI with apparently good results, but I can't find any code.
>>
>> I wonder if anyone knows the story on that.  Was it not contributed, or
>> is it actually not worthwhile?  Otherwise, might it be worth
>> investigating?
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2016/05/29113.php
>>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/05/29120.php
>


[OMPI users] Error building openmpi-dev-4010-g6c9d65c on Linux with Sun C

2016-05-06 Thread Siegmar Gross

Hi,

today I tried to build openmpi-dev-4010-g6c9d65c on my
machines (Solaris 10 Sparc, Solaris 10 x86_64, and openSUSE
Linux 12.1 x86_64) with gcc-5.1.0 and Sun C 5.13. I was
successful on most machines, but I got the following error
on my Linux machine for the Sun C compiler.

tyr openmpi-dev-4010-g6c9d65c-Linux.x86_64.64_cc 123 tail -7 
log.make.Linux.x86_64.64_cc
"../../../../../openmpi-dev-4010-g6c9d65c/opal/mca/reachable/netlink/reachable_netlink_utils_common.c", line 322: warning: extern inline function 
"nl_object_priv" not defined in translation unit

cc: Fatal error in /opt/sun/solarisstudio12.4/lib/compilers/acomp : Signal 
number = 11
make[2]: *** [reachable_netlink_utils_common.lo] Error 1
make[2]: Leaving directory 
`/export2/src/openmpi-master/openmpi-dev-4010-g6c9d65c-Linux.x86_64.64_cc/opal/mca/reachable/netlink'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory 
`/export2/src/openmpi-master/openmpi-dev-4010-g6c9d65c-Linux.x86_64.64_cc/opal'
make: *** [all-recursive] Error 1
tyr openmpi-dev-4010-g6c9d65c-Linux.x86_64.64_cc 124


I would be grateful if somebody could fix the problem.
Thank you very much for any help in advance.


Kind regards

Siegmar


Re: [OMPI users] Error building openmpi-dev-4010-g6c9d65c on Linux with Sun C

2016-05-06 Thread Gilles Gouaillardet
Siegmar,

At first glance, this looks like a crash of the compiler,
so I guess the root cause is not Open MPI
(that being said, a workaround could be implemented in Open MPI).

Cheers,

Gilles

On Saturday, May 7, 2016, Siegmar Gross <
siegmar.gr...@informatik.hs-fulda.de> wrote:

> Hi,
>
> today I tried to build openmpi-dev-4010-g6c9d65c on my
> machines (Solaris 10 Sparc, Solaris 10 x86_64, and openSUSE
> Linux 12.1 x86_64) with gcc-5.1.0 and Sun C 5.13. I was
> successful on most machines, but I got the following error
> on my Linux machine for the Sun C compiler.
>
> tyr openmpi-dev-4010-g6c9d65c-Linux.x86_64.64_cc 123 tail -7
> log.make.Linux.x86_64.64_cc
> "../../../../../openmpi-dev-4010-g6c9d65c/opal/mca/reachable/netlink/reachable_netlink_utils_common.c",
> line 322: warning: extern inline function "nl_object_priv" not defined in
> translation unit
> cc: Fatal error in /opt/sun/solarisstudio12.4/lib/compilers/acomp : Signal
> number = 11
> make[2]: *** [reachable_netlink_utils_common.lo] Error 1
> make[2]: Leaving directory
> `/export2/src/openmpi-master/openmpi-dev-4010-g6c9d65c-Linux.x86_64.64_cc/opal/mca/reachable/netlink'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory
> `/export2/src/openmpi-master/openmpi-dev-4010-g6c9d65c-Linux.x86_64.64_cc/opal'
> make: *** [all-recursive] Error 1
> tyr openmpi-dev-4010-g6c9d65c-Linux.x86_64.64_cc 124
>
>
> I would be grateful if somebody could fix the problem.
> Thank you very much for any help in advance.
>
>
> Kind regards
>
> Siegmar
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/05/29122.php
>


[OMPI users] No core dump in some cases

2016-05-06 Thread dpchoudh .
Hello all

I run MPI jobs (for test purposes only) on two different 'clusters'. Both
'clusters' have two nodes only, connected back-to-back. The two are very
similar, but not identical, both software- and hardware-wise.

Both have ulimit -c set to unlimited. However, only one of the two creates
core files when an MPI job crashes. The other creates a text file named
something like
.80s-,.btr

I'd much prefer a core file because that allows me to debug with a lot more
options than a static text file with addresses. How do I get a core file in
all situations? I am using MPI source from the master branch.

Thanks in advance
Durga

The surgeon general advises you to eat right, exercise regularly and quit
ageing.
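
One thing that can help pin down a difference like this is to print the
core-file limit and the kernel core pattern from inside the job itself; a
distribution that routes cores through a helper (a pipe in core_pattern)
will not leave a plain core file in the working directory. A small
Linux-only sketch (illustrative, not from this thread):

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl;
        char pattern[256] = "";
        FILE *fp;

        /* Core-size limit actually in effect for this process. */
        if (getrlimit(RLIMIT_CORE, &rl) == 0)
            printf("RLIMIT_CORE: soft=%llu hard=%llu\n",
                   (unsigned long long)rl.rlim_cur,
                   (unsigned long long)rl.rlim_max);

        /* Where (and whether) the kernel writes core files. */
        fp = fopen("/proc/sys/kernel/core_pattern", "r");
        if (fp != NULL) {
            if (fgets(pattern, sizeof(pattern), fp) != NULL)
                printf("core_pattern: %s", pattern);
            fclose(fp);
        }
        return 0;
    }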


Re: [OMPI users] Isend, Recv and Test

2016-05-06 Thread Zhen Wang
Jeff,

The hardware limitation doesn't allow me to use anything other than TCP...

I think I have a good understanding of what's going on, and may have a
solution. I'll test it out. Thanks to you all.

Best regards,
Zhen

On Fri, May 6, 2016 at 7:13 AM, Jeff Squyres (jsquyres) 
wrote:

> On May 5, 2016, at 10:09 PM, Zhen Wang  wrote:
> >
> > It's taking so long because you are sleeping for .1 second between
> calling MPI_Test().
> >
> > The TCP transport is only sending a few fragments of your message during
> each iteration through MPI_Test (because, by definition, it has to return
> "immediately").  Other transports do better handing off large messages like
> this to hardware for asynchronous progress.
> This agrees with what I observed. Larger messages need more calls of
> MPI_Test. What do you mean by other transports?
>
> The POSIX sockets API, commonly used with TCP over Ethernet, is great for
> most network-based applications, but it has some inherent constraints that
> limit its performance in HPC types of applications.
>
> That being said, many people just take a bunch of servers and run MPI
> over TCP/Ethernet, and it works well enough for them.  Because of this
> "good enough" performance, and the fact that every server in the world
> supports some type of Ethernet capability, all MPI implementations support
> TCP.
>
> But there are more demanding HPC applications that require higher
> performance from the network in order to get good overall performance.  As
> such, other networking APIs -- most commonly provided by vendors for
> HPC-class networks (Ethernet or otherwise) -- do not have the same
> performance constraints as the POSIX sockets API, and are usually preferred
> by HPC applications.
>
> There's usually two kinds of performance improvements that such networking
> APIs offer (in conjunction with the underlying NIC for the HPC-class
> network):
>
> 1. Improving software API efficiency (e.g., avoid extra memory copies,
> bypassing the OS and exposing NIC hardware directly into userspace, etc.)
>
> 2. Exploiting NIC hardware capabilities, usually designed for MPI and/or
> general high performance (e.g., polling for progress instead of waiting for
> interrupts, hardware demultiplex of incoming messages directly to target
> processes, direct data placement at the target, etc.)
>
> Hence, when I say "other transports", I'm referring to these HPC-class
> networks (and associated APIs).
>
> > Additionally, in the upcoming v2.0.0 release is a non-default option to
> enable an asynchronous progress thread for the TCP transport.  We're up to
> v2.0.0rc2; you can give that async TCP support a whirl, if you want.  Pass
> "--mca btl_tcp_progress_thread 1" on the mpirun command line to enable the
> TCP progress thread to try it.
> > Does this mean there's an additional thread to transfer data in
> background?
>
> Yes.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/05/29112.php
>


Re: [OMPI users] barrier algorithm 5

2016-05-06 Thread Gilles Gouaillardet

Dave,


I made PR #1644 to abort with a user-friendly error message

https://github.com/open-mpi/ompi/pull/1644


Cheers,


Gilles


On 5/5/2016 2:05 AM, Dave Love wrote:

Gilles Gouaillardet  writes:


Dave,

Yes, this is for two MPI tasks only.

The MPI subroutine could/should return with an error if the communicator is
made of more than two tasks.
Another option would be to abort at initialization time if no collective
module provides a barrier implementation.
Or maybe the tuned module should not have used the two_procs algorithm, but
what should it do instead? Use a default one? Not implement barrier?
Warn/error the end user?

Note the error message might be a bit obscure.

I write "could" because you explicitly forced something that cannot work,
and I am not convinced Open MPI should protect end users from themselves,
even when they make an honest mistake.

I just looped over the available algorithms, not expecting any not to
work.  One question is how I'd know it can't work; I can't find
documentation on the algorithms, just the more-or-less suggestive names
that I might be able to find in the literature, or not.  Is there a good
place to look?

In the absence of a good reason why not -- and I haven't looked at the code
-- I'd expect it to abort with a message about the algorithm being
limited to two processes at some stage.  Of course, this isn't a common
case, and people probably have more important things to do.
___
users mailing list
us...@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2016/05/29083.php
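
For readers who hit this thread later: the failure under discussion comes
from forcing the tuned module's two-process barrier (algorithm 5) on a
communicator with more than two ranks, roughly along these lines (MCA
parameter names are those of the tuned module; the program name is just a
placeholder):

    mpirun --mca coll_tuned_use_dynamic_rules 1 \
           --mca coll_tuned_barrier_algorithm 5 \
           -np 4 ./barrier_test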





Re: [OMPI users] Segmentation Fault (Core Dumped) on mpif90 -v

2016-05-06 Thread Giacomo Rossi
Yes, I've tried three simple "Hello world" programs in Fortran, C and C++,
and they compile and run with Intel 16.0.3. The problem is with the Open MPI
compiled from source.

Giacomo Rossi Ph.D., Space Engineer

Research Fellow at Dept. of Mechanical and Aerospace Engineering, "Sapienza"
University of Rome
p: (+39) 0692927207 | m: (+39) 3408816643 | e: giacom...@gmail.com

Member of Fortran-FOSS-programmers



2016-05-05 11:15 GMT+02:00 Giacomo Rossi :

>  gdb /opt/openmpi/1.10.2/intel/16.0.3/bin/mpif90
> GNU gdb (GDB) 7.11
> Copyright (C) 2016 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <
> http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-pc-linux-gnu".
> Type "show configuration" for configuration details.
> For bug reporting instructions, please see:
> .
> Find the GDB manual and other documentation resources online at:
> .
> For help, type "help".
> Type "apropos word" to search for commands related to "word"...
> Reading symbols from /opt/openmpi/1.10.2/intel/16.0.3/bin/mpif90...(no
> debugging symbols found)...done.
> (gdb) r -v
> Starting program: /opt/openmpi/1.10.2/intel/16.0.3/bin/mpif90 -v
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x76858f38 in ?? ()
> (gdb) bt
> #0  0x76858f38 in ?? ()
> #1  0x77de5828 in _dl_relocate_object () from
> /lib64/ld-linux-x86-64.so.2
> #2  0x77ddcfa3 in dl_main () from /lib64/ld-linux-x86-64.so.2
> #3  0x77df029c in _dl_sysdep_start () from
> /lib64/ld-linux-x86-64.so.2
> #4  0x774a in _dl_start () from /lib64/ld-linux-x86-64.so.2
> #5  0x77dd9d98 in _start () from /lib64/ld-linux-x86-64.so.2
> #6  0x0002 in ?? ()
> #7  0x7fffaa8a in ?? ()
> #8  0x7fffaab6 in ?? ()
> #9  0x in ?? ()
>
> Giacomo Rossi Ph.D., Space Engineer
>
> Research Fellow at Dept. of Mechanical and Aerospace Engineering, "Sapienza"
> University of Rome
> p: (+39) 0692927207 | m: (+39) 3408816643 | e: giacom...@gmail.com
> 
> Member of Fortran-FOSS-programmers
> 
>
>
> 2016-05-05 10:44 GMT+02:00 Giacomo Rossi :
>
>> Here the result of ldd command:
>> 'ldd /opt/openmpi/1.10.2/intel/16.0.3/bin/mpif90
>> linux-vdso.so.1 (0x7ffcacbbe000)
>> libopen-pal.so.13 =>
>> /opt/openmpi/1.10.2/intel/16.0.3/lib/libopen-pal.so.13 (0x7fa9597a9000)
>> libm.so.6 => /usr/lib/libm.so.6 (0x7fa9594a4000)
>> libpciaccess.so.0 => /usr/lib/libpciaccess.so.0 (0x7fa95929a000)
>> libdl.so.2 => /usr/lib/libdl.so.2 (0x7fa959096000)
>> librt.so.1 => /usr/lib/librt.so.1 (0x7fa958e8e000)
>> libutil.so.1 => /usr/lib/libutil.so.1 (0x7fa958c8b000)
>> libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x7fa958a75000)
>> libpthread.so.0 => /usr/lib/libpthread.so.0 (0x7fa958858000)
>> libc.so.6 => /usr/lib/libc.so.6 (0x7fa9584b7000)
>> libimf.so =>
>> /home/giacomo/intel/compilers_and_libraries_2016.3.210/linux/compiler/lib/intel64/libimf.so
>> (0x7fa957fb9000)
>> libsvml.so =>
>> /home/giacomo/intel/compilers_and_libraries_2016.3.210/linux/compiler/lib/intel64/libsvml.so
>> (0x7fa9570ad000)
>> libirng.so =>
>> /home/giacomo/intel/compilers_and_libraries_2016.3.210/linux/compiler/lib/intel64/libirng.so
>> (0x7fa956d3b000)
>> libintlc.so.5 =>
>> /home/giacomo/intel/compilers_and_libraries_2016.3.210/linux/compiler/lib/intel64/libintlc.so.5
>> (0x7fa956acf000)
>> /lib64/ld-linux-x86-64.so.2 (0x7fa959ab9000)'
>>
>> I can't provide a core file, because I can't compile or launch any
>> program with mpifort... I always get the 'core dumped' error when I try
>> to compile a program with mpifort, and of course there isn't any core file.
>>
>>
>> Giacomo Rossi Ph.D., Space Engineer
>>
>> Research Fellow at Dept. of Mechanical and Aerospace Engineering, "Sapienza"
>> University of Rome
>> p: (+39) 0692927207 | m: (+39) 3408816643 | e:
>> giacom...@gmail.com
>> 
>> Member of Fortran-FOSS-programmers
>> 
>>
>>
>> 2016-05-05 8:50 GMT+02:00 Giacomo Rossi :
>>
>>> I’ve installed the latest version of Intel Parallel Studio (16.0.3),
>>> then I’ve downloaded the latest version of openmpi (1.10.2) and I’ve
>>> compiled it with
>>>
>>> `./configure CC=icc CXX=icpc F77=ifort FC=ifort
>>> --prefix=/opt/openmpi/1.10.2/intel/16.0.3`
>>>
>>> then I've installed and everything seems ok, but when I try the simple
>>> command
>>>
>>> ' /opt/openmpi/1.10.2/intel/16.0.3/bin/mpif90 -v'
>>>
>>> I receive the following error
>>>
>>> 'Segmentation fault (core dumped)'
>>>
>>> I'm on ArchLinux, with kernel 4.5.1-1-ARCH; I've atta

Re: [OMPI users] Multiple Non-blocking Send/Recv calls with MPI_Waitall fails when CUDA IPC is in use

2016-05-06 Thread Jiri Kraus
Hi Iman,

How are you handling GPU affinity? Are you using CUDA_VISIBLE_DEVICES for that?
If yes, can you try using cudaSetDevice in your application instead?
Also, when multiple processes are assigned to a single GPU, are you using MPS,
and what GPUs are you running this on?

Hope this helps

Jiri
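
For reference, a minimal C sketch (not the poster's code) of the
cudaSetDevice approach suggested above, using the node-local rank to pick a
device when several ranks share a node:

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        int local_rank, ndev;
        MPI_Comm local_comm;

        MPI_Init(&argc, &argv);

        /* Rank within the node, used to choose a GPU instead of relying
         * on CUDA_VISIBLE_DEVICES. */
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &local_comm);
        MPI_Comm_rank(local_comm, &local_rank);

        cudaGetDeviceCount(&ndev);
        cudaSetDevice(local_rank % ndev);   /* several ranks may share a GPU */

        /* ... allocate device buffers and do the non-blocking exchange ... */

        MPI_Comm_free(&local_comm);
        MPI_Finalize();
        return 0;
    }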

> Message: 2
> Date: Wed, 4 May 2016 15:55:20 -0400
> From: Iman Faraji 
> To: us...@open-mpi.org
> Subject: [OMPI users] Multiple Non-blocking Send/Recv calls with
>   MPI_Waitall fails when CUDA IPC is in use
> Message-ID:
>l.gmail.com>
> Content-Type: text/plain; charset="utf-8"
> 
> Hi there,
> 
> I am using multiple MPI non-blocking send receives on the GPU buffer
> followed by a waitall at the end; I also repeat this process multiple times.
> 
> The MPI version that I am using 1.10.2.
> 
> When multiple processes are assigned to a single GPU (or when CUDA IPC is
> used), I get the following error at the beginning
> 
> The call to cuIpcGetEventHandle failed. This is a unrecoverable error and will
> cause the program to abort.
>   cuIpcGetEventHandle return value:   1
> 
> and this at the end of my benchmark
> 
> The call to cuEventDestory failed. This is a unrecoverable error and will 
> cause
> the program to abort.
>   cuEventDestory return value:   400
> Check the cuda.h file for what the return value means.
> 
> 
> *Note1: *
> 
> This error doesn't appear if only one iteration of the non-blocking
> send/receive call is used (i.e., using MPI_Waitall only once).
> 
> This error doesn't appear if multiple iterations are used but MPI_Waitall is
> not included.
> 
> *Note 2:*
> 
> This error doesn't exist if the buffer is allocated on the host.
> 
> *Note 3:*
> 
> This error doesn't exist if cuda_ipc is disabled or OMPI version 1.8.8 is 
> used.
> 
> 
> I'd appreciate it if you let me know what causes this issue and how it can be
> resolved.
> 
> Regards,
> Iman

NVIDIA GmbH, Wuerselen, Germany, Amtsgericht Aachen, HRB 8361
Managing Director: Karen Theresa Burns

---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---


Re: [OMPI users] [open-mpi/ompi] COMM_SPAWN broken on Solaris/v1.10 (#1569)

2016-05-06 Thread Gilles Gouaillardet

Siegmar,

I was unable to reproduce the issue with one Solaris 11 x86_64 VM and
one Linux x86_64 VM.


What is the minimal configuration you need to reproduce the issue?

Are you able to reproduce the issue with only x86_64 nodes?

I was under the impression that Solaris vs. Linux is the issue, but is it
big- vs. little-endian instead?



Cheers,


Gilles


On 5/5/2016 9:13 PM, Siegmar Gross wrote:

Hi Gilles,

Is the following output helpful for finding the error? I've put
another output below the output from gdb, which shows that
things are a little bit "random" if I use only 3+2 or 4+1
Sparc machines.


tyr spawn 127 /usr/local/gdb-7.6.1_64_gcc/bin/gdb mpiexec
GNU gdb (GDB) 7.6.1
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 


This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show 
copying"

and "show warranty" for details.
This GDB was configured as "sparc-sun-solaris2.10".
For bug reporting instructions, please see:
...
Reading symbols from 
/export2/prog/SunOS_sparc/openmpi-1.10.3_64_cc/bin/orterun...done.
(gdb) set args -np 1 --host tyr,sunpc1,linpc1,ruester 
spawn_multiple_master

(gdb) run
Starting program: /usr/local/openmpi-1.10.3_64_cc/bin/mpiexec -np 1 
--host tyr,sunpc1,linpc1,ruester spawn_multiple_master

[Thread debugging using libthread_db enabled]
[New Thread 1 (LWP 1)]
[New LWP2]

Parent process 0 running on tyr.informatik.hs-fulda.de
  I create 3 slave processes.

Assertion failed: OPAL_OBJ_MAGIC_ID == ((opal_object_t *) 
(proc_pointer))->obj_magic_id, file 
../../openmpi-v1.10.2-163-g42da15d/ompi/group/group_init.c, line 215, 
function ompi_group_increment_proc_count

[ruester:17809] *** Process received signal ***
[ruester:17809] Signal: Abort (6)
[ruester:17809] Signal code:  (-1)
/usr/local/openmpi-1.10.3_64_cc/lib64/libopen-pal.so.13.0.2:opal_backtrace_print+0x1c 


/usr/local/openmpi-1.10.3_64_cc/lib64/libopen-pal.so.13.0.2:0x1b10f0
/lib/sparcv9/libc.so.1:0xd8c28
/lib/sparcv9/libc.so.1:0xcc79c
/lib/sparcv9/libc.so.1:0xcc9a8
/lib/sparcv9/libc.so.1:__lwp_kill+0x8 [ Signal 2091943080 (?)]
/lib/sparcv9/libc.so.1:abort+0xd0
/lib/sparcv9/libc.so.1:_assert_c99+0x78
/usr/local/openmpi-1.10.3_64_cc/lib64/libmpi.so.12.0.3:ompi_group_increment_proc_count+0x10c 


/usr/local/openmpi-1.10.3_64_cc/lib64/openmpi/mca_dpm_orte.so:0xe758
/usr/local/openmpi-1.10.3_64_cc/lib64/openmpi/mca_dpm_orte.so:0x113d4
/usr/local/openmpi-1.10.3_64_cc/lib64/libmpi.so.12.0.3:ompi_mpi_init+0x188c 


/usr/local/openmpi-1.10.3_64_cc/lib64/libmpi.so.12.0.3:MPI_Init+0x26c
/home/fd1026/SunOS/sparc/bin/spawn_slave:main+0x18
/home/fd1026/SunOS/sparc/bin/spawn_slave:_start+0x108
[ruester:17809] *** End of error message ***
-- 

mpiexec noticed that process rank 2 with PID 0 on node ruester exited 
on signal 6 (Abort).
-- 


[LWP2 exited]
[New Thread 2]
[Switching to Thread 1 (LWP 1)]
sol_thread_fetch_registers: td_ta_map_id2thr: no thread can be found 
to satisfy query

(gdb) bt
#0  0x7f6173d0 in rtld_db_dlactivity () from 
/usr/lib/sparcv9/ld.so.1

#1  0x7f6175a8 in rd_event () from /usr/lib/sparcv9/ld.so.1
#2  0x7f618950 in lm_delete () from /usr/lib/sparcv9/ld.so.1
#3  0x7f6226bc in remove_so () from /usr/lib/sparcv9/ld.so.1
#4  0x7f624574 in remove_hdl () from /usr/lib/sparcv9/ld.so.1
#5  0x7f61d97c in dlclose_core () from /usr/lib/sparcv9/ld.so.1
#6  0x7f61d9d4 in dlclose_intn () from /usr/lib/sparcv9/ld.so.1
#7  0x7f61db0c in dlclose () from /usr/lib/sparcv9/ld.so.1
#8  0x7e5f9718 in dlopen_close (handle=0x100)
at 
../../../../../openmpi-v1.10.2-163-g42da15d/opal/mca/dl/dlopen/dl_dlopen_module.c:144

#9  0x7e5f364c in opal_dl_close (handle=0xff7d700200ff)
at 
../../../../openmpi-v1.10.2-163-g42da15d/opal/mca/dl/base/dl_base_fns.c:53

#10 0x7e546714 in ri_destructor (obj=0x1200)
at 
../../../../openmpi-v1.10.2-163-g42da15d/opal/mca/base/mca_base_component_repository.c:357
#11 0x7e543840 in opal_obj_run_destructors 
(object=0xff7f607a6cff)
at 
../../../../openmpi-v1.10.2-163-g42da15d/opal/class/opal_object.h:451
#12 0x7e545f54 in mca_base_component_repository_release 
(component=0xff7c801df0ff)
at 
../../../../openmpi-v1.10.2-163-g42da15d/opal/mca/base/mca_base_component_repository.c:223
#13 0x7e54d0d8 in mca_base_component_unload 
(component=0xff7d3000, output_id=-1610596097)
at 
../../../../openmpi-v1.10.2-163-g42da15d/opal/mca/base/mca_base_components_close.c:47
#14 0x7e54d17c in mca_base_component_close (component=0x100, 
output_id=-1878702080)
at 
../../../../openmpi-v1.10.2-163-g42d

Re: [OMPI users] Segmentation Fault (Core Dumped) on mpif90 -v

2016-05-06 Thread Jeff Squyres (jsquyres)
Ok, good.

I asked that question because typically when we see errors like this, it is 
usually either a busted compiler installation or inadvertently mixing the 
run-times of multiple different compilers in some kind of incompatible way.  
Specifically, the mpifort (aka mpif90) application is a fairly simple program 
-- there's no reason it should segv, especially with a stack trace that you 
sent that implies that it's dying early in startup, potentially even before it 
has hit any Open MPI code (i.e., it could even be pre-main).

BTW, you might be able to get a more complete stack trace from the debugger 
that comes with the Intel compiler (idb?  I don't remember offhand).

Since you are able to run simple programs compiled by this compiler, it sounds 
like the compiler is working fine.  Good!

The next thing to check is to see if somehow the compiler and/or run-time 
environments are getting mixed up.  E.g., the apps were compiled for one 
compiler/run-time but are being used with another.  Also ensure that any 
compiler/linker flags that you are passing to Open MPI's configure script are 
native and correct for the platform for which you're compiling (e.g., don't 
pass in flags that optimize for a different platform; that may result in 
generating machine code instructions that are invalid for your platform).

Try recompiling/re-installing Open MPI from scratch, and if it still doesn't 
work, then send all the information listed here:

https://www.open-mpi.org/community/help/


> On May 6, 2016, at 3:45 AM, Giacomo Rossi  wrote:
> 
> Yes, I've tried three simple "Hello world" programs in Fortran, C and C++, and
> they compile and run with Intel 16.0.3. The problem is with the Open MPI
> compiled from source.
> 
> Giacomo Rossi Ph.D., Space Engineer
> 
> Research Fellow at Dept. of Mechanical and Aerospace Engineering, "Sapienza" 
> University of Rome
> p: (+39) 0692927207 | m: (+39) 3408816643 | e: giacom...@gmail.com
> 
> Member of Fortran-FOSS-programmers
> 
> 
> 2016-05-05 11:15 GMT+02:00 Giacomo Rossi :
>  gdb /opt/openmpi/1.10.2/intel/16.0.3/bin/mpif90
> GNU gdb (GDB) 7.11
> Copyright (C) 2016 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later 
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-pc-linux-gnu".
> Type "show configuration" for configuration details.
> For bug reporting instructions, please see:
> .
> Find the GDB manual and other documentation resources online at:
> .
> For help, type "help".
> Type "apropos word" to search for commands related to "word"...
> Reading symbols from /opt/openmpi/1.10.2/intel/16.0.3/bin/mpif90...(no 
> debugging symbols found)...done.
> (gdb) r -v
> Starting program: /opt/openmpi/1.10.2/intel/16.0.3/bin/mpif90 -v
> 
> Program received signal SIGSEGV, Segmentation fault.
> 0x76858f38 in ?? ()
> (gdb) bt
> #0  0x76858f38 in ?? ()
> #1  0x77de5828 in _dl_relocate_object () from 
> /lib64/ld-linux-x86-64.so.2
> #2  0x77ddcfa3 in dl_main () from /lib64/ld-linux-x86-64.so.2
> #3  0x77df029c in _dl_sysdep_start () from /lib64/ld-linux-x86-64.so.2
> #4  0x774a in _dl_start () from /lib64/ld-linux-x86-64.so.2
> #5  0x77dd9d98 in _start () from /lib64/ld-linux-x86-64.so.2
> #6  0x0002 in ?? ()
> #7  0x7fffaa8a in ?? ()
> #8  0x7fffaab6 in ?? ()
> #9  0x in ?? ()
> 
> Giacomo Rossi Ph.D., Space Engineer
> 
> Research Fellow at Dept. of Mechanical and Aerospace Engineering, "Sapienza" 
> University of Rome
> p: (+39) 0692927207 | m: (+39) 3408816643 | e: giacom...@gmail.com
> 
> Member of Fortran-FOSS-programmers
> 
> 
> 2016-05-05 10:44 GMT+02:00 Giacomo Rossi :
> Here the result of ldd command:
> 'ldd /opt/openmpi/1.10.2/intel/16.0.3/bin/mpif90
>   linux-vdso.so.1 (0x7ffcacbbe000)
>   libopen-pal.so.13 => 
> /opt/openmpi/1.10.2/intel/16.0.3/lib/libopen-pal.so.13 (0x7fa9597a9000)
>   libm.so.6 => /usr/lib/libm.so.6 (0x7fa9594a4000)
>   libpciaccess.so.0 => /usr/lib/libpciaccess.so.0 (0x7fa95929a000)
>   libdl.so.2 => /usr/lib/libdl.so.2 (0x7fa959096000)
>   librt.so.1 => /usr/lib/librt.so.1 (0x7fa958e8e000)
>   libutil.so.1 => /usr/lib/libutil.so.1 (0x7fa958c8b000)
>   libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x7fa958a75000)
>   libpthread.so.0 => /usr/lib/libpthread.so.0 (0x7fa958858000)
>   libc.so.6 => /usr/lib/libc.so.6 (0x7fa9584b7000)
>   libimf.so => 
> /home/giacomo/intel/compilers_and_libraries_2016.3.210/linux/compiler/lib/intel64/libimf.so
>  (0x7fa957fb9000)
>   libsvml.so => 
> /home/giacomo/intel/compilers_and_libraries_2016.3.210/linux/compiler/lib

Re: [OMPI users] Isend, Recv and Test

2016-05-06 Thread Jeff Squyres (jsquyres)
On May 5, 2016, at 10:09 PM, Zhen Wang  wrote:
> 
> It's taking so long because you are sleeping for .1 second between calling 
> MPI_Test().
> 
> The TCP transport is only sending a few fragments of your message during each 
> iteration through MPI_Test (because, by definition, it has to return 
> "immediately").  Other transports do better handing off large messages like 
> this to hardware for asynchronous progress.
> This agrees with what I observed. Larger messages need more calls of
> MPI_Test. What do you mean by other transports?

The POSIX sockets API, commonly used with TCP over Ethernet, is great for most 
network-based applications, but it has some inherent constraints that limit its 
performance in HPC types of applications.

That being said, many people just take a bunch of servers and run MPI over
TCP/Ethernet, and it works well enough for them.  Because of this "good enough"
performance, and the fact that every server in the world supports some type of 
Ethernet capability, all MPI implementations support TCP.

But there are more demanding HPC applications that require higher performance 
from the network in order to get good overall performance.  As such, other 
networking APIs -- most commonly provided by vendors for HPC-class networks 
(Ethernet or otherwise) -- do not have the same performance constraints as the 
POSIX sockets API, and are usually preferred by HPC applications.  

There's usually two kinds of performance improvements that such networking APIs 
offer (in conjunction with the underlying NIC for the HPC-class network):

1. Improving software API efficiency (e.g., avoid extra memory copies, 
bypassing the OS and exposing NIC hardware directly into userspace, etc.)

2. Exploiting NIC hardware capabilities, usually designed for MPI and/or 
general high performance (e.g., polling for progress instead of waiting for 
interrupts, hardware demultiplex of incoming messages directly to target 
processes, direct data placement at the target, etc.)

Hence, when I say "other transports", I'm referring to these HPC-class networks 
(and associated APIs).

> Additionally, in the upcoming v2.0.0 release is a non-default option to 
> enable an asynchronous progress thread for the TCP transport.  We're up to 
> v2.0.0rc2; you can give that async TCP support a whirl, if you want.  Pass 
> "--mca btl_tcp_progress_thread 1" on the mpirun command line to enable the 
> TCP progress thread to try it.
> Does this mean there's an additional thread to transfer data in background? 

Yes.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI users] SLOAVx alltoallv

2016-05-06 Thread Dave Love
At the risk of banging on too much about collectives:

I came across a writeup of the "SLOAVx" algorithm for alltoallv
.  It was implemented
in OMPI with apparently good results, but I can't find any code.

I wonder if anyone knows the story on that.  Was it not contributed, or
is it actually not worthwhile?  Otherwise, might it be worth investigating?


Re: [OMPI users] barrier algorithm 5

2016-05-06 Thread Dave Love
Gilles Gouaillardet  writes:

> Dave,
>
>
> I made PR #1644 to abort with a user-friendly error message
>
> https://github.com/open-mpi/ompi/pull/1644

Thanks.  Could there be similar cases that might be worth a change?


Re: [OMPI users] [open-mpi/ompi] COMM_SPAWN broken on Solaris/v1.10 (#1569)

2016-05-06 Thread Siegmar Gross

Hi Gilles,

today I'm building all current versions with both compilers on
my machines. Unfortunately it takes some hours, because my Solaris
Sparc machine in particular is old and slow. Yesterday I had problems
using two Sparc machines and nothing else. Tonight the new versions
will be copied to all machines so that I can test them tomorrow.

Are the following two commands equivalent? Is the second one correct
if "loki" has two sockets with six cores each?

mpiexec -np 3 --host loki,loki,loki hello_1_mpi
mpiexec -np 3 --host loki --slot-list 0:0-5,1:0-5 hello_1_mpi


Kind regards

Siegmar

Am 06.05.2016 um 10:36 schrieb Gilles Gouaillardet:

Siegmar,

I was unable to reproduce the issue with one Solaris 11 x86_64 VM and one
Linux x86_64 VM.


What is the minimal configuration you need to reproduce the issue?

Are you able to reproduce the issue with only x86_64 nodes?

I was under the impression that Solaris vs. Linux is the issue, but is it
big- vs. little-endian instead?


Cheers,


Gilles