[OMPI users] MPI_CANCEL for nonblocking collective communication

2017-06-09 Thread Markus
Dear MPI Users and Maintainers,

I am using Open MPI version 1.10.4 with multithreading support enabled and
the Java bindings. I use MPI from Java, with one process per machine and
multiple threads per process.

I was trying to build a broadcast listener thread which calls MPI_iBcast,
followed by MPI_WAIT.

I use the request object returned by MPI_iBcast to shut the listener down,
calling MPI_CANCEL for that request from the main thread.
This results in

[fe-402-1:2972] *** An error occurred in MPI_Cancel
[fe-402-1:2972] *** reported by process [1275002881,17179869185]
[fe-402-1:2972] *** on communicator MPI_COMM_WORLD
[fe-402-1:2972] *** MPI_ERR_REQUEST: invalid request
[fe-402-1:2972] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[fe-402-1:2972] ***    and potentially your MPI job)


This indicates that the request is invalid in some fashion. I already checked
that it is not null (MPI_REQUEST_NULL). I have also set up a simple testbed
where nothing else happens except that one broadcast. The request object is
always considered invalid, no matter where I call cancel() from.

As far as I understand the MPI specifications, cancel is also supposed to
work for nonblocking collective communication (which includes my broadcasts).
I haven't found any advice yet, so I hope to find some help on this mailing
list.

Kind regards,
Markus Jeromin

PS: Testbed for calling MPI_CANCEL, written in Java.
___

package distributed.mpi;

import java.nio.ByteBuffer;

import mpi.MPI;
import mpi.MPIException;
import mpi.Request;

/**
 * Testing MPI_CANCEL on MPI_iBcast.
 * The program does not terminate because the listeners are still running and
 * waiting for the Java native call MPI_WAIT to return. MPI_CANCEL is called,
 * but the listener never unblocks (i.e. the MPI_WAIT never returns).
 *
 * @author mjeromin
 */
public class BroadcastTestCancel {

    static int myrank;

    /**
     * Listener that waits for incoming broadcasts from the specified root.
     * Uses asynchronous MPI_iBcast and MPI_WAIT.
     */
    static class Listener extends Thread {

        ByteBuffer b = ByteBuffer.allocateDirect(100);
        public Request req = null;

        @Override
        public void run() {
            super.run();
            try {
                req = MPI.COMM_WORLD.iBcast(b, b.limit(), MPI.BYTE, 0);
                System.out.println(myrank + ": waiting for bcast (that will never come)");
                req.waitFor();
            } catch (MPIException e) {
                e.printStackTrace();
            }
            System.out.println(myrank + ": listener unblocked");
        }
    }

    public static void main(String[] args) throws MPIException, InterruptedException {

        // we need full thread support
        int threadSupport = MPI.InitThread(args, MPI.THREAD_MULTIPLE);
        if (threadSupport != MPI.THREAD_MULTIPLE) {
            System.out.println(myrank + ": no multithread support. Aborting.");
            MPI.Finalize();
            return;
        }

        // disable or enable exceptions; it does not matter at all
        MPI.COMM_WORLD.setErrhandler(MPI.ERRORS_RETURN);

        myrank = MPI.COMM_WORLD.getRank();

        // start receiving listeners, but no sender (which would be node 0)
        if (myrank > 0) {
            Listener l = new Listener();
            l.start();

            // let the listener reach waitFor()
            Thread.sleep(5000);

            // call MPI_CANCEL (the matching send will never arrive)
            try {
                l.req.cancel();
            } catch (MPIException e) {
                // depends on the error handler
                System.out.println(myrank + ": MPI Exception \n" + e.toString());
            }
        }

        // don't call MPI_Finalize too early (not strictly necessary to wait
        // here, but just to be sure)
        Thread.sleep(15000);

        System.out.println(myrank + ": calling finish");
        MPI.Finalize();
        System.out.println(myrank + ": finished");
    }

}
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

[OMPI users] Deadlocks and warnings from libevent when using MPI_THREAD_MULTIPLE

2014-04-25 Thread Markus Wittmann

Hi everyone,

I'm using the current Open MPI 1.8.1 release and observe
non-deterministic deadlocks and warnings from libevent when using
MPI_THREAD_MULTIPLE. Open MPI has been configured with
--enable-mpi-thread-multiple --with-tm --with-verbs (see attached
config.log)

Attached is a sample application that spawns a thread for each process
after MPI_Init_thread has been called. The thread then calls MPI_Recv,
which blocks until the matching MPI_Send is issued just before
MPI_Finalize in the main thread. (AFAIK MPICH uses this kind of mechanism
to implement a progress thread.) Meanwhile, the main thread exchanges data
with its right/left neighbors via Isend/Irecv.

I only see this when the MPI processes run on separate nodes, as in the
following:

$ mpiexec -n 2 -map-by node ./test
[0] isend/irecv.
[0] progress thread...
[0] waitall.
[warn] opal_libevent2021_event_base_loop: reentrant invocation. Only one event_base_loop can run on each event_base at once.

[1] isend/irecv.
[1] progress thread...
[1] waitall.
[warn] opal_libevent2021_event_base_loop: reentrant invocation. Only one event_base_loop can run on each event_base at once.




Can anybody confirm this?

Best regards,
Markus

--
Markus Wittmann, HPC Services
Friedrich-Alexander-Universität Erlangen-Nürnberg
Regionales Rechenzentrum Erlangen (RRZE)
Martensstrasse 1, 91058 Erlangen, Germany
http://www.rrze.fau.de/hpc/


info.tar.bz2
Description: Binary data
// Compile with: mpicc test.c -pthread -o test
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>


static void * ProgressThread(void * ptRank)
{
    int buffer = 0xCDEFCDEF;
    int rank = *((int *)ptRank);

    printf("[%d] progress thread...\n", rank);
    MPI_Recv(&buffer, 1, MPI_INT, rank, 999, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    return NULL;
}


int main(int argc, char * argv[])
{
    int rank = -1;
    int size = -1;
    int bufferSend = 0;
    int bufferRecv = 0;
    int requested = MPI_THREAD_MULTIPLE;
    int provided = -1;
    int error;
    pthread_t thread;
    MPI_Request requests[2];

    MPI_Init_thread(&argc, &argv, requested, &provided);

    if (requested != provided) {
        printf("error: requested %d != provided %d\n", requested, provided);
        exit(1);
    }

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    error = pthread_create(&thread, NULL, &ProgressThread, &rank);

    if (error != 0) {
        fprintf(stderr, "pthread_create failed (%d): %s\n", error, strerror(error));
    }


    printf("[%d] isend/irecv.\n", rank);
    MPI_Isend(&bufferSend, 1, MPI_INT, (rank + 1) % size, 0,
              MPI_COMM_WORLD, &requests[0]);
    MPI_Irecv(&bufferRecv, 1, MPI_INT, (rank - 1 + size) % size, 0,
              MPI_COMM_WORLD, &requests[1]);

    printf("[%d] waitall.\n", rank);
    MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);

    printf("[%d] send.\n", rank);
    MPI_Send(&bufferSend, 1, MPI_INT, rank, 999, MPI_COMM_WORLD);

    error = pthread_join(thread, NULL);

    if (error != 0) {
        fprintf(stderr, "pthread_join failed (%d): %s\n", error, strerror(error));
    }

    printf("[%d] done.\n", rank);

    MPI_Finalize();

    return 0;
}


Re: [OMPI users] Deadlocks and warnings from libevent when using MPI_THREAD_MULTIPLE

2014-04-26 Thread Markus Wittmann

On 25.04.2014 23:40, Ralph Castain wrote:

We don't fully support THREAD_MULTIPLE, and most definitely not when
using IB. We are planning on extending that coverage in the 1.9
series

Ah OK, thanks for the fast reply.

--
Markus Wittmann, HPC Services
Friedrich-Alexander-Universität Erlangen-Nürnberg
Regionales Rechenzentrum Erlangen (RRZE)
Martensstrasse 1, 91058 Erlangen, Germany
Tel.: +49 9131 85-20104
markus.wittm...@fau.de
http://www.rrze.fau.de/hpc/


[OMPI users] [R] Short survey concerning the use of software engineering in the field of High Performance Computing

2010-08-31 Thread Markus Schmidberger

Dear Colleagues,

this is a short survey (21 questions that take about 10 minutes to
answer) in the context of the research work for my PhD thesis and the Munich
Center of Advanced Computing (Project B2). It would be very helpful if
you would take the time to answer my questions concerning the use of
software engineering in the field of High Performance Computing.


Please note that all questions are mandatory!

http://www.q-set.de/q-set.php?sCode=TCSBHMPZAASZ


Thank you very much, kind regards

Miriam Schmidberger
(Dipl. Medien-Inf.)

schmi...@in.tum.de

Technische Universität München
Institut für Informatik
Boltzmannstr. 3
85748 Garching
Germany
Office 01.07.037
Tel: +49 (89) 289-18226


[OMPI users] Running your MPI application on a Computer Cluster in the Cloud - cloudnumbers.com

2011-07-13 Thread Markus Schmidberger
Dear MPI users and experts,

cloudnumbers.com provides researchers and companies with the resources
to perform high performance calculations in the cloud. As
cloudnumbers.com's community manager, I would like to invite you to register
and test your MPI application on a computer cluster in the cloud for free:
http://my.cloudnumbers.com/register

Our aim is to change the way research collaboration is done today by
bringing together scientists and businesses from all over the world on a
single platform. cloudnumbers.com is a Berlin (Germany) based
international high-tech startup striving to enable everyone to
benefit from the High Performance Computing advantages of the
cloud. We provide easy access to applications running on any kind of
computer hardware: from single-core high-memory machines up to 1000-core
computer clusters.

Our platform provides several advantages:

* Turn fixed into variable costs and pay only for the capacity you need.
Watch our latest video on saving costs with cloudnumbers.com:
http://www.youtube.com/watch?v=ln_BSVigUhg&feature=player_embedded

* Enter the cloud using an intuitive and user friendly platform. Watch
our latest cloudnumbers.com in a nutshell video:
http://www.youtube.com/watch?v=0ZNEpR_ElV0&feature=player_embedded

* Free yourself from ongoing technological obsolescence and continuous
maintenance costs (e.g. linking to libraries or system dependencies).

* Accelerate your C, C++, Fortran, R, Python, ... calculations through
parallel processing and great computing capacity - more than 1000 cores
are available, and GPUs are coming soon.

* Share your results worldwide (coming soon).

* Get high-speed access to public databases (please let us know if your
favorite database is missing!).

* We have developed a security architecture that meets high requirements
of data security and privacy. Read our security white paper:
http://d1372nki7bx5yg.cloudfront.net/wp-content/uploads/2011/06/cloudnumberscom-security.whitepaper.pdf


This is only a selection of our top features. For more information,
check out our web page (http://www.cloudnumbers.com/) or follow our blog
about cloud computing, HPC and HPC applications:
http://cloudnumbers.com/blog

Register and test for free now at cloudnumbers.com:
http://my.cloudnumbers.com/register

We look forward to your feedback and consumer insights. Take
the chance to have an impact on the development of a new cloud
computing platform.

Best
Markus


-- 
Dr. rer. nat. Markus Schmidberger 
Senior Community Manager 

Cloudnumbers.com GmbH
Chausseestraße 6
10119 Berlin 

www.cloudnumbers.com 
E-Mail: markus.schmidber...@cloudnumbers.com 


* 
Amtsgericht München, HRB 191138 
Geschäftsführer: Erik Muttersbach, Markus Fensterer, Moritz v. 
Petersdorff-Campen 



[OMPI users] open-mpi error

2011-11-24 Thread Markus Stiller

Hello,

I have a problem with MPI. I already looked in the FAQ and on Google but
could not find a solution.


To build Open MPI I used this:
shell$ ./configure --prefix=/opt/mpirun
<...lots of output...>
shell$ make all install

That worked fine so far. I am using DL_POLY with this makefile:
$(MAKE) LD="mpif90 -o" LDFLAGS="-O3" \
FC="mpif90 -c" FCFLAGS="-O3" \
EX=$(EX) BINROOT=$(BINROOT) $(TYPE)

This worked fine too; the problem occurs when I want to run a job with
mpiexec -n 4 ./DLPOLY.Z   or
mpirun -n 4 ./DLPOLY.z

I get this error:
--
[linux-6wa6:02927] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
orterun.c at line 543
markus@linux-6wa6:/media/808CCB178CCB069E/MD Simulations/Test Simu1> 
sudo mpiexec -n 4 ./DLPOLY.Z
[linux-6wa6:03731] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
runtime/orte_init.c at line 125

--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--
[linux-6wa6:03731] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
orterun.c at line 543



Some information:
I use Open MPI 1.4.4, SUSE 64-bit, AMD quad-core.

make check gives:
make: *** No rule to make target `check'.  Stop.
I attached the ompi_info output.

Thanks a lot for your help,

regards,
Markus

markus@linux-6wa6:/media/808CCB178CCB069E/MD Simulations/Test Simu1> ompi_info 
--all
 Package: Open MPI abuild@build08 Distribution
Open MPI: 1.4.3
   Open MPI SVN revision: r23834
   Open MPI release date: Oct 05, 2010
Open RTE: 1.4.3
   Open RTE SVN revision: r23834
   Open RTE release date: Oct 05, 2010
OPAL: 1.4.3
   OPAL SVN revision: r23834
   OPAL release date: Oct 05, 2010
Ident string: 1.4.3
   MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.4.3)
  MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.4.3)
   MCA timer: linux (MCA v2.0, API v2.0, Component v1.4.3)
 MCA installdirs: env (MCA v2.0, API v2.0, Component v1.4.3)
 MCA installdirs: config (MCA v2.0, API v2.0, Component v1.4.3)
  Prefix: /usr/lib64/mpi/gcc/openmpi
 Exec_prefix: /usr/lib64/mpi/gcc/openmpi
  Bindir: /usr/lib64/mpi/gcc/openmpi/bin
 Sbindir: /usr/lib64/mpi/gcc/openmpi/sbin
  Libdir: /usr/lib64/mpi/gcc/openmpi/lib64
  Incdir: /usr/lib64/mpi/gcc/openmpi/include
  Mandir: /usr/lib64/mpi/gcc/openmpi/share/man
   Pkglibdir: /usr/lib64/mpi/gcc/openmpi/lib64/openmpi
  Libexecdir: /usr/lib64/mpi/gcc/openmpi/lib
 Datarootdir: /usr/lib64/mpi/gcc/openmpi/share
 Datadir: /usr/lib64/mpi/gcc/openmpi/share
  Sysconfdir: /etc
  Sharedstatedir: /usr/lib64/mpi/gcc/openmpi/com
   Localstatedir: /var
 Infodir: /usr/lib64/mpi/gcc/openmpi/share/info
  Pkgdatadir: /usr/lib64/mpi/gcc/openmpi/share/openmpi
   Pkglibdir: /usr/lib64/mpi/gcc/openmpi/lib64/openmpi
   Pkgincludedir: /usr/lib64/mpi/gcc/openmpi/include/openmpi
 Configured architecture: x86_64-suse-linux-gnu
  Configure host: build08
   Configured by: abuild
   Configured on: Sat Oct 29 15:50:22 UTC 2011
  Configure host: build08
Built by: abuild
Built on: Sat Oct 29 16:04:18 UTC 2011
  Built host: build08
  C bindings: yes
C++ bindings: yes
  Fortran77 bindings: yes (all)
  Fortran90 bindings: yes
 Fortran90 bindings size: small
  C compiler: gcc
 C compiler absolute: /usr/bin/gcc
 C char size: 1
 C bool size: 1
C short size: 2
  C int size: 4
 C long size: 8
C float size: 4
   C double size: 8
  C pointer size: 8
C char align: 1
C bool align: 1
 C int align: 4
   C float align: 4
  C double align: 8
C++ compiler: g++
   C++ compiler absolute: /usr/bin/g++
  Fortran77 compiler: gfortran
  Fortran77 compiler abs: /usr/bin/gfortran
  Fortran90 compiler: gfortran
  Fortran90 compiler abs: /usr/bin/gfortran
   Fort integer size: 4
   Fort logical size: 4
 Fort lo

Re: [OMPI users] open-mpi error

2011-11-24 Thread Markus Stiller

On 11/24/2011 10:08 PM, MM wrote:

Hi

I get the same error while linking against home built 1.5.4 openmpi libs on
win32.
I didn't get this error against the prebuilt libs.

I see you use Suse. There probably is a openmpi.rpm or openmpi.dpkg already
available for Suse which contains the libraries and you could link against
those and that may work

MM

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Markus Stiller
Sent: 24 November 2011 20:41
To: us...@open-mpi.org
Subject: [OMPI users] open-mpi error

Hello,

i have some problem with mpi, i looked in the FAQ and google already but i
couldnt find a solution.

To build mpi i used this:
shell$ ./configure --prefix=/opt/mpirun
<...lots of output...>
shell$ make all install

Worked fine so far. I am using dlpoly, and this makefile:
  $(MAKE) LD="mpif90 -o" LDFLAGS="-O3" \
  FC="mpif90 -c" FCFLAGS="-O3" \
  EX=$(EX) BINROOT=$(BINROOT) $(TYPE)

This worked fine too,
the problem occurs when i want to run a job with
mpiexec -n 4 ./DLPOLY.Z   or
mpirun -n 4 ./DLPOLY.z

I get this error:
--
[linux-6wa6:02927] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
orterun.c at line 543 markus@linux-6wa6:/media/808CCB178CCB069E/MD
Simulations/Test Simu1>  sudo mpiexec -n 4 ./DLPOLY.Z [linux-6wa6:03731]
[[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at
line 125
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can fail
during orte_init; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

orte_ess_base_select failed
-->  Returned value Not found (-13) instead of ORTE_SUCCESS
--
[linux-6wa6:03731] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
orterun.c at line 543


Some Informations:
I use Open MPI 1.4.4, Suse 64bit, AMD quadcore

make check gives:
make: *** No rule to make target `check'.  Stop.
I attached the ompi_info.

Thx alot for your help,

regards,
Markus


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


Hi,

Thanks for your answer.
When I try this (with MPICH) I get problems with DL_POLY itself:
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../x86_64-suse-linux/bin/ld: 
cannot find -lmpi_f90
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../x86_64-suse-linux/bin/ld: 
cannot find -lmpi_f77
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../x86_64-suse-linux/bin/ld: 
cannot find -lmpi
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../x86_64-suse-linux/bin/ld: 
cannot find -lopen-rte
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../x86_64-suse-linux/bin/ld: 
cannot find -lopen-pal


I do not really know how to get rid of this either ^^





Re: [OMPI users] open-mpi error

2011-11-24 Thread Markus Stiller

On 11/24/2011 10:08 PM, MM wrote:

Hi

I get the same error while linking against home built 1.5.4 openmpi libs on
win32.
I didn't get this error against the prebuilt libs.

I see you use Suse. There probably is a openmpi.rpm or openmpi.dpkg already
available for Suse which contains the libraries and you could link against
those and that may work

MM

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Markus Stiller
Sent: 24 November 2011 20:41
To: us...@open-mpi.org
Subject: [OMPI users] open-mpi error

Hello,

i have some problem with mpi, i looked in the FAQ and google already but i
couldnt find a solution.

To build mpi i used this:
shell$ ./configure --prefix=/opt/mpirun
<...lots of output...>
shell$ make all install

Worked fine so far. I am using dlpoly, and this makefile:
  $(MAKE) LD="mpif90 -o" LDFLAGS="-O3" \
  FC="mpif90 -c" FCFLAGS="-O3" \
  EX=$(EX) BINROOT=$(BINROOT) $(TYPE)

This worked fine too,
the problem occurs when i want to run a job with
mpiexec -n 4 ./DLPOLY.Z   or
mpirun -n 4 ./DLPOLY.z

I get this error:
--
[linux-6wa6:02927] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
orterun.c at line 543 markus@linux-6wa6:/media/808CCB178CCB069E/MD
Simulations/Test Simu1>  sudo mpiexec -n 4 ./DLPOLY.Z [linux-6wa6:03731]
[[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at
line 125
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can fail
during orte_init; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

orte_ess_base_select failed
-->  Returned value Not found (-13) instead of ORTE_SUCCESS
--
[linux-6wa6:03731] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
orterun.c at line 543


Some Informations:
I use Open MPI 1.4.4, Suse 64bit, AMD quadcore

make check gives:
make: *** No rule to make target `check'.  Stop.
I attached the ompi_info.

Thx alot for your help,

regards,
Markus


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Now I have rebuilt Open MPI, but I am now getting stuff like this:

..
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: 
undefined reference to `lam_ssi_base_param_find'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: 
undefined reference to `asc_parse'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: 
undefined reference to `lam_ssi_base_param_register_string'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: 
undefined reference to `lam_ssi_base_param_register_int'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: 
undefined reference to `lampanic'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: 
undefined reference to `lam_thread_self'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: 
undefined reference to `lam_debug_close'
/usr/local/lib64/libmpi_f77.so: undefined reference to 
`MPI_CONVERSION_FN_NULL'
/usr/local/lib64/libmpi_f77.so: undefined reference to 
`MPI_File_read_at_all'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: 
undefined reference to `sfh_sock_set_buf_size'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: 
undefined reference to `blktype'
/usr/local/lib64/libmpi_f77.so: undefined reference to 
`MPI_File_preallocate'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: 
undefined reference to `ao_init'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: 
undefined reference to `lam_mutex_destroy'
/usr/local/lib64/libmpi_f77.so: undefined reference to 
`MPI_File_iread_shared'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: 
undefined reference to `al_init'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: 
undefined reference to `stoi'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: 
undefined reference to `lam_ssi_base_hostmap'
/usr/local/lib64/libmpi_f77.so: undefined reference to 
`MPI_FORTRAN_ERRCODES_IGNORE'

/usr/local/lib64/libmpi_f77.so: undefined reference to `MPI_File_close'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: 
undefined reference to `al_next'
/usr/local/lib64/libmpi_f77.so: undefined reference to 
`MPI_Register_datarep'
/usr/loca

Re: [OMPI users] open-mpi error

2011-11-26 Thread Markus Stiller

Hi Castain,


You have some major problems with confused installations of MPIs. First, you cannot
compile an application against MPICH and expect to run it with OMPI - the two are
not binary compatible. You need to compile against the MPI installation you intend
to run against.


I did this, sorry I didn't mention it. I tried MPICH and Open MPI, and of
course in each case I compiled against MPICH and Open MPI respectively.



Second, your errors appear to be because you are not pointing your library path at the
OMPI installation, and so the libraries are not being found. You need to set
LD_LIBRARY_PATH to include the path to where you installed OMPI. Based on the
configure line you give, that would mean ensuring that /opt/mpirun/lib was in that
envar. Likewise, /opt/mpirun/bin needs to be in your PATH.


Hmm, I installed Open MPI in the standard location, changed the variables
accordingly, and this works now.

But now I have the same problem again (the problem I wrote to you about in
the first place):

markus@linux-6wa6:/media/808CCB178CCB069E/MD Simulations/Test Simu1> 
sudo mpirun -n 4 ./DLPOLY.Z

root's password:
[linux-6wa6:05565] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
runtime/orte_init.c at line 125

--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--
[linux-6wa6:05565] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
orterun.c at line 543




What can I do about this?
Thanks,
Markus

On 11/25/2011 03:42 AM, Ralph Castain wrote:

Hi Markus

You have some major problems with confused installations of MPIs. First, you 
cannot compile an application against MPICH and expect to run it with OMPI - 
the two are not binary compatible. You need to compile against the MPI 
installation you intend to run against.

Second, your errors appear to be because you are not pointing your library path 
at the OMPI installation, and so the libraries are not being found. You need to 
set LD_LIBRARY_PATH to include the path to where you installed OMPI. Based on 
the configure line you give, that would mean ensuring that /opt/mpirun/lib was 
in that envar. Likewise, /opt/mpirun/bin needs to be in your PATH.

Once you have those correctly set, and build your app against the appropriate 
mpicc, you should be able to run.

BTW: your last message indicates that you built against an old LAM MPI, so you 
appear to have some pretty old software laying around. Perhaps cleaning out 
some of the old MPI installations would help.


On Nov 24, 2011, at 4:32 PM, Markus Stiller wrote:


On 11/24/2011 10:08 PM, MM wrote:

Hi

I get the same error while linking against home built 1.5.4 openmpi libs on
win32.
I didn't get this error against the prebuilt libs.

I see you use Suse. There probably is a openmpi.rpm or openmpi.dpkg already
available for Suse which contains the libraries and you could link against
those and that may work

MM

-Original Message-
From:users-boun...@open-mpi.org  [mailto:users-boun...@open-mpi.org] On
Behalf Of Markus Stiller
Sent: 24 November 2011 20:41
To:us...@open-mpi.org
Subject: [OMPI users] open-mpi error

Hello,

i have some problem with mpi, i looked in the FAQ and google already but i
couldnt find a solution.

To build mpi i used this:
shell$ ./configure --prefix=/opt/mpirun
<...lots of output...>
shell$ make all install

Worked fine so far. I am using dlpoly, and this makefile:
  $(MAKE) LD="mpif90 -o" LDFLAGS="-O3" \
  FC="mpif90 -c" FCFLAGS="-O3" \
  EX=$(EX) BINROOT=$(BINROOT) $(TYPE)

This worked fine too,
the problem occurs when i want to run a job with
mpiexec -n 4 ./DLPOLY.Z   or
mpirun -n 4 ./DLPOLY.z

I get this error:
--
[linux-6wa6:02927] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
orterun.c at line 543 markus@linux-6wa6:/media/808CCB178CCB069E/MD
Simulations/Test Simu1>   sudo mpiexec -n 4 ./DLPOLY.Z [linux-6wa6:03731]
[[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at
line 125
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can fail
during orte_init; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; he

[OMPI users] Problems with btl openib and MPI_THREAD_MULTIPLE

2012-11-07 Thread Markus Wittmann
Hello,

I've compiled Open MPI 1.6.3 with --enable-mpi-thread-multiple --with-tm
--with-openib --enable-opal-multi-threads.

When I use, for example, the PingPong benchmark from the Intel MPI
Benchmarks, which calls MPI_Init, the btl openib is used and everything
works fine.

When the benchmark instead calls MPI_Init_thread with
MPI_THREAD_MULTIPLE as the requested threading level, the btl openib fails
to load but gives no further hint as to the reason:

mpirun -v -n 2 -npernode 1 -gmca btl_base_verbose 200 ./imb-tm-openmpi-ts pingpong

...
[l0519:08267] select: initializing btl component openib
[l0519:08267] select: init of component openib returned failure
[l0519:08267] select: module openib unloaded
...

The question now is: is support for MPI_THREAD_MULTIPLE simply missing in
the openib module, or are other errors occurring, and if so, how can I
identify them?

Attached is the config.log from the Open MPI build, the ompi_info
output and the output of the IMB PingPong benchmark.

The systems used were two nodes with:

  - OpenFabrics 1.5.3
  - CentOS release 5.8 (Final)
  - Linux Kernel 2.6.18-308.11.1.el5 x86_64
  - OpenSM 3.3.3

[l0519] src > ibv_devinfo
hca_id: mlx4_0
transport:  InfiniBand (0)
fw_ver: 2.7.000
node_guid:  0030:48ff:fff6:31e4
sys_image_guid: 0030:48ff:fff6:31e7
vendor_id:  0x02c9
vendor_part_id: 26428
hw_ver: 0xB0
board_id:   SM_212201000
phys_port_cnt:  1
port:   1
state:  PORT_ACTIVE (4)
max_mtu:2048 (4)
active_mtu: 2048 (4)
sm_lid: 48
port_lid:   278
port_lmc:   0x00

Thanks for the help in advance.

Regards,
Markus


-- 
Markus Wittmann, HPC Services
Friedrich-Alexander-Universität Erlangen-Nürnberg
Regionales Rechenzentrum Erlangen (RRZE)
Martensstrasse 1, 91058 Erlangen, Germany
Tel.: +49 9131 85-20104
markus.wittm...@fau.de
http://www.rrze.fau.de/hpc/


imb.txt.bz2
Description: application/bzip


imb-tm.txt.bz2
Description: application/bzip


ompi_info.txt.bz2
Description: application/bzip


config.log.bz2
Description: application/bzip


Re: [OMPI users] Problems with btl openib and MPI_THREAD_MULTIPLE

2012-11-08 Thread Markus Wittmann

Hi,

OK, that makes it clear.

Thank you for the fast response.

Regards,
Markus

On 07.11.2012 13:49, Iliev, Hristo wrote:

Hello, Markus,

The openib BTL component is not thread-safe. It disables itself when
the thread support level is MPI_THREAD_MULTIPLE. See this rant from
one of my colleagues:

http://www.open-mpi.org/community/lists/devel/2012/10/11584.php

A message is shown but only if the library was compiled with
developer-level debugging.

Open MPI guys, could the debug-level message in
btl_openib_component.c:btl_openib_component_init() be replaced by a
help text, e.g. the same way that the help text about the amount of
registerable memory not being enough is shown. Looks like the case of
openib being disabled for no apparent reason when MPI_THREAD_MULTIPLE
is in effect is not isolated to our users only. Or at least could you
put somewhere in the FAQ an explicit statement that openib is not
only not thread-safe, but that it would disable itself in a
multithreaded environment.

Kind regards,
Hristo
--
Hristo Iliev, Ph.D. -- High Performance Computing
RWTH Aachen University, Center for Computing and Communication
Rechen- und Kommunikationszentrum der RWTH Aachen
Seffenter Weg 23, D 52074 Aachen (Germany)
Tel: +49 241 80 24367 -- Fax/UMS: +49 241 80 624367


-Original Message- From: users-boun...@open-mpi.org
[mailto:users-boun...@open-mpi.org] On Behalf Of Markus Wittmann
Sent: Wednesday, November 07, 2012 1:14 PM To: us...@open-mpi.org
Subject: [OMPI users] Problems with btl openib and
MPI_THREAD_MULTIPLE

Hello,

I've compiled Open MPI 1.6.3 with --enable-mpi-thread-multiple
-with-tm - with-openib --enable-opal-multi-threads.

When I use for example the pingpong benchmark from the Intel MPI
Benchmarks, which calls MPI_Init, the btl openib is used and everything
works fine.

When instead the benchmark calls MPI_Init_thread with
MPI_THREAD_MULTIPLE as requested threading level the btl openib
fails to load but gives no further hints for the reason:

mpirun -v -n 2 -npernode 1 -gmca btl_base_verbose 200 ./imb-tm-openmpi-ts pingpong

...
[l0519:08267] select: initializing btl component openib
[l0519:08267] select: init of component openib returned failure
[l0519:08267] select: module openib unloaded
...

The question is now, is currently just the support for
MPI_THREADM_MULTIPLE missing in the openib module or are there
other errors occurring and if so, how to identify them.

Attached ist the config.log from the Open MPI build, the ompi_info
output and the output of the IMB pingpong bechmarks.

As system used were two nodes with:

- OpenFabrics 1.5.3
- CentOS release 5.8 (Final)
- Linux Kernel 2.6.18-308.11.1.el5 x86_64
- OpenSM 3.3.3

[l0519] src > ibv_devinfo
hca_id: mlx4_0
transport:      InfiniBand (0)
fw_ver:         2.7.000
node_guid:      0030:48ff:fff6:31e4
sys_image_guid: 0030:48ff:fff6:31e7
vendor_id:      0x02c9
vendor_part_id: 26428
hw_ver:         0xB0
board_id:       SM_212201000
phys_port_cnt:  1
port:           1
state:          PORT_ACTIVE (4)
max_mtu:        2048 (4)
active_mtu:     2048 (4)
sm_lid:         48
port_lid:       278
port_lmc:       0x00

Thanks for the help in advance.

Regards, Markus


-- Markus Wittmann, HPC Services Friedrich-Alexander-Universität
Erlangen-Nürnberg Regionales Rechenzentrum Erlangen (RRZE)
Martensstrasse 1, 91058 Erlangen, Germany Tel.: +49 9131 85-20104
markus.wittm...@fau.de http://www.rrze.fau.de/hpc/


___ users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Markus Wittmann, HPC Services
Friedrich-Alexander-Universität Erlangen-Nürnberg
Regionales Rechenzentrum Erlangen (RRZE)
Martensstrasse 1, 91058 Erlangen, Germany
Tel.: +49 9131 85-20104
markus.wittm...@fau.de
http://www.rrze.fau.de/hpc/


Re: [OMPI users] MPI and C++ - now Send and Receive of Classes and STL containers

2009-07-07 Thread Markus Blatt
Hi,

On Mon, Jul 06, 2009 at 03:24:07PM -0400, Luis Vitorio Cargnini wrote:
> Thanks, but I really do not want to use Boost.
> Is it easier? Certainly, but I want to make it using only MPI itself
> and not be dependent on a library, or on templates like the majority of
> Boost, a huge set of templates and wrappers for different libraries,
> implemented in C, supplying a wrapper for C++.
> I admit Boost is a valuable tool, but in my case, the more independent
> I can be from additional libs, the better.
>

If you do not want to use Boost, then I suggest not using nested
vectors but just ones that contain PODs as value_type (or even
C-arrays).


If you insist on using complicated containers you will end up
writing your own MPI-C++ abstraction (resulting in a library). This
will be a lot of (unnecessary and hard) work.
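
To illustrate the flat-container suggestion, here is a minimal sketch (not
from the original thread), assuming a std::vector<double>; the ranks, tag
and element count are arbitrary. A vector of a POD type stores its elements
contiguously, so its buffer can be handed straight to the plain C MPI calls
without any wrapper library:

#include <mpi.h>
#include <vector>

// Exchange a flat std::vector<double> between rank 0 and rank 1.
// The vector's contiguous storage (&data[0]) is passed directly to MPI;
// no nested containers and no extra abstraction layer are needed.
void exchangeFlatVector(int rank)
{
    std::vector<double> data(1000, 0.0);
    int count = static_cast<int>(data.size());

    if (rank == 0)
        MPI_Send(&data[0], count, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(&data[0], count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
}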

Just my 2 cents.

Cheers,

Markus



[OMPI users] Problem with cascading derived data types

2009-02-27 Thread Markus Blatt
Hi,

In one of my applications I am using cascaded derived MPI datatypes
created with MPI_Type_struct. One of these types is used to send just
a part (one MPI_CHAR) of a struct consisting of an int followed by two
chars. I.e., the int at the beginning is/should be ignored.

This works fine if I use this data type on its own. 

Unfortunately I need to send another struct that contains an int and
the int-char-char struct from above. Again I construct a custom MPI
data type for this.

When sending this cascaded data type, it seems that the offset of the
char in the inner custom type is disregarded on the receiving end, and
the received data ('1') is stored in the first int instead of the
following char.

I have tested this code with both LAM and MPICH. There it worked as
expected (storing the '1' in the first char).

The last two lines of the output of the attached test case read

received global=10 attribute=0 (local=1 public=0)
received  attribute=1 (local=100 public=0)

for Open MPI, instead of

received global=10 attribute=1 (local=100 public=0)
received  attribute=1 (local=100 public=0)

for lam and mpich.

The same problem occurs when using version 1.3-2 of Open MPI.

Am I doing something completely wrong or have I accidentally found a bug?

Cheers,

Markus


#include"mpi.h"
#include

struct LocalIndex
{
  int local_;
  char attribute_;
  char public_;
};


struct IndexPair
{
  int global_;
  LocalIndex local_;
};


int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);

  int rank, size;

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  if(size<2)
{
  std::cerr<<"no procs has to be >2"<

[OMPI users] OpenMPI 1.2.5 race condition / core dump with MPI_Reduce and MPI_Gather

2008-02-22 Thread John Markus Bjørndalen
d8cb69 in mca_pml_ob1_recv () from 
/home/johnm/local/ompi/lib/openmpi/mca_pml_ob1.so
#1356865 0xb7d5bb1c in ompi_coll_tuned_reduce_intra_basic_linear () from 
/home/johnm/local/ompi/lib/openmpi/mca_coll_tuned.so
#1356866 0xb7d55913 in ompi_coll_tuned_reduce_intra_dec_fixed () from 
/home/johnm/local/ompi/lib/openmpi/mca_coll_tuned.so
#1356867 0xb7f3db6c in PMPI_Reduce () from 
/home/johnm/local/ompi/lib/libmpi.so.0

#1356868 0x0804899e in main (argc=1, argv=0xbfba8a84) at ompi-crash2.c:58
--- snip-

I poked around in the code, and it looks like the culprit might be in 
the macros that try to allocate fragments in 
mca_pml_ob1_recv_frag_match: MCA_PML_OB1_RECV_FRAG_ALLOC and 
MCA_PML_OB1_RECV_FRAG_INIT use OMPI_FREE_LIST_WAIT, which again can end 
up calling opal_condition_wait(). opal_condition_wait() calls 
opal_progress() to "block", which looks like it leads to infinite 
recursion in this case.


I guess the problem is a race condition when one node is hammered with 
incoming packets.


The stack trace contains about 1.35 million lines, so I won't include 
all of it here, but here's some statistics to verify that not much else 
is happening in that stack (I can make the full trace available if 
anybody needs it):


--- snip-
Number of callframes:  1356870
Called function statistics (how often in stackdump):
 PMPI_Reduce1
 _int_malloc1
 main   1
 malloc 1
 mca_btl_tcp_endpoint_recv_handler 339197
 mca_pml_ob1_recv   1
 mca_pml_ob1_recv_frag_match   72
 ompi_coll_tuned_reduce_intra_basic_linear   1
 ompi_coll_tuned_reduce_intra_dec_fixed 1
 ompi_free_list_grow1
 opal_event_base_loop  339197
 opal_event_loop   339197
 opal_progress 339197
 sysconf2
Address statistics (how often in stackdump), plus functions with that addr
(sanity check):
 0x00434184      2  set(['sysconf'])
 0x0804899e      1  set(['main'])
 0xb7d55913      1  set(['ompi_coll_tuned_reduce_intra_dec_fixed'])
 0xb7d5bb1c      1  set(['ompi_coll_tuned_reduce_intra_basic_linear'])
 0xb7d74a7d     72  set(['mca_btl_tcp_endpoint_recv_handler'])
 0xb7d74e70      1  set(['mca_btl_tcp_endpoint_recv_handler'])
 0xb7d74f08 339124  set(['mca_btl_tcp_endpoint_recv_handler'])
 0xb7d8cb69      1  set(['mca_pml_ob1_recv'])
 0xb7d8f389     72  set(['mca_pml_ob1_recv_frag_match'])
 0xb7e5d284 339197  set(['opal_progress'])
 0xb7e62b44 339197  set(['opal_event_base_loop'])
 0xb7e62cff 339197  set(['opal_event_loop'])
 0xb7e78b59      1  set(['_int_malloc'])
 0xb7e799ce      1  set(['malloc'])
 0xb7f04852      1  set(['ompi_free_list_grow'])
 0xb7f3db6c      1  set(['PMPI_Reduce'])
--- snip-

I don't have any suggestions for a fix though, since this is the first 
time I've looked into the OpenMPI code.


Btw. In case it makes a difference for triggering the bug: I'm running 
this on a cluster with 1 frontend and 44 nodes. The cluster runs Rocks 
4.1, and each of the nodes are 3.2GHz P4 Prescott machines with 2GB RAM, 
connected with gigabit Ethernet.



Regards,

--
// John Markus Bjørndalen
// http://www.cs.uit.no/~johnm/




Re: [OMPI users] OpenMPI 1.2.5 race condition / core dump with MPI_Reduce and MPI_Gather

2008-02-28 Thread John Markus Bjørndalen

Hi, and thanks for the feedback everyone.

George Bosilca wrote:
Brian is completely right. Here is a more detailed description of this 
problem.

[]
On the other side, I hope that not many users write such applications. 
This is the best way to completely kill the performances of any MPI 
implementation, by overloading one process with messages. This is 
exactly what MPI_Reduce and MPI_Gather do, one process will get the 
final result and all other processes only have to send some data. This 
behavior only arises when the gather or the reduce use a very flat 
tree, and only for short messages. Because of the short messages there 
is no handshake between the sender and the receiver, which will make 
all messages unexpected, and the flat tree guarantee that there will 
be a lot of small messages. If you add a barrier every now and then 
(100 iterations) this problem will never happens.
I have done some more testing. Of the tested parameters, I'm observing 
this behaviour with group sizes from 16-44, and from 1 to 32768 integers 
in MPI_Reduce. For MPI_Gather, I'm observing crashes with group sizes 
16-44 and from 1 to 4096 integers (per node).


In other words, it actually happens with other tree configurations and 
larger packet sizes :-/


By the way, I'm also observing crashes with MPI_Broadcast (groups of 
size 4-44 with the root process (rank 0) broadcasting integer arrays of 
size 16384 and 32768).  It looks like the root process is crashing. Can 
a sender crash because it runs out of buffer space as well?


-- snip --
/home/johnm/local/ompi/bin/mpirun -hostfile lamhosts.all.r360 -np 4 
./ompi-crash  16384 1 3000
{  'groupsize' : 4, 'count' : 16384, 'bytes' : 65536, 'bufbytes' : 
262144, 'iters' : 3000, 'bmno' : 1
[compute-0-0][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] 
mca_btl_tcp_frag_recv: readv failed with errno=104
mpirun noticed that job rank 0 with PID 16366 on node compute-0-0 exited 
on signal 15 (Terminated).

3 additional processes aborted (not shown)
-- snip --


One more thing, doing a lot of collective in a loop and computing the 
total time is not the correct way to evaluate the cost of any 
collective communication, simply because you will favor all algorithms 
based on pipelining. There is plenty of literature about this topic.


  george.
As I said in the original e-mail: I had only thrown them in for a bit of 
sanity checking. I expected funny numbers, but not that OpenMPI would 
crash.


The original idea was just to make a quick comparison of Allreduce, 
Allgather and Alltoall in LAM and OpenMPI. The opportunity for 
pipelining the operations there is rather small since they can't get 
much out of phase with each other.



Regards,

--
// John Markus Bjørndalen
// http://www.cs.uit.no/~johnm/




Re: [OMPI users] OpenMPI 1.2.5 race condition / core dump with MPI_Reduce and MPI_Gather

2008-02-29 Thread John Markus Bjørndalen

George Bosilca wrote:


[.]


I don't think the root crashed. I guess that one of the other nodes 
crashed, the root got a bad socket (which is what the first error 
message seems to indicate), and get terminated. As the output is not 
synchronized between the nodes, one cannot rely on its order nor 
contents. Moreover, mpirun report that the root was killed with signal 
15, which is how we cleanup the remaining processes when we detect 
that something really bad (like a seg fault) happened in the parallel 
application.


Sorry, I should have rephrased that as a question ("is it the root?"). 
I'm not that familiar with the debug output of OpenMPI yet, so I 
included it in case somebody made more sense of it than me.




There are many differences between the routed and non routed 
collectives. All errors that you reported so far are related to rooted 
collectives, which make sense. I didn't state that it is normal that 
Open MPI do not behave [sic]. I wonder if you can get such errors with 
non routed collectives (such as allreduce, allgather and alltoall), or 
with messages larger than the eager size ?

You're right, I haven't seen any crashes with the All*-variants.

TCP eager limit is set to 65536 (output from ompi_info):

MCA btl: parameter "btl_tcp_eager_limit" (current value: "65536")
MCA btl: parameter "btl_tcp_min_send_size" (current value: "65536")
MCA btl: parameter "btl_tcp_max_send_size" (current value: "131072")

I observed crashes with Broadcasts and Reduces of 131072 bytes. I'm
playing around with larger messages now, and while Reduce with 16 nodes
seems stable at 262144-byte messages, it still crashes with 44 nodes.




If you type "ompi_info --param btl tcp", you will see what is the 
eager size for the TCP BTL. Everything smaller than this size will be 
send eagerly; have the opportunity to became unexpected on the 
receiver side and can lead to this problem. As a quick test, you can 
add "--mca btl_tcp_eager_limit 2048" to your mpirun command line, and 
this problem will not happen with for size over the 2K. This was the 
original solution for the flow control problem. If you know your 
application will generate thousands of unexpected messages, then you 
should set the eager limit to zero.
I tried running Reduce with 4096 ints (16384 bytes), 16 nodes and eager 
limit 2048:


mpirun -hostfile lamhosts.all.r360 -np 16 --mca btl_tcp_eager_limit 2048 
./ompi-crash 4096 2 3000
{  'groupsize' : 16, 'count' : 4096, 'bytes' : 16384, 'bufbytes' : 
262144, 'iters' : 3000, 'bmno' : 2
[compute-2-2][0,1,10][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] 
[compute-3-2][0,1,14][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] 
mca_btl_tcp_frag_recv: readv failed with errno=104

mca_btl_tcp_frag_recv: readv failed with errno=104
mpirun noticed that job rank 0 with PID 30407 on node compute-0-0 exited 
on signal 15 (Terminated).

15 additional processes aborted (not shown)

This one tries to run Reduce with 1 integer per node and also crashes 
(with eager size 0):


/mpirun -hostfile lamhosts.all.r360 -np 16 --mca btl_tcp_eager_limit 0 
./ompi-crash 1 2 3000

...

This is puzzling.


I'm mostly familiarizing myself with Open MPI at the moment, as well as
poking around to see how the collective operations work and perform
compared to LAM, partly because I have some ideas I'd like to test out,
and partly because I'm considering moving some student exercises over
from LAM to Open MPI. I don't expect to write actual applications that
treat MPI like this myself, but on the other hand, not having to do
throttling on top of MPI could be an advantage in some application
patterns.



Regards,

--
// John Markus Bjørndalen
// http://www.cs.uit.no/~johnm/