Re: [OMPI users] openib RETRY EXCEEDED ERROR

2009-02-27 Thread Bogdan Costescu


Brett Pemberton  wrote:

[[1176,1],0][btl_openib_component.c:2905:handle_wc] from 
tango092.vpac.org to: tango090 error polling LP CQ with status RETRY 
EXCEEDED ERROR status number 12 for wr_id 38996224 opcode 0 qp_idx 0


I've seen this error with Mellanox ConnectX cards and OFED 1.2.x with 
all versions of OpenMPI that I have tried (1.2.x and pre-1.3) and some 
MVAPICH versions, from which I have concluded that the problem lies in 
the lower levels (OFED or IB card firmware). Indeed after the 
installation of OFED 1.3.x and a possible firmware update (not sure 
about the firmware as I don't admin that cluster), these errors have 
disappeared.


--
Bogdan Costescu

IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: bogdan.coste...@iwr.uni-heidelberg.de


Re: [OMPI users] openib RETRY EXCEEDED ERROR

2009-02-27 Thread Biagio Lucini

Bogdan Costescu wrote:


Brett Pemberton  wrote:


[[1176,1],0][btl_openib_component.c:2905:handle_wc] from
tango092.vpac.org to: tango090 error polling LP CQ with status RETRY
EXCEEDED ERROR status number 12 for wr_id 38996224 opcode 0 qp_idx 0


I've seen this error with Mellanox ConnectX cards and OFED 1.2.x with
all versions of OpenMPI that I have tried (1.2.x and pre-1.3) and some
MVAPICH versions, from which I have concluded that the problem lies in
the lower levels (OFED or IB card firmware). Indeed after the
installation of OFED 1.3.x and a possible firmware update (not sure
about the firmware as I don't admin that cluster), these errors have
disappeared.



I can confirm this: I had a similar problem over Christmas, for which I 
asked for help on this list. In fact, the problem was not with OpenMPI 
but with the OFED stack: an upgrade of the latter (and an upgrade of the 
firmware, although once again the OFED drivers were complaining about 
the firmware being too old) fixed the problem. We did both upgrades at 
once, so, as in Brett's case, I am not sure which one played the major role.


Biagio

--
=

Dr. Biagio Lucini   
Department of Physics, Swansea University
Singleton Park, SA2 8PP Swansea (UK)
Tel. +44 (0)1792 602284

=


[OMPI users] libmpi_f90.so not being built

2009-02-27 Thread Tiago Silva

Hi,

I am trying to build OpenMPI 1.3 on CentOS with gcc and the Lahey f95 
compiler, with the following configuration:


./configure F77=/share/apps/lf6481/bin/lfc FC=/share/apps/lf6481/bin/lfc 
  --prefix=/opt/openmpi-1.3_lfc


When I "make install all" the process fails to build libmpi_f90.la 
because  libmpi_f90.so.0 isn't found (see output at the end of the 
post). I can't grep any other mention to libmpi_f90.so being built in 
config.log or on the output from the make and indeed it is not on the 
build directory with the other shared libraries:


[root@server lfc]# find . -name "libmpi*.so*"
./ompi/.libs/libmpi.so
./ompi/.libs/libmpi.so.0
./ompi/.libs/libmpi.so.0.0.0
./ompi/.libs/libmpi.so.0.0.0T
./ompi/mpi/cxx/.libs/libmpi_cxx.so.0.0.0
./ompi/mpi/cxx/.libs/libmpi_cxx.so.0.0.0T
./ompi/mpi/cxx/.libs/libmpi_cxx.so.0
./ompi/mpi/cxx/.libs/libmpi_cxx.so
./ompi/mpi/f77/.libs/libmpi_f77.so.0
./ompi/mpi/f77/.libs/libmpi_f77.so.0.0.0
./ompi/mpi/f77/.libs/libmpi_f77.so
./ompi/mpi/f77/.libs/libmpi_f77.so.0.0.0T


I believe that shared libraries for the f90 bindings should be built by 
default, but even trying to force the f90 bindings with shared libraries 
didn't do the trick:


./configure F77=/share/apps/lf6481/bin/lfc FC=/share/apps/lf6481/bin/lfc 
 F90=/share/apps/lf6481/bin/lfc --prefix=/opt/openmpi-1.3_lfc 
--enable-shared --with-mpi_f90_size=medium --enable-mpi-f90


Any suggestions as to what might be going wrong are most welcome.

Thanks,
TS

[root@server lfc]# tail install.out
make[4]: Entering directory `/root/builds/openmpi-1.3/lfc/ompi/mpi/f90'
make[5]: Entering directory `/root/builds/openmpi-1.3/lfc'
make[5]: Leaving directory `/root/builds/openmpi-1.3/lfc'
/bin/sh ../../../libtool --mode=link /share/apps/lf6481/bin/lfc -I../../../ompi/include -I../../../ompi/include -I. -I. -I../../../ompi/mpi/f90 -export-dynamic -o libmpi_f90.la -rpath /opt/openmpi-1.3_lfc/lib mpi.lo mpi_sizeof.lo mpi_comm_spawn_multiple_f90.lo mpi_testall_f90.lo mpi_testsome_f90.lo mpi_waitall_f90.lo mpi_waitsome_f90.lo mpi_wtick_f90.lo mpi_wtime_f90.lo ../../../ompi/libmpi.la -lnsl -lutil -lm
libtool: link: /share/apps/lf6481/bin/lfc -shared .libs/mpi.o .libs/mpi_sizeof.o .libs/mpi_comm_spawn_multiple_f90.o .libs/mpi_testall_f90.o .libs/mpi_testsome_f90.o .libs/mpi_waitall_f90.o .libs/mpi_waitsome_f90.o .libs/mpi_wtick_f90.o .libs/mpi_wtime_f90.o -rpath /root/builds/openmpi-1.3/lfc/ompi/.libs -rpath /root/builds/openmpi-1.3/lfc/orte/.libs -rpath /root/builds/openmpi-1.3/lfc/opal/.libs -rpath /opt/openmpi-1.3_lfc/lib -L/root/builds/openmpi-1.3/lfc/orte/.libs -L/root/builds/openmpi-1.3/lfc/opal/.libs ../../../ompi/.libs/libmpi.so /root/builds/openmpi-1.3/lfc/orte/.libs/libopen-rte.so /root/builds/openmpi-1.3/lfc/opal/.libs/libopen-pal.so -ldl -lnsl -lutil -lm -pthread -soname libmpi_f90.so.0 -o .libs/libmpi_f90.so.0.0.0
ERROR --  Could not find specified object file libmpi_f90.so.0.
make[4]: Leaving directory `/root/builds/openmpi-1.3/lfc/ompi/mpi/f90'
make[3]: Leaving directory `/root/builds/openmpi-1.3/lfc/ompi/mpi/f90'
make[2]: Leaving directory `/root/builds/openmpi-1.3/lfc/ompi/mpi/f90'
make[1]: Leaving directory `/root/builds/openmpi-1.3/lfc/ompi'



[OMPI users] Problem with cascading derived data types

2009-02-27 Thread Markus Blatt
Hi,

In one of my applications I am using cascaded derived MPI datatypes
created with MPI_Type_struct. One of these types is used to send just
a part (one MPI_CHAR) of a struct consisting of an int followed by two
chars, i.e. the int at the beginning is/should be ignored.

This works fine if I use this data type on its own. 

Unfortunately I need to send another struct that contains an int and
the int-char-char struct from above. Again I construct a custom MPI
data type for this.

When sending this cascaded data type, it seems that the offset of the
char in the inner custom type is disregarded on the receiving end, and
the received data ('1') is stored in the first int instead of in the
following char.

I have tested this code with both lam and mpich. There it worked as
expected (saving the '1' in the first char).

The last two lines of the output of the attached test case read

received global=10 attribute=0 (local=1 public=0)
received  attribute=1 (local=100 public=0)

for openmpi instead of

received global=10 attribute=1 (local=100 public=0)
received  attribute=1 (local=100 public=0)

for lam and mpich.

The same problem is experienced when using version 1.3-2 of openmpi.

Am I doing something completely wrong or have I accidentally found a bug?

Cheers,

Markus


#include"mpi.h"
#include

struct LocalIndex
{
  int local_;
  char attribute_;
  char public_;
};


struct IndexPair
{
  int global_;
  LocalIndex local_;
};


int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);

  int rank, size;

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  if(size<2)
{
  std::cerr<<"no procs has to be >2"<

Re: [OMPI users] libmpi_f90.so not being built

2009-02-27 Thread Jeff Squyres

Can you please send all the information listed here:

http://www.open-mpi.org/community/help/



On Feb 27, 2009, at 6:38 AM, Tiago Silva wrote:


Hi,

I am trying to build openmpi 1.3 on Cent_OS with gcc and the lahey  
f95 compiler with the following configuration:


./configure F77=/share/apps/lf6481/bin/lfc FC=/share/apps/lf6481/bin/lfc --prefix=/opt/openmpi-1.3_lfc


When I "make install all" the process fails to build libmpi_f90.la  
because  libmpi_f90.so.0 isn't found (see output at the end of the  
post). I can't grep any other mention to libmpi_f90.so being built  
in config.log or on the output from the make and indeed it is not on  
the build directory with the other shared libraries:


[root@server lfc]# find . -name "libmpi*.so*"
./ompi/.libs/libmpi.so
./ompi/.libs/libmpi.so.0
./ompi/.libs/libmpi.so.0.0.0
./ompi/.libs/libmpi.so.0.0.0T
./ompi/mpi/cxx/.libs/libmpi_cxx.so.0.0.0
./ompi/mpi/cxx/.libs/libmpi_cxx.so.0.0.0T
./ompi/mpi/cxx/.libs/libmpi_cxx.so.0
./ompi/mpi/cxx/.libs/libmpi_cxx.so
./ompi/mpi/f77/.libs/libmpi_f77.so.0
./ompi/mpi/f77/.libs/libmpi_f77.so.0.0.0
./ompi/mpi/f77/.libs/libmpi_f77.so
./ompi/mpi/f77/.libs/libmpi_f77.so.0.0.0T


I believe that shared libraries for f90 bindings should be built by  
default but even trying to force the f90 bindings with shared  
libraries didn't do the trick:


./configure F77=/share/apps/lf6481/bin/lfc FC=/share/apps/lf6481/bin/lfc F90=/share/apps/lf6481/bin/lfc --prefix=/opt/openmpi-1.3_lfc --enable-shared --with-mpi_f90_size=medium --enable-mpi-f90


Any sugestions of what might be going wrong are most welcome.

Thanks,
TS

[root@server lfc]# tail install.out
make[4]: Entering directory `/root/builds/openmpi-1.3/lfc/ompi/mpi/f90'

make[5]: Entering directory `/root/builds/openmpi-1.3/lfc'
make[5]: Leaving directory `/root/builds/openmpi-1.3/lfc'
/bin/sh ../../../libtool --mode=link /share/apps/lf6481/bin/lfc -I../../../ompi/include -I../../../ompi/include -I. -I. -I../../../ompi/mpi/f90 -export-dynamic -o libmpi_f90.la -rpath /opt/openmpi-1.3_lfc/lib mpi.lo mpi_sizeof.lo mpi_comm_spawn_multiple_f90.lo mpi_testall_f90.lo mpi_testsome_f90.lo mpi_waitall_f90.lo mpi_waitsome_f90.lo mpi_wtick_f90.lo mpi_wtime_f90.lo ../../../ompi/libmpi.la -lnsl -lutil -lm
libtool: link: /share/apps/lf6481/bin/lfc -shared .libs/mpi.o .libs/mpi_sizeof.o .libs/mpi_comm_spawn_multiple_f90.o .libs/mpi_testall_f90.o .libs/mpi_testsome_f90.o .libs/mpi_waitall_f90.o .libs/mpi_waitsome_f90.o .libs/mpi_wtick_f90.o .libs/mpi_wtime_f90.o -rpath /root/builds/openmpi-1.3/lfc/ompi/.libs -rpath /root/builds/openmpi-1.3/lfc/orte/.libs -rpath /root/builds/openmpi-1.3/lfc/opal/.libs -rpath /opt/openmpi-1.3_lfc/lib -L/root/builds/openmpi-1.3/lfc/orte/.libs -L/root/builds/openmpi-1.3/lfc/opal/.libs ../../../ompi/.libs/libmpi.so /root/builds/openmpi-1.3/lfc/orte/.libs/libopen-rte.so /root/builds/openmpi-1.3/lfc/opal/.libs/libopen-pal.so -ldl -lnsl -lutil -lm -pthread -soname libmpi_f90.so.0 -o .libs/libmpi_f90.so.0.0.0
ERROR --  Could not find specified object file libmpi_f90.so.0.
make[4]: Leaving directory `/root/builds/openmpi-1.3/lfc/ompi/mpi/f90'
make[3]: Leaving directory `/root/builds/openmpi-1.3/lfc/ompi/mpi/f90'
make[2]: Leaving directory `/root/builds/openmpi-1.3/lfc/ompi/mpi/f90'
make[1]: Leaving directory `/root/builds/openmpi-1.3/lfc/ompi'

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Jeff Squyres
Cisco Systems



Re: [OMPI users] libmpi_f90.so not being built

2009-02-27 Thread Tiago Silva


ok, here is the complete output in the tgz file attached. The output is 
slightly different as I am now only using "make all" and not installing. 
I did a full "make clean" and "rm -fr /*" and the  
already exists but is empty.


Thanks


ts-output.tgz
Description: Binary data


[OMPI users] more XGrid Problems with openmpi1.2.9

2009-02-27 Thread Ricardo Fernández-Perea
Hi
It seems to me more like a timing issue.
All the runs end with something similar to


Exception Type:  EXC_BAD_ACCESS (SIGSEGV)
Exception Codes: KERN_INVALID_ADDRESS at 0x45485308
Crashed Thread:  0

Thread 0 Crashed:
0   libSystem.B.dylib      0x95208f04 strcmp + 84
1   libopen-rte.0.dylib    0x000786fd orte_pls_base_get_active_daemons + 45
2   mca_pls_xgrid.so       0x00271725 orte_pls_xgrid_terminate_orteds + 117 (pls_xgrid_module.m:133)
3   mpirun                 0x20ec orterun + 1896 (orterun.c:468)
4   mpirun                 0x1982 main + 24 (main.c:14)
5   mpirun                 0x193e start + 54

Thread 1:

A simple "mpirun -n 4 ring" gives the following results

 Process 0 sending 10 to 1 tag 201 (4 processes in ring)
 Process 1 exiting
 Process 2 exiting
 Process 3 exiting
 Process 0 sent to 1
 Process 0 decremented value: 9
 Process 0 decremented value: 8
 Process 0 decremented value: 7
 Process 0 decremented value: 6
 Process 0 decremented value: 5
 Process 0 decremented value: 4
 Process 0 decremented value: 3
 Process 0 decremented value: 2
 Process 0 decremented value: 1
 Process 0 decremented value: 0
 Process 0 exiting
[nexus11:38502] *** Process received signal ***
[nexus11:38502] Signal: Segmentation fault (11)
[nexus11:38502] Signal code: Address not mapped (1)
[nexus11:38502] Failing at address: 0x45485308
[nexus11:38502] [ 0] 2   libSystem.B.dylib     0x9526c2bb _sigtramp + 43
[nexus11:38502] [ 1] 3   ???                   0x 0x0 + 4294967295
[nexus11:38502] [ 2] 4   libopen-rte.0.dylib   0x000786fd orte_pls_base_get_active_daemons + 45
[nexus11:38502] [ 3] 5   mca_pls_xgrid.so      0x00271725 orte_pls_xgrid_terminate_orteds + 117
[nexus11:38502] [ 4] 6   mpirun                0x20ec orterun + 1896
[nexus11:38502] [ 5] 7   mpirun                0x1982 main + 24
[nexus11:38502] [ 6] 8   mpirun                0x193e start + 54
[nexus11:38502] [ 7] 9   ???                   0x0004 0x0 + 4
[nexus11:38502] *** End of error message ***
Segmentation fault

Any idea of what I can do?

Ricardo


my ompi_info is
                Open MPI: 1.2.9
   Open MPI SVN revision: r20259
                Open RTE: 1.2.9
   Open RTE SVN revision: r20259
                    OPAL: 1.2.9
       OPAL SVN revision: r20259
                  Prefix: /opt/openmpi
 Configured architecture: i386-apple-darwin9.6.0
           Configured by: sofhtest
           Configured on: Fri Feb 27 11:02:30 CET 2009
          Configure host: nexus10.nlroc
                Built by: sofhtest
                Built on: Fri Feb 27 12:00:08 CET 2009
              Built host: nexus10.nlroc
              C bindings: yes
            C++ bindings: yes
      Fortran77 bindings: yes (single underscore)
      Fortran90 bindings: yes
 Fortran90 bindings size: small
              C compiler: gcc-4.2
     C compiler absolute: /usr/bin/gcc-4.2
            C++ compiler: g++-4.2
   C++ compiler absolute: /usr/bin/g++-4.2
      Fortran77 compiler: gfortran-4.2
  Fortran77 compiler abs: /usr/bin/gfortran-4.2
      Fortran90 compiler: gfortran-4.2
  Fortran90 compiler abs: /usr/bin/gfortran-4.2
             C profiling: yes
           C++ profiling: yes
     Fortran77 profiling: yes
     Fortran90 profiling: yes
          C++ exceptions: no
          Thread support: posix (mpi: no, progress: no)
  Internal debug support: no
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
         libltdl support: yes
   Heterogeneous support: yes
 mpirun default --prefix: no
           MCA backtrace: execinfo (MCA v1.0, API v1.0, Component v1.2.9)
              MCA memory: darwin (MCA v1.0, API v1.0, Component v1.2.9)
           MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2.9)
               MCA timer: darwin (MCA v1.0, API v1.0, Component v1.2.9)
         MCA installdirs: env (MCA v1.0, API v1.0, Component v1.2.9)
         MCA installdirs: config (MCA v1.0, API v1.0, Component v1.2.9)
           MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
           MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
                MCA coll: basic (MCA v1.0, API v1.0, Component v1.2.9)
                MCA coll: self (MCA v1.0, API v1.0, Component v1.2.9)
                MCA coll: sm (MCA v1.0, API v1.0, Component v1.2.9)
                MCA coll: tuned (MCA v1.0, API v1.0, Component v1.2.9)
                  MCA io: romio (MCA v1.0, API v1.0, Component v1.2.9)
               MCA mpool: rdma (MCA v1.0, API v1.0, Component v1.2.9)
               MCA mpool: sm (MCA v1.0, API v1.0, Component v1.2.9)
 MCA p

[OMPI users] Fwd: more XGrid Problems with openmpi1.2.9 (error find)

2009-02-27 Thread Ricardo Fernández-Perea
Found the problem, in orte_pls_xgrid_terminate_orteds:
orte_pls_base_get_active_daemons is being called as

orte_pls_base_get_active_daemons(&daemons, jobid)

when the correct way of calling it is

orte_pls_base_get_active_daemons(&daemons, jobid, attrs)

yours.

Ricardo


Hi
It seems to me more like a timing issue.
All the runs end with something similar to


Exception Type:  EXC_BAD_ACCESS (SIGSEGV)
Exception Codes: KERN_INVALID_ADDRESS at 0x45485308
Crashed Thread:  0

Thread 0 Crashed:
0   libSystem.B.dylib      0x95208f04 strcmp + 84
1   libopen-rte.0.dylib    0x000786fd orte_pls_base_get_active_daemons + 45
2   mca_pls_xgrid.so       0x00271725 orte_pls_xgrid_terminate_orteds + 117 (pls_xgrid_module.m:133)
3   mpirun                 0x20ec orterun + 1896 (orterun.c:468)
4   mpirun                 0x1982 main + 24 (main.c:14)
5   mpirun                 0x193e start + 54

Thread 1:

A simple "mpirun -n 4 ring" gives the following results

 Process 0 sending 10 to 1 tag 201 (4 processes in ring)
 Process 1 exiting
 Process 2 exiting
 Process 3 exiting
 Process 0 sent to 1
 Process 0 decremented value: 9
 Process 0 decremented value: 8
 Process 0 decremented value: 7
 Process 0 decremented value: 6
 Process 0 decremented value: 5
 Process 0 decremented value: 4
 Process 0 decremented value: 3
 Process 0 decremented value: 2
 Process 0 decremented value: 1
 Process 0 decremented value: 0
 Process 0 exiting
[nexus11:38502] *** Process received signal ***
[nexus11:38502] Signal: Segmentation fault (11)
[nexus11:38502] Signal code: Address not mapped (1)
[nexus11:38502] Failing at address: 0x45485308
[nexus11:38502] [ 0] 2   libSystem.B.dylib     0x9526c2bb _sigtramp + 43
[nexus11:38502] [ 1] 3   ???                   0x 0x0 + 4294967295
[nexus11:38502] [ 2] 4   libopen-rte.0.dylib   0x000786fd orte_pls_base_get_active_daemons + 45
[nexus11:38502] [ 3] 5   mca_pls_xgrid.so      0x00271725 orte_pls_xgrid_terminate_orteds + 117
[nexus11:38502] [ 4] 6   mpirun                0x20ec orterun + 1896
[nexus11:38502] [ 5] 7   mpirun                0x1982 main + 24
[nexus11:38502] [ 6] 8   mpirun                0x193e start + 54
[nexus11:38502] [ 7] 9   ???                   0x0004 0x0 + 4
[nexus11:38502] *** End of error message ***
Segmentation fault

Any idea of what I can do?

Ricardo


my ompi_info is
                Open MPI: 1.2.9
   Open MPI SVN revision: r20259
                Open RTE: 1.2.9
   Open RTE SVN revision: r20259
                    OPAL: 1.2.9
       OPAL SVN revision: r20259
                  Prefix: /opt/openmpi
 Configured architecture: i386-apple-darwin9.6.0
           Configured by: sofhtest
           Configured on: Fri Feb 27 11:02:30 CET 2009
          Configure host: nexus10.nlroc
                Built by: sofhtest
                Built on: Fri Feb 27 12:00:08 CET 2009
              Built host: nexus10.nlroc
              C bindings: yes
            C++ bindings: yes
      Fortran77 bindings: yes (single underscore)
      Fortran90 bindings: yes
 Fortran90 bindings size: small
              C compiler: gcc-4.2
     C compiler absolute: /usr/bin/gcc-4.2
            C++ compiler: g++-4.2
   C++ compiler absolute: /usr/bin/g++-4.2
      Fortran77 compiler: gfortran-4.2
  Fortran77 compiler abs: /usr/bin/gfortran-4.2
      Fortran90 compiler: gfortran-4.2
  Fortran90 compiler abs: /usr/bin/gfortran-4.2
             C profiling: yes
           C++ profiling: yes
     Fortran77 profiling: yes
     Fortran90 profiling: yes
          C++ exceptions: no
          Thread support: posix (mpi: no, progress: no)
  Internal debug support: no
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
         libltdl support: yes
   Heterogeneous support: yes
 mpirun default --prefix: no
           MCA backtrace: execinfo (MCA v1.0, API v1.0, Component v1.2.9)
              MCA memory: darwin (MCA v1.0, API v1.0, Component v1.2.9)
           MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2.9)
               MCA timer: darwin (MCA v1.0, API v1.0, Component v1.2.9)
         MCA installdirs: env (MCA v1.0, API v1.0, Component v1.2.9)
         MCA installdirs: config (MCA v1.0, API v1.0, Component v1.2.9)
           MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
           MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
                MCA coll: basic (MCA v1.0, API v1.0, Component v1.2.9)
                MCA coll: self (MCA v1.0, API v1.0, Component v1.2.9)
                MCA coll: sm (MCA v1.0, API v1.0, Component v1.2.9)
                MCA coll: tuned (MCA

Re: [OMPI users] Latest SVN failures

2009-02-27 Thread Rolf Vandevaart


I just tried trunk-1.4a1r20458 and I did not see this error, although my 
configuration was rather different.  I ran across 100 2-CPU sparc nodes, 
np=256, connected with TCP.


Hopefully George's comment helps out with this issue.

One other thought: to see whether SGE has anything to do with this, 
create a hostfile and run the job outside of SGE.


Rolf

On 02/26/09 22:10, Ralph Castain wrote:
FWIW: I tested the trunk tonight using both SLURM and rsh launchers, and 
everything checks out fine. However, this is running under SGE and thus 
using qrsh, so it is possible the SGE support is having a problem.


Perhaps one of the Sun OMPI developers can help here?

Ralph

On Feb 26, 2009, at 7:21 PM, Ralph Castain wrote:

It looks like the system doesn't know what nodes the procs are to be 
placed upon. Can you run this with --display-devel-map? That will tell 
us where the system thinks it is placing things.


Thanks
Ralph

On Feb 26, 2009, at 3:41 PM, Mostyn Lewis wrote:


Maybe it's my pine mailer.

This is a NAMD run on 256 procs across 32 dual-socket quad-core AMD
Shanghai nodes running a standard benchmark called stmv.

The basic error message, which occurs 31 times is like:

[s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file 
../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595


The mpirun command has long paths in it, sorry. It's invoking a 
special binding script which in turn launches the NAMD run. This works 
with an older SVN at level 1.4a1r20123 (for 16, 32, 64, 128 and 512 
procs) but not for this 256-proc run, where the older SVN hangs 
indefinitely polling some completion (sm or openib). So I was trying 
later SVNs with this 256-proc run, hoping the error would go away.

Here's some of the invocation again. Hope you can read it:

EAGER_SIZE=32767
export OMPI_MCA_btl_openib_use_eager_rdma=0
export OMPI_MCA_btl_openib_eager_limit=$EAGER_SIZE
export OMPI_MCA_btl_self_eager_limit=$EAGER_SIZE
export OMPI_MCA_btl_sm_eager_limit=$EAGER_SIZE

and, unexpanded

mpirun --prefix $PREFIX -np %PE% $MCA -x 
OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit 
-x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit 
-machinefile $HOSTS $MPI_BINDER $NAMD2 stmv.namd


and, expanded

mpirun --prefix 
/tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron 
-np 256 --mca btl sm,openib,self -x 
OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit 
-x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit 
-machinefile /tmp/48292.1.all.q/newhosts 
/ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL 
/ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2 
stmv.namd


This is all via Sun Grid Engine.
The OS as indicated above is SuSE SLES 10 SP2.

DM
On Thu, 26 Feb 2009, Ralph Castain wrote:

I'm sorry, but I can't make any sense of this message. Could you 
provide a
little explanation of what you are doing, what the system looks 
like, what is

supposed to happen, etc? I can barely parse your cmd line...

Thanks
Ralph

On Feb 26, 2009, at 1:03 PM, Mostyn Lewis wrote:


Today's and yesterdays.

1.4a1r20643_svn

+ mpirun --prefix /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron -np 256 --mca btl sm,openib,self -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit -machinefile /tmp/48269.1.all.q/newhosts /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2 stmv.namd
[s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
[s0128:24439] [[64102,0],4] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
[s0156:29300] [[64102,0],12] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
[s0168:20585] [[64102,0],20] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
[s0181:19554] [[64102,0],28] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595

Made with INTEL compilers 10.1.015.


Regards,
Mostyn

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




___
users mailing list

Re: [OMPI users] 3.5 seconds before application launches

2009-02-27 Thread Vittorio Giovara
Hello, and thanks for both replies,

I've tried to run a non-MPI program, but I still measured some latency before
it started, around 2 seconds this time.
SSH should be properly configured; in fact I can log in to both machines
without a password, and both openmpi and mvapich use ssh by default.

I've tried these commands:
mpirun  --mca btl ^sm  -np 2 -host node0 -host node1 ./graph
mpirun  --mca btl openib,self  -np 2 -host node0 -host node1 ./graph

and, apart from a slight performance increase in the ^sm benchmark, the latency
is the same.
This is really strange, but I can't figure out the source!
Do you have any other ideas?
thanks
Vittorio

Date: Wed, 25 Feb 2009 20:20:51 -0500
From: Jeff Squyres 
Subject: Re: [OMPI users] 3.5 seconds before application launches
To: Open MPI Users 
Message-ID: <86d3b246-1866-4b84-b05c-4d13659f8...@cisco.com>
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes

Dorian raises a good point.

You might want to try some simple tests of launching non-MPI codes
(e.g., hostname, uptime, etc.) and see how they fare.  Those will more
accurately depict OMPI's launching speeds.  Getting through MPI_INIT
is another matter (although on 2 nodes, the startup should be pretty
darn fast).

Two other things that *may* impact you:

1. Is your ssh speed between the machines slow?  OMPI uses ssh by
default, but will fall back to rsh (or you can force rsh if you
want).  MVAPICH may use rsh by default...?  (I don't actually know)

2. OMPI may be spending time creating shared memory files.  You can
disable OMPI's use of shared memory by running with:

mpirun --mca btl ^sm ...

Meaning "use anything except the 'sm' (shared memory) transport for
MPI messages".


On Feb 25, 2009, at 4:01 PM, doriankrause wrote:

> Vittorio wrote:
>> Hi!
>> I'm using OpenMPI 1.3 on two nodes connected with Infiniband; i'm
>> using
>> Gentoo Linux x86_64.
>>
>> I've noticed that before any application starts there is a variable
>> amount
>> of time (around 3.5 seconds) in which the terminal just hangs with
>> no output
>> and then the application starts and works well.
>>
>> I imagined that there might have been some initialization routine
>> somewhere
>> in the Infiniband layer or in the software stack, but as i
>> continued my
>> tests i observed that this "latency" time is not present in other MPI
>> implementations (like mvapich2) where my application starts
>> immediately (but
>> performs worse).
>>
>> Is my MPI configuration/installation broken or is this expected
>> behaviour?
>>
>
> Hi,
>
> I'm not really qualified to answer this question, but I know that in
> contrast
> to other MPI implementations (MPICH) the modular structure of Open
> MPI is based
> on shared libs that are dlopened at the startup. As symbol
> relocation can be
> costly this might be a reason why the startup time is higher.
>
> Have you checked whether this is an mpiexec start issue or the
> MPI_Init call?
>
> Regards,
> Dorian
>
>> thanks a lot!
>> Vittorio
>>
>>
>> 
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


--
Jeff Squyres
Cisco Systems
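
To separate the mpirun/ssh launch overhead from the MPI_Init cost that Dorian asks about above, a small probe along the following lines can help; this is a sketch, not code from the thread. It simply brackets MPI_Init with gettimeofday() so the per-rank result can be compared against the wall-clock time of the whole mpirun invocation.

#include <mpi.h>
#include <stdio.h>
#include <sys/time.h>

int main(int argc, char **argv)
{
    struct timeval t0, t1;

    /* At this point the process has already been launched by mpirun/ssh. */
    gettimeofday(&t0, NULL);
    MPI_Init(&argc, &argv);
    gettimeofday(&t1, NULL);

    /* Seconds spent inside MPI_Init on this rank. */
    double init_s = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: MPI_Init took %.3f s\n", rank, init_s);

    MPI_Finalize();
    return 0;
}

If the probe reports a short MPI_Init but timing the whole mpirun still shows several seconds, the delay is in launching (ssh, loading shared libraries) rather than in MPI initialization.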


[OMPI users] TCP instead of openIB doesn't work

2009-02-27 Thread Vittorio Giovara
Hello, I'm posting here another problem with my installation.
I wanted to benchmark the differences between the tcp and openib transports.

If I run a simple non-MPI application I get
randori ~ # mpirun  --mca btl tcp,self  -np 2 -host randori -host tatami
hostname
randori
tatami

but as soon as I switch to my benchmark program I get
mpirun  --mca btl tcp,self  -np 2 -host randori -host tatami graph
Master thread reporting
matrix size 33554432 kB, time is in [us]

and instead of starting the send/receive functions it just hangs there; I
also checked the transmitted packets with Wireshark, but after the handshake
no more packets are exchanged.

I read in the archives that there were some problems in this area, so I
tried what was suggested in previous emails:

mpirun --mca btl ^openib  -np 2 -host randori -host tatami graph
mpirun --mca pml ob1  --mca btl tcp,self  -np 2 -host randori -host tatami
graph

These give exactly the same output as before (no MPI send/receive),
while the next command gives something more interesting:

mpirun --mca pml cm  --mca btl tcp,self  -np 2 -host randori -host tatami
graph
--
No available pml components were found!

This means that there are no components of this type installed on your
system or all the components reported that they could not be used.

This is a fatal error; your MPI process is likely to abort.  Check the
output of the "ompi_info" command and ensure that components of this
type are available on your system.  You may also wish to check the
value of the "component_path" MCA parameter and ensure that it has at
least one directory that contains valid MCA components.

--
[tatami:06619] PML cm cannot be selected
mpirun noticed that job rank 0 with PID 6710 on node randori exited on
signal 15 (Terminated).

which should not be possible, as if I do "ompi_info --param all" the cm PML
component is there:

 MCA pml: cm (MCA v1.0, API v1.0, Component v1.2.8)
 MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.2.8)


My test program is quite simple, just a couple of MPI_Send and MPI_Recv
calls (it is included just after my signature).
Do you have any ideas that might help me?
thanks a lot
Vittorio


#include "mpi.h"
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <sys/time.h>

#define M_COL 4096
#define M_ROW 524288
#define NUM_MSG 25

unsigned long int  gigamatrix[M_ROW][M_COL];

int main (int argc, char *argv[]) {
int numtasks, rank, dest, source, rc, tmp, count, tag=1;
unsigned long int  exp, exchanged;
unsigned long int i, j, e;
unsigned long matsize;
MPI_Status Stat;
struct timeval timing_start, timing_end;
double inittime = 0;
long int totaltime = 0;

MPI_Init (&argc, &argv);
MPI_Comm_size (MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank (MPI_COMM_WORLD, &rank);


    if (rank == 0) {
        fprintf (stderr, "Master thread reporting\n", numtasks - 1);
        matsize = (long) M_COL * M_ROW / 64;
        fprintf (stderr, "matrix size %d kB, time is in [us]\n", matsize);

        source = 1;
        dest = 1;

        /*warm up phase*/
        rc = MPI_Send (&tmp, 1, MPI_INT, dest, tag, MPI_COMM_WORLD);
        rc = MPI_Recv (&tmp, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &Stat);
        rc = MPI_Send (&tmp, 1, MPI_INT, dest, tag, MPI_COMM_WORLD);
        rc = MPI_Send (&tmp, 1, MPI_INT, dest, tag, MPI_COMM_WORLD);
        rc = MPI_Recv (&tmp, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &Stat);
        rc = MPI_Send (&tmp, 1, MPI_INT, dest, tag, MPI_COMM_WORLD);

        for (e = 0; e < NUM_MSG; e++) {
            exp = pow (2, e);
            exchanged = 64 * exp;

            /*timing of ops*/
            gettimeofday (&timing_start, NULL);
            rc = MPI_Send (&gigamatrix[0], exchanged, MPI_UNSIGNED_LONG, dest, tag, MPI_COMM_WORLD);
            rc = MPI_Recv (&gigamatrix[0], exchanged, MPI_UNSIGNED_LONG, source, tag, MPI_COMM_WORLD, &Stat);
            gettimeofday (&timing_end, NULL);

            totaltime = (timing_end.tv_sec - timing_start.tv_sec) * 100 + (timing_end.tv_usec - timing_start.tv_usec);
            memset (&timing_start, 0, sizeof(struct timeval));
            memset (&timing_end, 0, sizeof(struct timeval));
            fprintf (stdout, "%d kB\t%d\n", exp, totaltime);
        }

        fprintf(stderr, "task complete\n");

    } else {
        if (rank >= 1) {
            dest = 0;
            source = 0;

            rc = MPI_Recv (&tmp, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &Stat);
            rc = MPI_Send (&tmp, 1, MPI_INT, dest, tag, MPI_COMM_WORLD);
            rc = MPI_Recv (&tmp, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &Stat);
            rc = MPI_Recv (&tmp, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &Stat);
            rc = MPI_Send (&tmp, 1, MPI_INT, dest, tag, MPI_COMM_WORLD);
            rc = MPI_Recv (&tmp, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &Stat);

   

Re: [OMPI users] TCP instead of openIB doesn't work

2009-02-27 Thread Ralph Castain
I'm not entirely sure what is causing the problem here, but one thing  
does stand out. You have specified two -host options for the same  
application - this is not our normal syntax. The usual way of  
specifying this would be:


mpirun  --mca btl tcp,self  -np 2 -host randori,tatami hostname

I'm not entirely sure what OMPI does when it gets two separate -host  
arguments - could be equivalent to the above syntax, but could also  
cause some unusual behavior.


Could you retry your job with the revised syntax? Also, could you add  
--display-map to your mpirun cmd line? This will tell us where OMPI  
thinks the procs are going, and a little info about how it interpreted  
your cmd line.


Thanks
Ralph


On Feb 27, 2009, at 8:00 AM, Vittorio Giovara wrote:


Hello, i'm posting here another problem of my installation
I wanted to benchmark the differences between tcp and openib transport

if i run a simple non mpi application i get
randori ~ # mpirun  --mca btl tcp,self  -np 2 -host randori -host  
tatami hostname

randori
tatami

but as soon as i switch to my benchmark program i have
mpirun  --mca btl tcp,self  -np 2 -host randori -host tatami graph
Master thread reporting
matrix size 33554432 kB, time is in [us]

and instead of starting the send/receive functions it just hangs  
there; i also checked the transmitted packets with wireshark but  
after the handshake no more packets are exchanged


I read in the archives that there were some problems in this area  
and so i tried what was suggested in previous emails


mpirun --mca btl ^openib  -np 2 -host randori -host tatami graph
mpirun --mca pml ob1  --mca btl tcp,self  -np 2 -host randori -host  
tatami graph


gives exactly the same output as before (no mpisend/receive)
while the next commands gives something more interesting

mpirun --mca pml cm  --mca btl tcp,self  -np 2 -host randori -host  
tatami graph

--
No available pml components were found!

This means that there are no components of this type installed on your
system or all the components reported that they could not be used.

This is a fatal error; your MPI process is likely to abort.  Check the
output of the "ompi_info" command and ensure that components of this
type are available on your system.  You may also wish to check the
value of the "component_path" MCA parameter and ensure that it has at
least one directory that contains valid MCA components.

--
[tatami:06619] PML cm cannot be selected
mpirun noticed that job rank 0 with PID 6710 on node randori exited  
on signal 15 (Terminated).


which is not possible as if i do ompi_info --param all there is the  
CM pml component


 MCA pml: cm (MCA v1.0, API v1.0, Component v1.2.8)
 MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.2.8)


my test program is quite simple, just a couple of MPI_Send and  
MPI_Recv (just after the signature)

do you have any ideas that might help me?
thanks a lot
Vittorio


#include "mpi.h"
#include 
#include 
#include 
#include 

#define M_COL 4096
#define M_ROW 524288
#define NUM_MSG 25

unsigned long int  gigamatrix[M_ROW][M_COL];

int main (int argc, char *argv[]) {
int numtasks, rank, dest, source, rc, tmp, count, tag=1;
unsigned long int  exp, exchanged;
unsigned long int i, j, e;
unsigned long matsize;
MPI_Status Stat;
struct timeval timing_start, timing_end;
double inittime = 0;
long int totaltime = 0;

MPI_Init (&argc, &argv);
MPI_Comm_size (MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank (MPI_COMM_WORLD, &rank);


if (rank == 0) {
fprintf (stderr, "Master thread reporting\n", numtasks - 1);
matsize = (long) M_COL * M_ROW / 64;
fprintf (stderr, "matrix size %d kB, time is in [us]\n",  
matsize);


source = 1;
dest = 1;

/*warm up phase*/
rc = MPI_Send (&tmp, 1, MPI_INT, dest, tag, MPI_COMM_WORLD);
rc = MPI_Recv (&tmp, 1, MPI_INT, source, tag,  
MPI_COMM_WORLD, &Stat);

rc = MPI_Send (&tmp, 1, MPI_INT, dest, tag, MPI_COMM_WORLD);
rc = MPI_Send (&tmp, 1, MPI_INT, dest, tag, MPI_COMM_WORLD);
rc = MPI_Recv (&tmp, 1, MPI_INT, source, tag,  
MPI_COMM_WORLD, &Stat);

rc = MPI_Send (&tmp, 1, MPI_INT, dest, tag, MPI_COMM_WORLD);

for (e = 0; e < NUM_MSG; e++) {
exp = pow (2, e);
exchanged = 64 * exp;

/*timing of ops*/
gettimeofday (&timing_start, NULL);
rc = MPI_Send (&gigamatrix[0], exchanged,  
MPI_UNSIGNED_LONG, dest, tag, MPI_COMM_WORLD);
rc = MPI_Recv (&gigamatrix[0], exchanged,  
MPI_UNSIGNED_LONG, source, tag, MPI_COMM_WORLD, &Stat);

gettimeofday (&timing_end, NULL);

totaltime = (timing_end.tv_sec - timing_start.tv_sec) *  
100 + (timing_end.tv_usec - timing_start.t

Re: [OMPI users] openib RETRY EXCEEDED ERROR

2009-02-27 Thread Matt Hughes
2009/2/26 Brett Pemberton :
> [[1176,1],0][btl_openib_component.c:2905:handle_wc] from tango092.vpac.org
> to: tango090 error polling LP CQ with status RETRY EXCEEDED ERROR status
> number 12 for wr_id 38996224 opcode 0 qp_idx 0

What OS are you using?  I've seen this error and many other Infiniband
related errors on RedHat enterprise linux 4 update 4, with ConnectX
cards and various versions of OFED, up to version 1.3.  Depending on
the MCA parameters, I also see hangs often enough to make native
Infiniband unusable on this OS.

However, the openib btl works just fine on the same hardware and the
same OFED/OpenMPI stack when used with Centos 4.6.  I suspect there
may be something about the kernel that is contributing to these
problems, but I haven't had a chance to test the kernel from 4.6 on
4.4.

mch


Re: [OMPI users] openib RETRY EXCEEDED ERROR

2009-02-27 Thread Åke Sandgren
On Fri, 2009-02-27 at 09:54 -0700, Matt Hughes wrote:
> 2009/2/26 Brett Pemberton :
> > [[1176,1],0][btl_openib_component.c:2905:handle_wc] from tango092.vpac.org
> > to: tango090 error polling LP CQ with status RETRY EXCEEDED ERROR status
> > number 12 for wr_id 38996224 opcode 0 qp_idx 0
> 
> What OS are you using?  I've seen this error and many other Infiniband
> related errors on RedHat enterprise linux 4 update 4, with ConnectX
> cards and various versions of OFED, up to version 1.3.  Depending on
> the MCA parameters, I also see hangs often enough to make native
> Infiniband unusable on this OS.
> 
> However, the openib btl works just fine on the same hardware and the
> same OFED/OpenMPI stack when used with Centos 4.6.  I suspect there
> may be something about the kernel that is contributing to these
> problems, but I haven't had a chance to test the kernel from 4.6 on
> 4.4.

We see these errors fairly frequently on our CentOS 5.2 system with
Mellanox InfiniHost III cards. The OFED stack is whatever the CentOS5.2
uses. Has anyone tested that with the 1.4 OFED stack?

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



Re: [OMPI users] openib RETRY EXCEEDED ERROR

2009-02-27 Thread Pavel Shamis (Pasha)
Usually "retry exceeded error" points to some network issues, like bad 
cable or some bad connector. You may use ibdiagnet tool for the network 
debug - *http://linux.die.net/man/1/ibdiagnet. *This tool is part of OFED.


Pasha

Brett Pemberton wrote:

Hey,

I've had a couple of errors recently, of the form:

[[1176,1],0][btl_openib_component.c:2905:handle_wc] from 
tango092.vpac.org to: tango090 error polling LP CQ with status RETRY 
EXCEEDED ERROR status number 12 for wr_id 38996224 opcode 0 qp_idx 0
-- 


The InfiniBand retry count between two MPI processes has been
exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
(section 12.7.38):

My first thought was to increase the retry count, but it is already at 
maximum.


I've checked connections between the two nodes, and they seem ok

[root@tango090 ~]# ibv_rc_pingpong
  local address:  LID 0x005f, QPN 0xe4045d, PSN 0xdd13f0
  remote address: LID 0x005d, QPN 0xfe0425, PSN 0xc43fe2
8192000 bytes in 0.07 seconds = 996.93 Mbit/sec
1000 iters in 0.07 seconds = 65.74 usec/iter

How can I stop this happening in the future, without increasing the 
retry count?


cheers,

/ Brett



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
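
For reference, and not stated explicitly in this thread: the knobs involved are exposed as openib BTL MCA parameters and can be set on the mpirun command line, for example (with a placeholder executable name):

mpirun --mca btl_openib_ib_retry_count 7 --mca btl_openib_ib_timeout 20 -np 2 ./your_app

btl_openib_ib_retry_count is capped at 7 by the InfiniBand spec, which is why Brett reports it is already at maximum; btl_openib_ib_timeout scales the retransmission timeout (4.096 us * 2^value), so raising it is sometimes tried as a software-side workaround while the cabling or firmware is investigated.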




[OMPI users] defining different values for same environment variable

2009-02-27 Thread Nicolas Deladerriere
Hello

I am looking for a way to set an environment variable to a different value on
each node before running an MPI executable (not only to export the environment
variable!).
Let's say I have a cluster with two nodes (n001 and n002) and I want
to set the environment variable GMON_OUT_PREFIX to a different value on each
node.

I would like to set this variable as follows:
nicolas@n001 % setenv GMON_OUT_PREFIX 'gmon.out_'`/bin/uname -n`
nicolas@n001 % echo $GMON_OUT_PREFIX
gmon.out_n001

my mpirun command looks like:
nicolas@n001 % cat CLUSTER_NODES
n001 slots=1
n002 slots=1
nicolas@n001 % mpirun -np 2 --bynode --hostfile CLUSTER_NODES myexe

I would like to export the GMON_OUT_PREFIX environment variable so as to
have « gmon.out_n001 » on the first node and « gmon.out_n002 » on the second one.

I cannot use the '-x' option of mpirun, since it only exports
(does not set) the variable.
The MPI executable runs on cluster nodes where the HOME directory is not
mounted, so I cannot use the ~/.cshrc file.

Is there another way to do it?


Regards
Nicolas


Re: [OMPI users] defining different values for same environment variable

2009-02-27 Thread Matt Hughes
2009/2/27 Nicolas Deladerriere :
> I am looking for a way to set environment variable with different value on
> each node before running MPI executable. (not only export the environment
> variable !)

I typically use a script for things like this.  So instead of
specifying your executable directly on the mpirun command line,
specify the script.  The script can set the environment
variable, then launch your executable.

#!/bin/csh
setenv GMON_OUT_PREFIX 'gmon.out_'`/bin/uname -n`
myexe

mpirun -np 2 --bynode --hostfile CLUSTER_NODES myscript

I'm not sure if that csh syntax is right, but you get the idea.

mch


[OMPI users] Threading fault

2009-02-27 Thread Mahmoud Payami
Dear All,

I am using Intel lc_prof-11 (and its own MKL) and have built openmpi-1.3.1
with the configure options "FC=ifort F77=ifort CC=icc CXX=icpc". Then I have
built my application.
The Linux box is a 2x amd64 quad. In the middle of a run of my application
(after some 15 iterations), I receive the message below and it stops.
I tried to configure openmpi using "--disable-mpi-threads", but it
automatically assumes "posix".
This problem does not happen with openmpi-1.2.9.
Any comment is highly appreciated.
Best regards,
mahmoud payami


[hpc1:25353] *** Process received signal ***
[hpc1:25353] Signal: Segmentation fault (11)
[hpc1:25353] Signal code: Address not mapped (1)
[hpc1:25353] Failing at address: 0x51
[hpc1:25353] [ 0] /lib64/libpthread.so.0 [0x303be0dd40]
[hpc1:25353] [ 1] /opt/openmpi131_cc/lib/openmpi/mca_pml_ob1.so
[0x2e350d96]
[hpc1:25353] [ 2] /opt/openmpi131_cc/lib/openmpi/mca_pml_ob1.so
[0x2e3514a8]
[hpc1:25353] [ 3] /opt/openmpi131_cc/lib/openmpi/mca_btl_sm.so
[0x2eb7c72a]
[hpc1:25353] [ 4]
/opt/openmpi131_cc/lib/libopen-pal.so.0(opal_progress+0x89) [0x2b42b7d9]
[hpc1:25353] [ 5] /opt/openmpi131_cc/lib/openmpi/mca_pml_ob1.so
[0x2e34d27c]
[hpc1:25353] [ 6] /opt/openmpi131_cc/lib/libmpi.so.0(PMPI_Recv+0x210)
[0x2af46010]
[hpc1:25353] [ 7] /opt/openmpi131_cc/lib/libmpi_f77.so.0(mpi_recv+0xa4)
[0x2acd6af4]
[hpc1:25353] [ 8]
/opt/QE131_cc/bin/pw.x(parallel_toolkit_mp_zsqmred_+0x13da) [0x513d8a]
[hpc1:25353] [ 9] /opt/QE131_cc/bin/pw.x(pcegterg_+0x6c3f) [0x6667ff]
[hpc1:25353] [10] /opt/QE131_cc/bin/pw.x(diag_bands_+0xb9e) [0x65654e]
[hpc1:25353] [11] /opt/QE131_cc/bin/pw.x(c_bands_+0x277) [0x6575a7]
[hpc1:25353] [12] /opt/QE131_cc/bin/pw.x(electrons_+0x53f) [0x58a54f]
[hpc1:25353] [13] /opt/QE131_cc/bin/pw.x(MAIN__+0x1fb) [0x458acb]
[hpc1:25353] [14] /opt/QE131_cc/bin/pw.x(main+0x3c) [0x4588bc]
[hpc1:25353] [15] /lib64/libc.so.6(__libc_start_main+0xf4) [0x303b21d8a4]
[hpc1:25353] [16] /opt/QE131_cc/bin/pw.x(realloc+0x1b9) [0x4587e9]
[hpc1:25353] *** End of error message ***
--
mpirun noticed that process rank 6 with PID 25353 on node hpc1 exited on
signal 11 (Segmentation fault).
--


Re: [OMPI users] valgrind problems

2009-02-27 Thread Douglas Guptill
On Thu, Feb 26, 2009 at 08:27:15PM -0700, Justin wrote:
> Also the stable version of openmpi on Debian is 1.2.7rc2.  Are there any 
> known issues with this version and valgrid?

For a now-forgotten reason, I ditched the openmpi that comes on Debian
etch, and installed 1.2.8 in /usr/local.

HTH,
Douglas.


[OMPI users] threading bug?

2009-02-27 Thread Mahmoud Payami
Dear All,

I am using Intel lc_prof-11 (and its own MKL) and have built openmpi-1.3.1
with the configure options "FC=ifort F77=ifort CC=icc CXX=icpc". Then I have
built my application.
The Linux box is a 2x amd64 quad. In the middle of a run of my application
(after some 15 iterations), I receive the message below and it stops.
I tried to configure openmpi using "--disable-mpi-threads", but it
automatically assumes "posix".
This problem does not happen with openmpi-1.2.9.
Any comment is highly appreciated.
Best regards,
mahmoud payami


[hpc1:25353] *** Process received signal ***
[hpc1:25353] Signal: Segmentation fault (11)
[hpc1:25353] Signal code: Address not mapped (1)
[hpc1:25353] Failing at address: 0x51
[hpc1:25353] [ 0] /lib64/libpthread.so.0 [0x303be0dd40]
[hpc1:25353] [ 1] /opt/openmpi131_cc/lib/openmpi/mca_pml_ob1.so
[0x2e350d96]
[hpc1:25353] [ 2] /opt/openmpi131_cc/lib/openmpi/mca_pml_ob1.so
[0x2e3514a8]
[hpc1:25353] [ 3] /opt/openmpi131_cc/lib/openmpi/mca_btl_sm.so
[0x2eb7c72a]
[hpc1:25353] [ 4]
/opt/openmpi131_cc/lib/libopen-pal.so.0(opal_progress+0x89) [0x2b42b7d9]
[hpc1:25353] [ 5] /opt/openmpi131_cc/lib/openmpi/mca_pml_ob1.so
[0x2e34d27c]
[hpc1:25353] [ 6] /opt/openmpi131_cc/lib/libmpi.so.0(PMPI_Recv+0x210)
[0x2af46010]
[hpc1:25353] [ 7] /opt/openmpi131_cc/lib/libmpi_f77.so.0(mpi_recv+0xa4)
[0x2acd6af4]
[hpc1:25353] [ 8]
/opt/QE131_cc/bin/pw.x(parallel_toolkit_mp_zsqmred_+0x13da) [0x513d8a]
[hpc1:25353] [ 9] /opt/QE131_cc/bin/pw.x(pcegterg_+0x6c3f) [0x6667ff]
[hpc1:25353] [10] /opt/QE131_cc/bin/pw.x(diag_bands_+0xb9e) [0x65654e]
[hpc1:25353] [11] /opt/QE131_cc/bin/pw.x(c_bands_+0x277) [0x6575a7]
[hpc1:25353] [12] /opt/QE131_cc/bin/pw.x(electrons_+0x53f) [0x58a54f]
[hpc1:25353] [13] /opt/QE131_cc/bin/pw.x(MAIN__+0x1fb) [0x458acb]
[hpc1:25353] [14] /opt/QE131_cc/bin/pw.x(main+0x3c) [0x4588bc]
[hpc1:25353] [15] /lib64/libc.so.6(__libc_start_main+0xf4) [0x303b21d8a4]
[hpc1:25353] [16] /opt/QE131_cc/bin/pw.x(realloc+0x1b9) [0x4587e9]
[hpc1:25353] *** End of error message ***
--
mpirun noticed that process rank 6 with PID 25353 on node hpc1 exited on
signal 11 (Segmentation fault).
--


Re: [OMPI users] defining different values for same environment variable

2009-02-27 Thread Nicolas Deladerriere
Matt,

Thanks for your solution, but I had thought about that, and it is not really
convenient in my configuration to change the executable on each node.
I would like to change only the mpirun command.



2009/2/27 Matt Hughes

>

> 2009/2/27 Nicolas Deladerriere :
> > I am looking for a way to set environment variable with different value
> on
> > each node before running MPI executable. (not only export the environment
> > variable !)
>
> I typically use a script for things like this.  So instead of
> specifying your executable directly on the mpirun command line,
> instead specify the script.  The script can set the environment
> variable, then launch your executable.
>
> #!/bin/csh
> setenv GMON_OUT_PREFIX 'gmon.out_'`/bin/uname -n`
> myexe
>
> mpirun -np 2 --bynode --hostfile CLUSTER_NODES myscript
>
> I'm not sure if that csh syntax is right, but you get the idea.
>
> mch
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
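
A variant along the lines of Matt's wrapper script that leaves the executable untouched and changes only the mpirun command line would be to start each rank through a shell; this is a sketch, not from the thread, and the quoting may need adjustment:

mpirun -np 2 --bynode --hostfile CLUSTER_NODES /bin/sh -c 'GMON_OUT_PREFIX=gmon.out_`/bin/uname -n`; export GMON_OUT_PREFIX; exec myexe'

Here /bin/sh plays the role of the wrapper: each rank starts a shell on its own node, the shell evaluates `/bin/uname -n` locally, exports the variable, and then replaces itself with myexe.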


Re: [OMPI users] OMPI, and HPUX

2009-02-27 Thread Jeff Squyres

I don't know if anyone has tried OMPI on HP-UX, sorry.


On Feb 26, 2009, at 9:14 AM, Nader wrote:


Hello,

Has anyone installed OMPI on an HP-UX system?

I do appreciate any info.

Best Regards.

Nader
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Jeff Squyres
Cisco Systems



Re: [OMPI users] Latest SVN failures

2009-02-27 Thread Rolf Vandevaart
With further investigation, I have reproduced this problem.  I think I 
was originally testing against a version that was not recent enough.  I 
do not see it with r20594 which is from February 19.  So, something must 
have happened over the last 8 days.  I will try and narrow down the issue.


Rolf

On 02/27/09 09:34, Rolf Vandevaart wrote:


I just tried trunk-1.4a1r20458 and I did not see this error, although my 
configuration was rather different.  I ran across 100 2-CPU sparc nodes, 
np=256, connected with TCP.


Hopefully George's comment helps out with this issue.

One other thought to see whether SGE has anything to do with this is 
create a hostfile and run it outside of SGE.


Rolf

On 02/26/09 22:10, Ralph Castain wrote:
FWIW: I tested the trunk tonight using both SLURM and rsh launchers, 
and everything checks out fine. However, this is running under SGE and 
thus using qrsh, so it is possible the SGE support is having a problem.


Perhaps one of the Sun OMPI developers can help here?

Ralph

On Feb 26, 2009, at 7:21 PM, Ralph Castain wrote:

It looks like the system doesn't know what nodes the procs are to be 
placed upon. Can you run this with --display-devel-map? That will 
tell us where the system thinks it is placing things.


Thanks
Ralph

On Feb 26, 2009, at 3:41 PM, Mostyn Lewis wrote:


Maybe it's my pine mailer.

This is a NAMD run on 256 procs across 32 dual-socket quad-core AMD
Shanghai nodes running a standard benchmark called stmv.

The basic error message, which occurs 31 times is like:

[s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file 
../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595


The mpirun command has long paths in it, sorry. It's invoking a 
special binding script which in turn launches the NAMD run. This works 
with an older SVN at level 1.4a1r20123 (for 16, 32, 64, 128 and 512 
procs) but not for this 256-proc run, where the older SVN hangs 
indefinitely polling some completion (sm or openib). So I was trying 
later SVNs with this 256-proc run, hoping the error would go away.

Here's some of the invocation again. Hope you can read it:

EAGER_SIZE=32767
export OMPI_MCA_btl_openib_use_eager_rdma=0
export OMPI_MCA_btl_openib_eager_limit=$EAGER_SIZE
export OMPI_MCA_btl_self_eager_limit=$EAGER_SIZE
export OMPI_MCA_btl_sm_eager_limit=$EAGER_SIZE

and, unexpanded

mpirun --prefix $PREFIX -np %PE% $MCA -x 
OMPI_MCA_btl_openib_use_eager_rdma -x 
OMPI_MCA_btl_openib_eager_limit -x OMPI_MCA_btl_self_eager_limit -x 
OMPI_MCA_btl_sm_eager_limit -machinefile $HOSTS $MPI_BINDER $NAMD2 
stmv.namd


and, expanded

mpirun --prefix 
/tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron 
-np 256 --mca btl sm,openib,self -x 
OMPI_MCA_btl_openib_use_eager_rdma -x 
OMPI_MCA_btl_openib_eager_limit -x OMPI_MCA_btl_self_eager_limit -x 
OMPI_MCA_btl_sm_eager_limit -machinefile /tmp/48292.1.all.q/newhosts 
/ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL 
/ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2 
stmv.namd


This is all via Sun Grid Engine.
The OS as indicated above is SuSE SLES 10 SP2.

DM
On Thu, 26 Feb 2009, Ralph Castain wrote:

I'm sorry, but I can't make any sense of this message. Could you 
provide a
little explanation of what you are doing, what the system looks 
like, what is

supposed to happen, etc? I can barely parse your cmd line...

Thanks
Ralph

On Feb 26, 2009, at 1:03 PM, Mostyn Lewis wrote:


Today's and yesterdays.

1.4a1r20643_svn

+ mpirun --prefix /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron -np 256 --mca btl sm,openib,self -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit -machinefile /tmp/48269.1.all.q/newhosts /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2 stmv.namd
[s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
[s0128:24439] [[64102,0],4] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
[s0156:29300] [[64102,0],12] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
[s0168:20585] [[64102,0],20] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
[s0181:19554] [[64102,0],28] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595

Made with INTEL compilers 10.1.015.


Regards,
Mostyn

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


__

Re: [OMPI users] openib RETRY EXCEEDED ERROR

2009-02-27 Thread Jeff Squyres

On Feb 27, 2009, at 12:09 PM, Åke Sandgren wrote:


We see these errors fairly frequently on our CentOS 5.2 system with
Mellanox InfiniHost III cards. The OFED stack is whatever the  
CentOS5.2

uses. Has anyone tested that with the 1.4 OFED stack?



FWIW, I have tested OMPI's openib BTL with several different versions  
of the OFED stack: 1.3, 1.3.1, 1.4, etc.  I used various flavors of  
RHEL4 and RHEL5.


For a variety of uninteresting reasons, I usually uninstall the OS- 
installed verbs stack/drivers and install OFED.  So I don't have much  
experience with the verbs stacks/drivers that ship with the various  
distros.


--
Jeff Squyres
Cisco Systems




Re: [OMPI users] TCP instead of openIB doesn't work

2009-02-27 Thread Vittorio Giovara
Hello, I've corrected the syntax and added the flag you suggested, but
unfortunately the result doesn't change.

randori ~ # mpirun --display-map --mca btl tcp,self  -np 2 -host
randori,tatami graph
[randori:22322]  Map for job: 1  Generated by mapping mode: byslot
 Starting vpid: 0  Vpid range: 2  Num app_contexts: 1
 Data for app_context: index 0  app: graph
 Num procs: 2
 Argv[0]: graph
 Env[0]: OMPI_MCA_btl=tcp,self
 Env[1]: OMPI_MCA_rmaps_base_display_map=1
 Env[2]: OMPI_MCA_orte_precondition_transports=d45d47f6e1ed0e0b-691fd7f24609dec3
 Env[3]: OMPI_MCA_rds=proxy
 Env[4]: OMPI_MCA_ras=proxy
 Env[5]: OMPI_MCA_rmaps=proxy
 Env[6]: OMPI_MCA_pls=proxy
 Env[7]: OMPI_MCA_rmgr=proxy
 Working dir: /root (user: 0)
 Num maps: 1
 Data for app_context_map: Type: 1  Data: randori,tatami
 Num elements in nodes list: 2
 Mapped node:
 Cell: 0  Nodename: randori  Launch id: -1  Username: NULL
 Daemon name:
 Data type: ORTE_PROCESS_NAME  Data Value: NULL
 Oversubscribed: False  Num elements in procs list: 1
 Mapped proc:
 Proc Name:
 Data type: ORTE_PROCESS_NAME  Data Value: [0,1,0]
 Proc Rank: 0  Proc PID: 0  App_context index: 0

 Mapped node:
 Cell: 0  Nodename: tatami  Launch id: -1  Username: NULL
 Daemon name:
 Data type: ORTE_PROCESS_NAME  Data Value: NULL
 Oversubscribed: False  Num elements in procs list: 1
 Mapped proc:
 Proc Name:
 Data type: ORTE_PROCESS_NAME  Data Value: [0,1,1]
 Proc Rank: 1  Proc PID: 0  App_context index: 0
Master thread reporting
matrix size 33554432 kB, time is in [us]

(and then it just hangs)

Vittorio

On Fri, Feb 27, 2009 at 6:00 PM,  wrote:

>
> Date: Fri, 27 Feb 2009 08:22:17 -0700
> From: Ralph Castain 
> Subject: Re: [OMPI users] TCP instead of openIB doesn't work
> To: Open MPI Users 
> Message-ID: 
> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
>
> I'm not entirely sure what is causing the problem here, but one thing
> does stand out. You have specified two -host options for the same
> application - this is not our normal syntax. The usual way of
> specifying this would be:
>
> mpirun  --mca btl tcp,self  -np 2 -host randori,tatami hostname
>
> I'm not entirely sure what OMPI does when it gets two separate -host
> arguments - could be equivalent to the above syntax, but could also
> cause some unusual behavior.
>
> Could you retry your job with the revised syntax? Also, could you add
> --display-map to your mpirun cmd line? This will tell us where OMPI
> thinks the procs are going, and a little info about how it interpreted
> your cmd line.
>
> Thanks
> Ralph
>
>
> On Feb 27, 2009, at 8:00 AM, Vittorio Giovara wrote:
>
> > Hello, i'm posting here another problem of my installation
> > I wanted to benchmark the differences between tcp and openib transport
> >
> > if i run a simple non mpi application i get
> > randori ~ # mpirun  --mca btl tcp,self  -np 2 -host randori -host
> > tatami hostname
> > randori
> > tatami
> >
> > but as soon as i switch to my benchmark program i have
> > mpirun  --mca btl tcp,self  -np 2 -host randori -host tatami graph
> > Master thread reporting
> > matrix size 33554432 kB, time is in [us]
> >
> > and instead of starting the send/receive functions it just hangs
> > there; i also checked the transmitted packets with wireshark but
> > after the handshake no more packets are exchanged
> >
> > I read in the archives that there were some problems in this area
> > and so i tried what was suggested in previous emails
> >
> > mpirun --mca btl ^openib  -np 2 -host randori -host tatami graph
> > mpirun --mca pml ob1  --mca btl tcp,self  -np 2 -host randori -host
> > tatami graph
> >
> > gives exactly the same output as before (no mpisend/receive)
> > while the next commands gives something more interesting
> >
> > mpirun --mca pml cm  --mca btl tcp,self  -np 2 -host randori -host
> > tatami graph
> >
> --
> > No available pml components were found!
> >
> > This means that there are no components of this type installed on your
> > system or all the components reported that they could not be used.
> >
> > This is a fatal error; your MPI process is likely to abort.  Check the
> > output of the "ompi_info" command and ensure that components of this
> > type are available on your system.  You may also wish to check the
> > value of the "component_path" MCA parameter and ensure that it has at
> > least one directory that contains valid MCA components.
> >
> >
> --
> > [tatami:06619] PML cm cannot be selected
> > mpirun noticed that job rank 0 with PID 6710 on node randori exited on signal 15 (Terminated).

Re: [OMPI users] Latest SVN failures

2009-02-27 Thread Jeff Squyres
Unfortunately, I think I have reproduced the problem as well -- with  
SVN trunk HEAD (r20655):


[15:12] svbu-mpi:~/mpi % mpirun --mca bogus foo --bynode -np 2 uptime
[svbu-mpi.cisco.com:24112] [[62779,0],0] ORTE_ERROR_LOG: Data unpack failed in file base/odls_base_default_fns.c at line 566

--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--

Notice that I'm not trying to run an MPI app -- it's just "uptime".

The following things seem to be necessary to make this error occur for  
me:


1. --bynode
2. set some mca parameter (any mca parameter)
3. -np value less than the size of my slurm allocation

If I remove any of those, it seems to run fine.


On Feb 27, 2009, at 5:05 PM, Rolf Vandevaart wrote:

With further investigation, I have reproduced this problem.  I think  
I was originally testing against a version that was not recent  
enough.  I do not see it with r20594 which is from February 19.  So,  
something must have happened over the last 8 days.  I will try and  
narrow down the issue.


Rolf

On 02/27/09 09:34, Rolf Vandevaart wrote:
I just tried trunk-1.4a1r20458 and I did not see this error, although
my configuration was rather different.  I ran across 100 2-CPU sparc
nodes, np=256, connected with TCP.

Hopefully George's comment helps out with this issue.
One other thought: to see whether SGE has anything to do with this,
create a hostfile and run it outside of SGE.

Rolf
On 02/26/09 22:10, Ralph Castain wrote:
FWIW: I tested the trunk tonight using both SLURM and rsh  
launchers, and everything checks out fine. However, this is  
running under SGE and thus using qrsh, so it is possible the SGE  
support is having a problem.


Perhaps one of the Sun OMPI developers can help here?

Ralph

On Feb 26, 2009, at 7:21 PM, Ralph Castain wrote:

It looks like the system doesn't know what nodes the procs are to  
be placed upon. Can you run this with --display-devel-map? That  
will tell us where the system thinks it is placing things.


Thanks
Ralph

On Feb 26, 2009, at 3:41 PM, Mostyn Lewis wrote:


Maybe it's my pine mailer.

This is a NAMD run on 256 procs across 32 dual-socket quad-core AMD
Shanghai nodes running a standard benchmark called stmv.

The basic error message, which occurs 31 times, is like:

[s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595


The mpirun command has long paths in it, sorry. It's invoking a special
binding script which in turn launches the NAMD run. This works on an
older SVN at level 1.4a1r20123 (for 16, 32, 64, 128 and 512 procs) but
not for this 256-proc run, where the older SVN hangs indefinitely
polling some completion (sm or openib). So I was trying later SVNs with
this 256-proc run, hoping the error would go away.

Here's some of the invocation again. Hope you can read it:

EAGER_SIZE=32767
export OMPI_MCA_btl_openib_use_eager_rdma=0
export OMPI_MCA_btl_openib_eager_limit=$EAGER_SIZE
export OMPI_MCA_btl_self_eager_limit=$EAGER_SIZE
export OMPI_MCA_btl_sm_eager_limit=$EAGER_SIZE

and, unexpanded

mpirun --prefix $PREFIX -np %PE% $MCA -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit -machinefile $HOSTS $MPI_BINDER $NAMD2 stmv.namd


and, expanded

mpirun --prefix /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron -np 256 --mca btl sm,openib,self -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit -machinefile /tmp/48292.1.all.q/newhosts /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2 stmv.namd


This is all via Sun Grid Engine.
The OS as indicated above is SuSE SLES 10 SP2.

DM
On Thu, 26 Feb 2009, Ralph Castain wrote:

I'm sorry, but I can't make any sense of this message. Could  
you provide a
little explanation of what you are doing, what the system looks  
like, what is

supposed to happen, etc? I can barely parse your cmd line...

Thanks
Ralph

On Feb 26, 2009, at 1:03 PM, Mostyn Lewis wrote:


Today's and yesterday's.

1.4a1r20643_svn

+ mpirun --prefix /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron -np 256 --mca btl sm,openib,self -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit -machinefile /tmp/48269.1.all.q/newhosts /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed

Re: [OMPI users] TCP instead of openIB doesn't work

2009-02-27 Thread Jeff Squyres

I notice the following:

- you're creating an *enormous* array on the stack.  You might be better
off allocating it on the heap.
- the value of "exchanged" will quickly grow beyond 2^31 - 1 (i.e.,
INT_MAX), which is the max count that the MPI API can handle.  Bad
Things can/will happen beyond that value (i.e., you're keeping the value
of "exchanged" in a long unsigned int, but MPI_Send and MPI_Recv only
take an int).  A sketch of both fixes is below.
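
For illustration, a minimal sketch of both fixes; the helper name
(send_large) and its error handling are assumptions made for this
example, not code from this thread.  The matrix is allocated on the
heap, and any count larger than INT_MAX is split into chunks before
being handed to MPI_Send:

#include <limits.h>
#include <stdlib.h>
#include "mpi.h"

/* Send 'count' unsigned longs in chunks no larger than INT_MAX,
 * because the count argument of MPI_Send is a plain int. */
static int send_large (unsigned long *buf, unsigned long count,
                       int dest, int tag, MPI_Comm comm)
{
    while (count > 0) {
        int chunk = (count > (unsigned long) INT_MAX) ? INT_MAX : (int) count;
        int rc = MPI_Send (buf, chunk, MPI_UNSIGNED_LONG, dest, tag, comm);
        if (rc != MPI_SUCCESS)
            return rc;
        buf += chunk;
        count -= (unsigned long) chunk;
    }
    return MPI_SUCCESS;
}

/* The matrix itself would live on the heap instead of being a fixed
 * array, e.g.:
 *     unsigned long *gigamatrix =
 *         malloc ((size_t) M_ROW * M_COL * sizeof *gigamatrix);
 */

The receiving side needs a matching loop around MPI_Recv that uses the
same chunk sizes.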



On Feb 27, 2009, at 10:00 AM, Vittorio Giovara wrote:


Hello, i'm posting here another problem of my installation
I wanted to benchmark the differences between tcp and openib transport

if i run a simple non mpi application i get
randori ~ # mpirun  --mca btl tcp,self  -np 2 -host randori -host  
tatami hostname

randori
tatami

but as soon as i switch to my benchmark program i have
mpirun  --mca btl tcp,self  -np 2 -host randori -host tatami graph
Master thread reporting
matrix size 33554432 kB, time is in [us]

and instead of starting the send/receive functions it just hangs  
there; i also checked the transmitted packets with wireshark but  
after the handshake no more packets are exchanged


I read in the archives that there were some problems in this area  
and so i tried what was suggested in previous emails


mpirun --mca btl ^openib  -np 2 -host randori -host tatami graph
mpirun --mca pml ob1  --mca btl tcp,self  -np 2 -host randori -host  
tatami graph


gives exactly the same output as before (no mpisend/receive)
while the next commands gives something more interesting

mpirun --mca pml cm  --mca btl tcp,self  -np 2 -host randori -host  
tatami graph

--
No available pml components were found!

This means that there are no components of this type installed on your
system or all the components reported that they could not be used.

This is a fatal error; your MPI process is likely to abort.  Check the
output of the "ompi_info" command and ensure that components of this
type are available on your system.  You may also wish to check the
value of the "component_path" MCA parameter and ensure that it has at
least one directory that contains valid MCA components.

--
[tatami:06619] PML cm cannot be selected
mpirun noticed that job rank 0 with PID 6710 on node randori exited  
on signal 15 (Terminated).


which is not possible as if i do ompi_info --param all there is the  
CM pml component


 MCA pml: cm (MCA v1.0, API v1.0, Component v1.2.8)
 MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.2.8)


my test program is quite simple, just a couple of MPI_Send and  
MPI_Recv (just after the signature)

do you have any ideas that might help me?
thanks a lot
Vittorio


#include "mpi.h"
#include <stdio.h>
#include <math.h>
#include <string.h>
#include <sys/time.h>

#define M_COL 4096
#define M_ROW 524288
#define NUM_MSG 25

unsigned long int  gigamatrix[M_ROW][M_COL];

int main (int argc, char *argv[]) {
int numtasks, rank, dest, source, rc, tmp, count, tag=1;
unsigned long int  exp, exchanged;
unsigned long int i, j, e;
unsigned long matsize;
MPI_Status Stat;
struct timeval timing_start, timing_end;
double inittime = 0;
long int totaltime = 0;

MPI_Init (&argc, &argv);
MPI_Comm_size (MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank (MPI_COMM_WORLD, &rank);


if (rank == 0) {
fprintf (stderr, "Master thread reporting\n", numtasks - 1);
matsize = (long) M_COL * M_ROW / 64;
fprintf (stderr, "matrix size %lu kB, time is in [us]\n", matsize);


source = 1;
dest = 1;

/* warm up phase: a few small untimed ping-pong messages */
rc = MPI_Send (&tmp, 1, MPI_INT, dest, tag, MPI_COMM_WORLD);
rc = MPI_Recv (&tmp, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &Stat);

rc = MPI_Send (&tmp, 1, MPI_INT, dest, tag, MPI_COMM_WORLD);
rc = MPI_Send (&tmp, 1, MPI_INT, dest, tag, MPI_COMM_WORLD);
rc = MPI_Recv (&tmp, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &Stat);

rc = MPI_Send (&tmp, 1, MPI_INT, dest, tag, MPI_COMM_WORLD);

for (e = 0; e < NUM_MSG; e++) {
exp = pow (2, e);
exchanged = 64 * exp;

/* timing of ops: round-trip of 'exchanged' unsigned longs */
gettimeofday (&timing_start, NULL);
rc = MPI_Send (&gigamatrix[0], exchanged, MPI_UNSIGNED_LONG, dest, tag, MPI_COMM_WORLD);
rc = MPI_Recv (&gigamatrix[0], exchanged, MPI_UNSIGNED_LONG, source, tag, MPI_COMM_WORLD, &Stat);

gettimeofday (&timing_end, NULL);

/* elapsed time in microseconds */
totaltime = (timing_end.tv_sec - timing_start.tv_sec) * 1000000 + (timing_end.tv_usec - timing_start.tv_usec);

memset (&timing_start, 0, sizeof(struct timeval));
memset (&timing_end, 0, sizeof(struct timeval));
fprintf (stdout, "%lu kB\t%ld\n", exp, totaltime);
}

fprintf(stderr, "task complete\n");

} else {
if (rank >= 1) {