[OMPI users] Problem with gateway between 2 hosts

2008-06-30 Thread Geoffroy Pignot
Hi,

Does anybody face problems running Open MPI on two hosts on different
networks (with a gateway to reach each other)?
Let's say compil02's IP address is 172.3.9.10 and r009n001's is 10.160.4.1.

There is no problem with MPI_Init-free executables (for example, hostname):

compil02% /tmp/HALMPI/openmpi-1.2.2/bin/mpirun --prefix /tmp/HALMPI/openmpi-1.2.2 \
    -np 1 -host compil02 hostname : -np 1 -host r009n001 hostname
r009n001
compil02

But as soon as I try a simple hello world, it's crashing with the
following error message.
Please note that when I run hello between r009n001 (10.160.4.1) and
r009n002 (10.160.4.2), it works fine.

Thanks in advance for your help.
Regards

Geoffroy


PS: same error with Open MPI v1.2.5


compil02% /tmp/HALMPI/openmpi-1.2.2/bin/mpirun --prefix /tmp/HALMPI/openmpi-1.2.2 \
    -np 1 -host compil02 /tmp/hello : -np 1 -host r009n001 /tmp/hello
--
Process 0.1.0 is unable to reach 0.1.1 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--
--
Process 0.1.1 is unable to reach 0.1.0 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
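
For reference, when Open MPI reports "Unreachable" during MPI_INIT between
hosts on different subnets, a common first diagnostic is to pin the TCP BTL
to a known-routable interface on each host. A minimal sketch, assuming eth0
is the routable interface on both machines (actual interface names are
site-specific):

compil02% /tmp/HALMPI/openmpi-1.2.2/bin/mpirun --prefix /tmp/HALMPI/openmpi-1.2.2 \
    --mca btl tcp,self --mca btl_tcp_if_include eth0 \
    -np 1 -host compil02 /tmp/hello : -np 1 -host r009n001 /tmp/hello

If this still fails, the problem is usually routing or filtering between the
two subnets rather than Open MPI itself.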


Re: [OMPI users] Problem with gateway between 2 hosts

2008-06-30 Thread Reuti

Hi,

On 30.06.2008, at 17:29, Geoffroy Pignot wrote:


Does anybody face problems running Open MPI on two hosts on different
networks (with a gateway to reach each other)? Let's say compil02's IP
address is 172.3.9.10 and r009n001's is 10.160.4.1. [...] Please note
that when I run hello between r009n001 (10.160.4.1) and r009n002
(10.160.4.2), it works fine.

Are the 172.x.y.z nodes behind a NAT (so that communication back isn't
possible, and only the stdout from the rsh/ssh is working in this
case)?


-- Reuti
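
One way to test that hypothesis (a sketch; the exact utilities available
differ between sites) is to check from one of the 10.160.x.x nodes whether
the 172.3.x.x address can be reached at all, since Open MPI's TCP
connections may be opened from either side:

r009n001% ping -c 1 172.3.9.10
r009n001% traceroute 172.3.9.10

If the reverse path goes through a NAT, connection attempts from r009n001
back to compil02 will fail even though rsh/ssh launched from compil02 works.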







Re: [OMPI users] Code Seg Faults in Devel Series

2008-06-30 Thread Doug Roberts


This is resolved. ;)  On our system, for releases after 1.3a1r18423
up to and including the latest release on the 1.4 trunk, configure
requires the --enable-mpi-threads option to be specified explicitly
for the cpi.c program to run successfully, as shown here:

# ./configure --prefix=/opt/testing/openmpi/1.4a1r18770 \
   --enable-mpi-threads --with-gm=/opt/gm

# mpirun -np 4 -machinefile ~/bruhosts a.out
Process 1 of 4 is on bru27
Process 3 of 4 is on bru27
Process 0 of 4 is on bru25
Process 2 of 4 is on bru25
pi is approximately 3.1415926544231239, Error is 0.0807
wall clock time = 0.004372
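
Whether a given installation was built with thread support can be checked
with ompi_info; a sketch (the exact wording of the output varies between
releases):

# ompi_info | grep -i thread
  Thread support: posix (mpi: yes, progress: no)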

Otherwise, the segfault occurs as shown before:

# ./configure --prefix=/opt/testing/openmpi/1.4a1r18770 --with-gm=/opt/gm

# mpirun -np 4 -machinefile ~/bruhosts a.out
Process 1 of 4 is on bru27
Process 3 of 4 is on bru27
Process 0 of 4 is on bru25
[bru25:30651] *** Process received signal ***
[bru25:30651] Signal: Segmentation fault (11)
[bru25:30651] Signal code: Address not mapped (1)
[bru25:30651] Failing at address: 0x9
Process 2 of 4 is on bru25
[bru25:30651] [ 0] /lib64/tls/libpthread.so.0 [0x2a95f7e420]
[bru25:30651] [ 1] 
/opt/sharcnet/testing/openmpi/1.3a1r18740/lib/openmpi/mca_btl_gm.so 
[0x2a97980fb9]
[bru25:30651] [ 2] 
/opt/sharcnet/testing/openmpi/1.3a1r18740/lib/openmpi/mca_pml_ob1.so 
[0x2a97672c1d]
[bru25:30651] [ 3] 
/opt/sharcnet/testing/openmpi/1.3a1r18740/lib/openmpi/mca_pml_ob1.so 
[0x2a97667753]
[bru25:30651] [ 4] 
/opt/sharcnet/testing/openmpi/1.3a1r18740/lib/openmpi/mca_coll_tuned.so 
[0x2a9857db1c]
[bru25:30651] [ 5] 
/opt/sharcnet/testing/openmpi/1.3a1r18740/lib/openmpi/mca_coll_tuned.so 
[0x2a9857de27]
[bru25:30651] [ 6] 
/opt/sharcnet/testing/openmpi/1.3a1r18740/lib/openmpi/mca_coll_tuned.so 
[0x2a98573eec]
[bru25:30651] [ 7] 
/opt/sharcnet/testing/openmpi/current/lib/libmpi.so.0(PMPI_Bcast+0x13e) 
[0x2a956b405e]

[bru25:30651] [ 8] a.out(main+0xd6) [0x400d0f]
[bru25:30651] [ 9] /lib64/tls/libc.so.6(__libc_start_main+0xdb) 
[0x2a960a34bb]

[bru25:30651] [10] a.out [0x400b7a]
[bru25:30651] *** End of error message ***
[bru34:06039] 
--
mpirun noticed that process rank 0 with PID 30651 on node bru25 exited on 
signal 11 (Segmentation fault).

--
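
When the frames in such a trace show only raw addresses inside the component
libraries, rebuilding the same configuration with debugging symbols usually
makes the backtrace readable; a sketch, reusing the options from the failing
build:

# ./configure --prefix=/opt/testing/openmpi/1.4a1r18770 --with-gm=/opt/gm \
   --enable-debug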


On Fri, 27 Jun 2008, Doug Roberts wrote:



Hi, I am trying to use the latest release of v1.3 to test with BLCR;
however, I just noticed that sometime after 1.3a1r18423 the standard
MPICH sample code (cpi.c) stopped working on our rel4-based Myrinet GM
clusters, which raises some concern.

Please find attached: gm_board_info.out, ompi_info--all.out,
ompi_info--param-btl-gm.out and config-1.4a1r18743.log bundled
in mpi-output.tar.gz for your analysis.

Below shows that the sample code runs with 1.3a1r18423 but crashes with
1.3a1r18740, and indeed with every snapshot newer than 1.3a1r18423 that
I have tested.