[OMPI users] Problem with gateway between 2 hosts
Hi,

Has anybody else had problems running Open MPI on two hosts on different networks (a gateway is needed to reach the other one)? Let's say compil02's IP address is 172.3.9.10 and r009n001's is 10.160.4.1.

There is no problem with MPI_Init-free executables (for example, hostname):

compil02% /tmp/HALMPI/openmpi-1.2.2/bin/mpirun --prefix /tmp/HALMPI/openmpi-1.2.2 -np 1 -host compil02 hostname : -np 1 -host r009n001 hostname
r009n001
compil02

But as soon as I try a simple hello world, it crashes with the error message below. Please note that when I run hello between r009n001 (10.160.4.1) and r009n002 (10.160.4.2), it works fine.

Thanks in advance for your help.

Regards,
Geoffroy

PS: the same error occurs with Open MPI v1.2.5.

compil02% /tmp/HALMPI/openmpi-1.2.2/bin/mpirun --prefix /tmp/HALMPI/openmpi-1.2.2 -np 1 -host compil02 /tmp/hello : -np 1 -host r009n001 /tmp/hello
--------------------------------------------------------------------------
Process 0.1.0 is unable to reach 0.1.1 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Process 0.1.1 is unable to reach 0.1.0 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
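For reference, the /tmp/hello source is not included in the thread; a minimal MPI hello world along the lines sketched below (an assumption, not the actual file) would exercise the same code path, built with, e.g., mpicc hello.c -o /tmp/hello. Unlike a plain hostname run, any program that calls MPI_Init makes the two processes set up MPI-level connections to each other, which is the point where the "Unreachable" error above is raised.

    /* Sketch of a minimal MPI hello world of the kind referred to above
     * as /tmp/hello (the actual source is not shown in the thread). */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);   /* the runs above fail at this step */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(name, &len);
        printf("Hello from rank %d of %d on %s\n", rank, size, name);
        MPI_Finalize();
        return 0;
    }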
Re: [OMPI users] Problem with gateway between 2 hosts
Hi,

On 30.06.2008 at 17:29, Geoffroy Pignot wrote:

> Hi,
>
> Has anybody else had problems running Open MPI on two hosts on different
> networks (a gateway is needed to reach the other one)? Let's say compil02's
> IP address is 172.3.9.10 and r009n001's is 10.160.4.1.
>
> There is no problem with MPI_Init-free executables (for example, hostname):
>
> compil02% /tmp/HALMPI/openmpi-1.2.2/bin/mpirun --prefix /tmp/HALMPI/openmpi-1.2.2 -np 1 -host compil02 hostname : -np 1 -host r009n001 hostname
> r009n001
> compil02
>
> But as soon as I try a simple hello world, it crashes with the error
> message below. Please note that when I run hello between r009n001
> (10.160.4.1) and r009n002 (10.160.4.2), it works fine.

Are the 172.x.y.z nodes behind a NAT (so that communication back to them isn't possible, and only the stdout from the rsh/ssh works in this case)?

-- Reuti
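If the two subnets are in fact routable to each other (no NAT involved), another thing worth checking is which interfaces the TCP BTL is using on each host, since the "Unreachable" error above presumably comes from the TCP BTL concluding that the peer's address cannot be reached. A hedged example of restricting it explicitly, not something confirmed in this thread; the interface name eth0 is only a placeholder for whatever each host actually has:

compil02% /tmp/HALMPI/openmpi-1.2.2/bin/mpirun --prefix /tmp/HALMPI/openmpi-1.2.2 \
    --mca btl tcp,self --mca btl_tcp_if_include eth0 \
    -np 1 -host compil02 /tmp/hello : -np 1 -host r009n001 /tmp/hello

Running ompi_info --param btl tcp lists the btl_tcp_if_include / btl_tcp_if_exclude parameters and their current values on each machine.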
Re: [OMPI users] Code Seg Faults in Devel Series
This is resolved ;)

On our system, for releases after 1.3a1r18423 up to and including the latest release in the 1.4 trunk, configure requires the --enable-mpi-threads option to be explicitly specified for the cpi.c program to run successfully, as shown here:

# ./configure --prefix=/opt/testing/openmpi/1.4a1r18770 \
    --enable-mpi-threads --with-gm=/opt/gm
# mpirun -np 4 -machinefile ~/bruhosts a.out
Process 1 of 4 is on bru27
Process 3 of 4 is on bru27
Process 0 of 4 is on bru25
Process 2 of 4 is on bru25
pi is approximately 3.1415926544231239, Error is 0.0807
wall clock time = 0.004372

Omitting the option yields the segfault shown before:

# ./configure --prefix=/opt/testing/openmpi/1.4a1r18770 --with-gm=/opt/gm
# mpirun -np 4 -machinefile ~/bruhosts a.out
Process 1 of 4 is on bru27
Process 3 of 4 is on bru27
Process 0 of 4 is on bru25
[bru25:30651] *** Process received signal ***
[bru25:30651] Signal: Segmentation fault (11)
[bru25:30651] Signal code: Address not mapped (1)
[bru25:30651] Failing at address: 0x9
Process 2 of 4 is on bru25
[bru25:30651] [ 0] /lib64/tls/libpthread.so.0 [0x2a95f7e420]
[bru25:30651] [ 1] /opt/sharcnet/testing/openmpi/1.3a1r18740/lib/openmpi/mca_btl_gm.so [0x2a97980fb9]
[bru25:30651] [ 2] /opt/sharcnet/testing/openmpi/1.3a1r18740/lib/openmpi/mca_pml_ob1.so [0x2a97672c1d]
[bru25:30651] [ 3] /opt/sharcnet/testing/openmpi/1.3a1r18740/lib/openmpi/mca_pml_ob1.so [0x2a97667753]
[bru25:30651] [ 4] /opt/sharcnet/testing/openmpi/1.3a1r18740/lib/openmpi/mca_coll_tuned.so [0x2a9857db1c]
[bru25:30651] [ 5] /opt/sharcnet/testing/openmpi/1.3a1r18740/lib/openmpi/mca_coll_tuned.so [0x2a9857de27]
[bru25:30651] [ 6] /opt/sharcnet/testing/openmpi/1.3a1r18740/lib/openmpi/mca_coll_tuned.so [0x2a98573eec]
[bru25:30651] [ 7] /opt/sharcnet/testing/openmpi/current/lib/libmpi.so.0(PMPI_Bcast+0x13e) [0x2a956b405e]
[bru25:30651] [ 8] a.out(main+0xd6) [0x400d0f]
[bru25:30651] [ 9] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x2a960a34bb]
[bru25:30651] [10] a.out [0x400b7a]
[bru25:30651] *** End of error message ***
[bru34:06039]
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 30651 on node bru25 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

On Fri, 27 Jun 2008, Doug Roberts wrote:

> Hi,
>
> I am trying to use the latest release of v1.3 to test with BLCR; however,
> I just noticed that sometime after 1.3a1r18423 the standard mpich sample
> code (cpi.c) stopped working on our rel4-based Myrinet GM clusters, which
> raises some concern.
>
> Please find attached: gm_board_info.out, ompi_info--all.out,
> ompi_info--param-btl-gm.out and config-1.4a1r18743.log, bundled in
> mpi-output.tar.gz for your analysis.
>
> The output below shows that the sample code runs with 1.3a1r18423 but
> crashes with 1.3a1r18740, and it also crashes with every snapshot newer
> than 1.3a1r18423 that I have tested.
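For context, the a.out in the runs above is built from the standard MPICH cpi.c example mentioned in the original report. Condensed, its core logic looks roughly like the sketch below (not the verbatim file): rank 0 broadcasts the interval count and a reduction collects the partial sums, which is why the backtrace above ends inside PMPI_Bcast via coll_tuned and pml_ob1.

    /* Condensed sketch of the standard cpi.c example (not the verbatim file):
     * rank 0 broadcasts the number of intervals, every rank integrates its
     * slice of 4/(1+x^2), and the partial sums are reduced back to rank 0.
     * The segfault above occurs inside the MPI_Bcast collective. */
    #include <stdio.h>
    #include <math.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int n = 10000, rank, size, i;
        double h, sum, x, mypi, pi;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* crash site in the backtrace */

        h = 1.0 / (double)n;
        sum = 0.0;
        for (i = rank + 1; i <= n; i += size) {
            x = h * ((double)i - 0.5);
            sum += 4.0 / (1.0 + x * x);
        }
        mypi = h * sum;

        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("pi is approximately %.16f, Error is %.16f\n",
                   pi, fabs(pi - 3.141592653589793));
        MPI_Finalize();
        return 0;
    }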