[OMPI users] MPI_Waitany segfaults or (maybe) hangs

2011-10-08 Thread Francesco Salvadore
Dear MPI users, 

I am struggling with the bad behaviour of an MPI code. Here is the
basic information:

a) Fortran (Intel 11 or Intel 12) with OpenMPI 1.4.1 or 1.4.3 gives the
same problem. Activating the -traceback compiler option, I see the
program stops at MPI_Waitany. MPI_Waitany waits for the completion of an
array of MPI_Irecv requests: looping for the number of array components,
at the end all receives should be completed.
The program stops at unpredictable points (after 1, 5 or 24 hours of
computation). Sometimes I get a sigsegv:

mca_btl_openib.so  2BA74D29D181  Unknown   Unknown  Unknown 
mca_btl_openib.so  2BA74D29C6FF  Unknown   Unknown  Unknown 
mca_btl_openib.so  2BA74D29C033  Unknown   Unknown  Unknown 
libopen-pal.so.0   2BA74835C3E6  Unknown   Unknown  Unknown 
libmpi.so.0    2BA747E485AD  Unknown   Unknown  Unknown 
libmpi.so.0    2BA747E7857D  Unknown   Unknown  Unknown 
libmpi_f77.so.0    2BA747C047C4  Unknown   Unknown  Unknown 
cosa.mpi   004F856B  waitanymessages_ 1292  parallelutils.f 
cosa.mpi   004C8044  cutman_q_    2084  bc.f 
cosa.mpi   00413369  smooth_  2029  cosa.f 
cosa.mpi   00410782  mg_   810  cosa.f 
cosa.mpi   0040FB78  MAIN__    537  cosa.f 
cosa.mpi   0040C1FC  Unknown   Unknown  Unknown 
libc.so.6  2BA7490AE994  Unknown   Unknown  Unknown 
cosa.mpi   0040C109  Unknown   Unknown  Unknown 
-- 
mpirun has exited due to process rank 34 with PID 10335 on 
node neo251 exiting without calling "finalize". This may 
have caused other processes in the application to be 
terminated by signals sent by mpirun (as reported here). 
-- 

Waitanymessages is just a wrapper around MPI_Waitany. Sometimes the run
stops writing anything to the screen and I do not know what is
happening (probably MPI_Waitany hangs). Before reaching the segfault or
hang, results are always correct, as checked against the serial version
of the code.
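
For clarity, the Irecv/Waitany pattern described in (a) is essentially
the following (a minimal, self-contained sketch with a made-up message
size, tag and neighbour ranks; it is not the actual cosa.mpi code):

program waitany_sketch
  ! Sketch of the pattern: post a set of nonblocking receives, post the
  ! matching sends, then complete the receives one by one with
  ! MPI_Waitany. Message size and neighbour choice are illustrative.
  implicit none
  include 'mpif.h'
  integer, parameter :: n = 1024
  integer :: rank, nprocs, nreq, i, idx, ierr
  integer :: status(MPI_STATUS_SIZE)
  integer, allocatable :: rreq(:), sreq(:)
  double precision, allocatable :: rbuf(:,:), sbuf(:)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

  nreq = nprocs - 1
  allocate(rreq(nreq), sreq(nreq), rbuf(n,nreq), sbuf(n))
  sbuf = dble(rank)

  ! post one nonblocking receive per neighbour
  do i = 1, nreq
    call MPI_IRECV(rbuf(1,i), n, MPI_DOUBLE_PRECISION,     &
                   mod(rank+i, nprocs), 0, MPI_COMM_WORLD, &
                   rreq(i), ierr)
  end do

  ! matching nonblocking sends
  do i = 1, nreq
    call MPI_ISEND(sbuf, n, MPI_DOUBLE_PRECISION,          &
                   mod(rank+i, nprocs), 0, MPI_COMM_WORLD, &
                   sreq(i), ierr)
  end do

  ! loop as many times as there are requests: each MPI_WAITANY call
  ! returns the index of one completed receive, so after nreq
  ! iterations every receive has completed
  do i = 1, nreq
    call MPI_WAITANY(nreq, rreq, idx, status, ierr)
  end do

  call MPI_WAITALL(nreq, sreq, MPI_STATUSES_IGNORE, ierr)
  call MPI_FINALIZE(ierr)
end program waitany_sketch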

b) The problem occurs only when using openib (over TCP/IP it works) and
only when using more than one node of our main cluster. Trying many
possible workarounds, I found that running with:

-mca btl_openib_use_eager_rdma 0 -mca btl_openib_max_eager_rdma 0 -mca 
btl_openib_flags 1 

the problem seems not to occur.
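
For reference, these parameters go on the mpirun command line (or into
an MCA parameter file); a hypothetical invocation, with an arbitrary
process count, would look like:

mpirun -np 64 -mca btl_openib_use_eager_rdma 0 \
       -mca btl_openib_max_eager_rdma 0 \
       -mca btl_openib_flags 1 ./cosa.mpi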

I would be very thankful to anyone who can help me make sure there is
no bug in the code and, in any case, to discover the reason for such
"dangerous" behaviour.

I can give any further information if needed, and I apologize if the
post is not clear or complete enough.

regards, 
Francesco 

[OMPI users] mpirun should run with just the localhost interface on win?

2011-10-08 Thread MM
On WinXP, with the following network setup (just localhost; is it even up?):


C:\trunk-build-release>ipconfig /all

Windows IP Configuration

Host Name . . . . . . . . . . . . : SOMEHOSTNAME
Primary Dns Suffix  . . . . . . . : DOMAIN.SOMECO.COM
Node Type . . . . . . . . . . . . : Hybrid
IP Routing Enabled. . . . . . . . : No
WINS Proxy Enabled. . . . . . . . : No

Ethernet adapter Wireless Network Connection:

Media State . . . . . . . . . . . : Media disconnected
Description . . . . . . . . . . . : Intel(R) WiFi Link 5100 AGN
Physical Address. . . . . . . . . : SOMEMACADDRESS

C:\Trading\trunk-build-release>route print
===
Interface List
0x1 ... MS TCP Loopback interface
0x2 ...00 24 d6 10 05 4e .. Intel(R) WiFi Link 5100 AGN - Packet
Scheduler Miniport
===
===
Active Routes:
Network Destination        Netmask          Gateway       Interface  Metric
          127.0.0.0        255.0.0.0        127.0.0.1       127.0.0.1       1
    255.255.255.255  255.255.255.255  255.255.255.255               2       1
===
Persistent Routes:
  None


my mpirun fails with:

mpirun -np 1 .\nhui\Release\nhui.exe : -np 1 .\nhcomp\Release\nhcomp.exe


[SOMEHOSTNAME:04392] [[1866,0],0] ORTE_ERROR_LOG: Error in file
..\..\..\openmpi-1.5.4\orte\mca\ess\hnp\ess_hnp_module.c at line 215
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
  orte_rml_base_select failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--
[SOMEHOSTNAME:04392] [[1866,0],0] ORTE_ERROR_LOG: Error in file
..\..\..\openmpi-1.5.4\orte\runtime\orte_init.c at line 128
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
  orte_ess_set_name failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--
[SOMEHOSTNAME:04392] [[1866,0],0] ORTE_ERROR_LOG: Error in file
..\..\..\..\..\openmpi-1.5.4\orte\tools\orterun\orterun.c at line 616

When I turn on the network, so that the routing table becomes:

C:\>route print
===
Interface List
0x1 ... MS TCP Loopback interface
0x2 ...00 24 d6 10 05 4e .. Intel(R) WiFi Link 5100 AGN - Packet
Scheduler Miniport
===
===
Active Routes:
Network Destination        Netmask          Gateway       Interface  Metric
            0.0.0.0          0.0.0.0    192.168.1.254    192.168.1.88      25
          127.0.0.0        255.0.0.0        127.0.0.1       127.0.0.1       1
        192.168.1.0    255.255.255.0     192.168.1.88    192.168.1.88      25
       192.168.1.88  255.255.255.255        127.0.0.1       127.0.0.1      25
      192.168.1.255  255.255.255.255     192.168.1.88    192.168.1.88      25
          224.0.0.0        240.0.0.0     192.168.1.88    192.168.1.88      25
    255.255.255.255  255.255.255.255     192.168.1.88    192.168.1.88       1
Default Gateway: 192.168.1.254
===
Persistent Routes:
  None


mpirun works.