[OMPI users] MPI_Waitany segfaults or (maybe) hangs
Dear MPI users,

I am struggling with the bad behaviour of an MPI code. This is the basic information:

a) Intel Fortran 11 or 12 together with Open MPI 1.4.1 or 1.4.3 gives the same problem. Activating the -traceback compiler option, I see that the program stops inside MPI_Waitany. MPI_Waitany waits for the completion of an array of MPI_Irecv requests: looping over the number of array components, at the end all receives should be completed. The program stops at unpredictable points (after 1, 5 or 24 hours of computation). Sometimes I get a sigsegv:

mca_btl_openib.so  2BA74D29D181  Unknown            Unknown  Unknown
mca_btl_openib.so  2BA74D29C6FF  Unknown            Unknown  Unknown
mca_btl_openib.so  2BA74D29C033  Unknown            Unknown  Unknown
libopen-pal.so.0   2BA74835C3E6  Unknown            Unknown  Unknown
libmpi.so.0        2BA747E485AD  Unknown            Unknown  Unknown
libmpi.so.0        2BA747E7857D  Unknown            Unknown  Unknown
libmpi_f77.so.0    2BA747C047C4  Unknown            Unknown  Unknown
cosa.mpi           004F856B      waitanymessages_   1292     parallelutils.f
cosa.mpi           004C8044      cutman_q_          2084     bc.f
cosa.mpi           00413369      smooth_            2029     cosa.f
cosa.mpi           00410782      mg_                810      cosa.f
cosa.mpi           0040FB78      MAIN__             537      cosa.f
cosa.mpi           0040C1FC      Unknown            Unknown  Unknown
libc.so.6          2BA7490AE994  Unknown            Unknown  Unknown
cosa.mpi           0040C109      Unknown            Unknown  Unknown

--------------------------------------------------------------------------
mpirun has exited due to process rank 34 with PID 10335 on node neo251
exiting without calling "finalize". This may have caused other processes
in the application to be terminated by signals sent by mpirun (as
reported here).
--------------------------------------------------------------------------

waitanymessages is just a wrapper around MPI_Waitany. Sometimes the run stops writing anything to the screen and I do not know what is happening (probably MPI_Waitany hangs). Up to the segfault or the hang, the results are always correct, as checked against the serial version of the code.

b) The problem occurs only with openib (with TCP/IP it works) and only when using more than one node of our main cluster. Trying many possible workarounds, I found that when running with

-mca btl_openib_use_eager_rdma 0 -mca btl_openib_max_eager_rdma 0 -mca btl_openib_flags 1

the problem seems not to occur.

I would be very thankful to anyone who can help me make sure there is no bug in the code and, in any case, to discover the reason for such "dangerous" behaviour. I can give any further information if needed, and I apologize if the post is not clear or complete enough.

Regards,
Francesco
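
P.S. For reference, the communication pattern inside waitanymessages is essentially the one sketched below. This is only a simplified, self-contained illustration of what is described above (one MPI_Irecv per partner rank, then MPI_Waitany in a loop until all receives have completed); the buffer sizes, tags and variable names are illustrative and are not the actual code.

      program waitany_sketch
c     Sketch of the MPI_Irecv / MPI_Waitany pattern described above.
c     Assumes at most 65 ranks because of the static request array.
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 4
      integer ierr, rank, nprocs, i, idone, nrecv
      integer reqs(64), stat(MPI_STATUS_SIZE)
      double precision rbuf(n,64), sbuf(n)

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

c     post one non-blocking receive per partner rank
      nrecv = 0
      do i = 0, nprocs-1
        if (i.ne.rank) then
          nrecv = nrecv + 1
          call MPI_IRECV(rbuf(1,nrecv), n, MPI_DOUBLE_PRECISION,
     &         i, 100, MPI_COMM_WORLD, reqs(nrecv), ierr)
        end if
      end do

c     matching sends (safe: all receives are already posted)
      sbuf = dble(rank)
      do i = 0, nprocs-1
        if (i.ne.rank) then
          call MPI_SEND(sbuf, n, MPI_DOUBLE_PRECISION, i, 100,
     &         MPI_COMM_WORLD, ierr)
        end if
      end do

c     like waitanymessages: loop so that, at the end, every
c     pending receive has completed
      do i = 1, nrecv
        call MPI_WAITANY(nrecv, reqs, idone, stat, ierr)
      end do

      call MPI_FINALIZE(ierr)
      end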
[OMPI users] mpirun should run with just the localhost interface on win?
On WinXP, with the following network setup (just localhost; is the network even on?):

C:\trunk-build-release>ipconfig /all

Windows IP Configuration

   Host Name . . . . . . . . . . . . : SOMEHOSTNAME
   Primary Dns Suffix  . . . . . . . : DOMAIN.SOMECO.COM
   Node Type . . . . . . . . . . . . : Hybrid
   IP Routing Enabled. . . . . . . . : No
   WINS Proxy Enabled. . . . . . . . : No

Ethernet adapter Wireless Network Connection:

   Media State . . . . . . . . . . . : Media disconnected
   Description . . . . . . . . . . . : Intel(R) WiFi Link 5100 AGN
   Physical Address. . . . . . . . . : SOMEMACADDRESS

C:\Trading\trunk-build-release>route print
===========================================================================
Interface List
0x1 ........................... MS TCP Loopback interface
0x2 ...00 24 d6 10 05 4e ...... Intel(R) WiFi Link 5100 AGN - Packet Scheduler Miniport
===========================================================================
===========================================================================
Active Routes:
Network Destination        Netmask          Gateway       Interface  Metric
        127.0.0.0        255.0.0.0        127.0.0.1       127.0.0.1       1
  255.255.255.255  255.255.255.255  255.255.255.255               2       1
===========================================================================
Persistent Routes:
  None

my mpirun fails as follows:

mpirun -np 1 .\nhui\Release\nhui.exe : -np 1 .\nhcomp\Release\nhcomp.exe

[SOMEHOSTNAME:04392] [[1866,0],0] ORTE_ERROR_LOG: Error in file ..\..\..\openmpi-1.5.4\orte\mca\ess\hnp\ess_hnp_module.c at line 215
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_rml_base_select failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[SOMEHOSTNAME:04392] [[1866,0],0] ORTE_ERROR_LOG: Error in file ..\..\..\openmpi-1.5.4\orte\runtime\orte_init.c at line 128
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_set_name failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[SOMEHOSTNAME:04392] [[1866,0],0] ORTE_ERROR_LOG: Error in file ..\..\..\..\..\openmpi-1.5.4\orte\tools\orterun\orterun.c at line 616

When I turn on the network, so that I have:

C:\>route print
===========================================================================
Interface List
0x1 ........................... MS TCP Loopback interface
0x2 ...00 24 d6 10 05 4e ...... Intel(R) WiFi Link 5100 AGN - Packet Scheduler Miniport
===========================================================================
===========================================================================
Active Routes:
Network Destination        Netmask          Gateway       Interface  Metric
          0.0.0.0          0.0.0.0    192.168.1.254    192.168.1.88      25
        127.0.0.0        255.0.0.0        127.0.0.1       127.0.0.1       1
      192.168.1.0    255.255.255.0     192.168.1.88    192.168.1.88      25
     192.168.1.88  255.255.255.255        127.0.0.1       127.0.0.1      25
    192.168.1.255  255.255.255.255     192.168.1.88    192.168.1.88      25
        224.0.0.0        240.0.0.0     192.168.1.88    192.168.1.88      25
  255.255.255.255  255.255.255.255     192.168.1.88    192.168.1.88       1
Default Gateway:     192.168.1.254
===========================================================================
Persistent Routes:
  None

mpirun works.