About the -x, I've been trying it both ways and prefer the latter; the results are the same either way, and in any case its value is correct. (A quick way to double-check what the remote ranks actually receive is sketched after the output below.) I've attached the ompi_info output from node-1 and node-2. Sorry for not zipping them, but they were small and I think I'd have firewall issues.

$ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h13-15 -np 6 --mca pml cm ./cpi
[node-14:19260] mx_connect fail for node-14:0 with key aaaaffff (error Endpoint closed or not connectable!)
[node-14:19261] mx_connect fail for node-14:0 with key aaaaffff (error Endpoint closed or not connectable!)
...

Is there any documentation anywhere on the MTL? Anyway, I've run with the MTL, and it actually worked once, but now I can't reproduce that and it's throwing signal 7s, 11s, and 4s depending on the number of processes I give it. Now that you mention the mapper, I take it that's what SEGV_MAPERR might be referring to; I'm looking into the mapper.

$ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 --mca mtl mx,self ./cpi
Process 4 of 5 is on node-2
Process 0 of 5 is on node-1
Process 1 of 5 is on node-1
Process 2 of 5 is on node-1
Process 3 of 5 is on node-1
pi is approximately 3.1415926544231225, Error is 0.0000000008333294
wall clock time = 0.019305
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b88243862be
mpirun noticed that job rank 0 with PID 0 on node node-1 exited on signal 1.
4 additional processes aborted (not shown)
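As a sanity check on the -x question (just a sketch; it assumes printenv and ldd are on the compute nodes' default PATH and that ./cpi sits at the same path on every node), mpirun can launch plain system commands instead of cpi to show what the remote ranks actually receive:

# What LD_LIBRARY_PATH arrives with the bare -x form
$ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h1-3 -np 5 printenv LD_LIBRARY_PATH

# Same check with the explicit assignment form
$ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 printenv LD_LIBRARY_PATH

# Which MX / Open MPI libraries cpi actually resolves on each node
$ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 ldd ./cpi

If the first two print the same paths for every rank and ldd picks up the intended MX and Open MPI libraries on both nodes, the -x form can be ruled out as the culprit.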
Or sometimes, again depending on the number of processes, I'll get this error instead ...

mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 7 --mca mtl mx,self ./cpi
Signal:7 info.si_errno:0(Success) si_code:2()
Failing at addr:0x2aaaaaaab000
[ 0] func:/usr/local/openmpi-1.2b2/lib/libopen-pal.so.0(opal_backtrace_print+0x1f) [0x2b9b7fa52d1f]
[ 1] func:/usr/local/openmpi-1.2b2/lib/libopen-pal.so.0 [0x2b9b7fa51871]
[ 2] func:/lib/libpthread.so.0 [0x2b9b80013d00]
[ 3] func:/usr/local/openmpi-1.2b2/lib/libmca_common_sm.so.0(mca_common_sm_mmap_init+0x1e3) [0x2b9b8270ef83]
[ 4] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_mpool_sm.so [0x2b9b8260d0ff]
[ 5] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(mca_mpool_base_module_create+0x70) [0x2b9b7f7afac0]
[ 6] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_btl_sm.so(mca_btl_sm_add_procs_same_base_addr+0x907) [0x2b9b83070517]
[ 7] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_bml_r2.so(mca_bml_r2_add_procs+0x206) [0x2b9b82d5f576]
[ 8] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xe3) [0x2b9b82a2d0a3]
[ 9] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(ompi_mpi_init+0x697) [0x2b9b7f77be07]
[10] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(MPI_Init+0x83) [0x2b9b7f79c943]
[11] func:./cpi(main+0x42) [0x400cd5]
[12] func:/lib/libc.so.6(__libc_start_main+0xf4) [0x2b9b8013a134]
[13] func:./cpi [0x400bd9]
*** End of error message ***
Process 4 of 7 is on node-2
Process 5 of 7 is on node-2
Process 6 of 7 is on node-2
Process 0 of 7 is on node-1
Process 1 of 7 is on node-1
Process 2 of 7 is on node-1
Process 3 of 7 is on node-1
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.009331
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b4ba33652be
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b8685aba2be
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b304ffbe2be
mpirun noticed that job rank 0 with PID 0 on node node-1 exited on signal 1.
6 additional processes aborted (not shown)

OK, so I take it one host is down. Would that be the cause of all the different errors I'm seeing?

$ fm_status
FMS Fabric status
17 hosts known
16 FMAs found
3 un-ACKed alerts
Mapping is complete, last map generated by node-20
Database generation not yet complete.
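Since fm_status only says that one FMA is missing, a quick way to see which host dropped out (again just a sketch: it assumes passwordless ssh to the compute nodes, mx_info on their default PATH, and whichever node list matches the hostfile in use) is to ask every NIC directly:

# Print each node's MX NIC status and its view of the mapped fabric
for n in node-1 node-2 node-3; do
    echo "=== $n ==="
    ssh $n mx_info
done

A node whose mx_info reports the link down, or far fewer mapped peers than the others, would line up with the 17-hosts / 16-FMAs count above.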
________________________________
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Reese Faucette
Sent: Tuesday, January 02, 2007 2:52 PM
To: Open MPI Users
Subject: Re: [OMPI users] Ompi failing on mx only

Hi, Gary-

This looks like a config problem, not a code problem yet. Could you send the output of mx_info from node-1 and from node-2? Also, forgive me for counter-asking a possibly dumb OMPI question, but is "-x LD_LIBRARY_PATH" really what you want, as opposed to "-x LD_LIBRARY_PATH=${LD_LIBRARY_PATH}"? (I would not be surprised if not specifying a value defaults to this behavior, but I have to ask.)

Also, have you tried the MX MTL as opposed to the BTL? --mca pml cm --mca mtl mx,self (it looks like you did)

"[node-2:10464] mx_connect fail for node-2:0 with key aaaaffff" makes it look like your fabric may not be fully mapped, or you may have a down link.

thanks,
-reese
Myricom, Inc.

I was initially using 1.1.2 and moved to 1.2b2 because of a hang in MPI_Bcast() that 1.2b2 is reported to fix, and it seems to have done so. My compute nodes each have two dual-core Xeons and are on Myrinet with MX.

The problem is trying to get Open MPI running on MX only. My machine file is as follows ...

node-1 slots=4 max-slots=4
node-2 slots=4 max-slots=4
node-3 slots=4 max-slots=4

'mpirun' with the minimum number of processes needed to trigger the error ...

mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h1-3 -np 2 --mca btl mx,self ./cpi

I don't believe there's anything wrong with the hardware, as I can ping over MX between the failed node and the master just fine. I also tried a different set of three nodes and got the same error; it always fails on the second node of whatever group of nodes I choose.
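For reference, the two transport selections discussed in this thread pair up as below. The explicit "--mca pml ob1" is an addition here to spell out the default PML; everything else is taken from the commands above.

# MX as a BTL under the (default) ob1 PML
mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 2 --mca pml ob1 --mca btl mx,self ./cpi

# MX as an MTL under the cm PML
mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 2 --mca pml cm --mca mtl mx,self ./cpi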
Attachment: node-2.out
Attachment: node-1.out