About the -x, I've been trying it both ways and prefer the latter; the results are the same either way, and in any case its value is correct. (A quick way to double-check what the remote ranks actually receive is sketched after the output below.) I've attached the ompi_info output from node-1 and node-2. Sorry for not zipping them, but they were small and I think I'd have firewall issues.

$ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h13-15 -np 6 --mca pml cm ./cpi
[node-14:19260] mx_connect fail for node-14:0 with key aaaaffff (error Endpoint closed or not connectable!)
[node-14:19261] mx_connect fail for node-14:0 with key aaaaffff (error Endpoint closed or not connectable!)
...

Is there any documentation anywhere on the MTL? Anyway, I've run with the MTL, and it actually worked once, but now I can't reproduce that and it's throwing signal 7s, 11s, and 4s depending on the number of processes I give it. Now that you mention the mapper, I take it that's what SEGV_MAPERR might be referring to; I'm looking into the mapper.

$ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 --mca mtl mx,self ./cpi
Process 4 of 5 is on node-2
Process 0 of 5 is on node-1
Process 1 of 5 is on node-1
Process 2 of 5 is on node-1
Process 3 of 5 is on node-1
pi is approximately 3.1415926544231225, Error is 0.0000000008333294
wall clock time = 0.019305
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b88243862be
mpirun noticed that job rank 0 with PID 0 on node node-1 exited on signal 1.
4 additional processes aborted (not shown)
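As a sanity check on the -x question (just a sketch; it assumes printenv and ldd are on the compute nodes' default PATH and that ./cpi sits at the same path on every node), mpirun can launch plain system commands instead of cpi to show what the remote ranks actually receive:

# What LD_LIBRARY_PATH arrives with the bare -x form
$ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h1-3 -np 5 printenv LD_LIBRARY_PATH

# Same check with the explicit assignment form
$ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 printenv LD_LIBRARY_PATH

# Which MX / Open MPI libraries cpi actually resolves on each node
$ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 ldd ./cpi

If the first two print the same paths for every rank and ldd picks up the intended MX and Open MPI libraries on both nodes, the -x form can be ruled out as the culprit.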
Or sometimes, again depending on the number of processes, I'll get this error instead ...

mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 7 --mca mtl mx,self ./cpi
Signal:7 info.si_errno:0(Success) si_code:2()
Failing at addr:0x2aaaaaaab000
[ 0] func:/usr/local/openmpi-1.2b2/lib/libopen-pal.so.0(opal_backtrace_print+0x1f) [0x2b9b7fa52d1f]
[ 1] func:/usr/local/openmpi-1.2b2/lib/libopen-pal.so.0 [0x2b9b7fa51871]
[ 2] func:/lib/libpthread.so.0 [0x2b9b80013d00]
[ 3] func:/usr/local/openmpi-1.2b2/lib/libmca_common_sm.so.0(mca_common_sm_mmap_init+0x1e3) [0x2b9b8270ef83]
[ 4] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_mpool_sm.so [0x2b9b8260d0ff]
[ 5] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(mca_mpool_base_module_create+0x70) [0x2b9b7f7afac0]
[ 6] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_btl_sm.so(mca_btl_sm_add_procs_same_base_addr+0x907) [0x2b9b83070517]
[ 7] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_bml_r2.so(mca_bml_r2_add_procs+0x206) [0x2b9b82d5f576]
[ 8] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xe3) [0x2b9b82a2d0a3]
[ 9] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(ompi_mpi_init+0x697) [0x2b9b7f77be07]
[10] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(MPI_Init+0x83) [0x2b9b7f79c943]
[11] func:./cpi(main+0x42) [0x400cd5]
[12] func:/lib/libc.so.6(__libc_start_main+0xf4) [0x2b9b8013a134]
[13] func:./cpi [0x400bd9]
*** End of error message ***
Process 4 of 7 is on node-2
Process 5 of 7 is on node-2
Process 6 of 7 is on node-2
Process 0 of 7 is on node-1
Process 1 of 7 is on node-1
Process 2 of 7 is on node-1
Process 3 of 7 is on node-1
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.009331
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b4ba33652be
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b8685aba2be
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b304ffbe2be
mpirun noticed that job rank 0 with PID 0 on node node-1 exited on signal 1.
6 additional processes aborted (not shown)

OK, so I take it one host is down. Would that be the cause of all the different errors I'm seeing?

$ fm_status
FMS Fabric status
17 hosts known
16 FMAs found
3 un-ACKed alerts
Mapping is complete, last map generated by node-20
Database generation not yet complete.
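Since fm_status only says that one FMA is missing, a quick way to see which host dropped out (again just a sketch: it assumes passwordless ssh to the compute nodes, mx_info on their default PATH, and whichever node list matches the hostfile in use) is to ask every NIC directly:

# Print each node's MX NIC status and its view of the mapped fabric
for n in node-1 node-2 node-3; do
    echo "=== $n ==="
    ssh $n mx_info
done

A node whose mx_info reports the link down, or far fewer mapped peers than the others, would line up with the 17-hosts / 16-FMAs count above.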
________________________________
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Reese Faucette
Sent: Tuesday, January 02, 2007 2:52 PM
To: Open MPI Users
Subject: Re: [OMPI users] Ompi failing on mx only

Hi, Gary-

This looks like a config problem, not a code problem yet. Could you send the output of mx_info from node-1 and from node-2? Also, forgive me for counter-asking a possibly dumb OMPI question, but is "-x LD_LIBRARY_PATH" really what you want, as opposed to "-x LD_LIBRARY_PATH=${LD_LIBRARY_PATH}"? (I would not be surprised if not specifying a value defaults to this behavior, but I have to ask.)

Also, have you tried the MX MTL as opposed to the BTL? --mca pml cm --mca mtl mx,self (it looks like you did)

"[node-2:10464] mx_connect fail for node-2:0 with key aaaaffff" makes it look like your fabric may not be fully mapped, or you may have a down link.

thanks,
-reese
Myricom, Inc.

I was initially using 1.1.2 and moved to 1.2b2 because of a hang in MPI_Bcast() that 1.2b2 is reported to fix, and it seems to have done so. My compute nodes each have two dual-core Xeons and are on Myrinet with MX.

The problem is trying to get Open MPI running on MX only. My machine file is as follows ...

node-1 slots=4 max-slots=4
node-2 slots=4 max-slots=4
node-3 slots=4 max-slots=4

'mpirun' with the minimum number of processes needed to trigger the error ...

mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h1-3 -np 2 --mca btl mx,self ./cpi

I don't believe there's anything wrong with the hardware, as I can ping over MX between the failed node and the master just fine. I also tried a different set of three nodes and got the same error; it always fails on the second node of whatever group of nodes I choose.
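For reference, the two transport selections discussed in this thread pair up as below. The explicit "--mca pml ob1" is an addition here to spell out the default PML; everything else is taken from the commands above.

# MX as a BTL under the (default) ob1 PML
mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 2 --mca pml ob1 --mca btl mx,self ./cpi

# MX as an MTL under the cm PML
mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 2 --mca pml cm --mca mtl mx,self ./cpi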
Attachment: node-2.out
Attachment: node-1.out