Huh; wonky.  

Can you set the MCA parameter "mpi_abort_delay" to -1 and run your job again? 
This will prevent all the processes from dying when MPI_ABORT is invoked.  Then 
attach a debugger to one of the still-live processes after the error message is 
printed.  Can you send the stack trace?  It would be interesting to know what 
is going on here -- I can't think of a reason that would happen offhand.
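Something like this should do it (a sketch -- the app name, node name, and PID are placeholders for whatever your job actually uses):

```shell
# Keep processes alive after MPI_ABORT so they can be inspected
mpirun --mca mpi_abort_delay -1 -np 24 -machinefile dwhosts ./your_app

# After the abort message prints, on one of the nodes with live ranks:
ssh n16                  # substitute a node from the error output
pgrep your_app           # find a surviving process ID
gdb -p <PID>             # attach, then "thread apply all bt" for stack traces
```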


On Jun 30, 2011, at 5:03 PM, David Warren wrote:

> I have a cluster with mostly Mellanox ConnectX hardware and a few with Qlogic 
> QLE7340's. After looking through the web, FAQs etc. I built openmpi-1.5.3 
> with psm and openib. If I run within the same hardware it is fast and works 
> fine. If I run between the two hardware types without specifying an MTL (e.g. 
> mpirun -np 24 -machinefile dwhosts --byslot --bind-to-core --mca btl ^tcp ...) 
> it dies with
> *** The MPI_Init() function was called before MPI_INIT was invoked.
> *** This is disallowed by the MPI standard.
> *** Your MPI job will now abort.
> [n16:9438] Abort before MPI_INIT completed successfully; not able to
> guarantee that all other processes were killed!
> *** The MPI_Init() function was called before MPI_INIT was invoked.
> *** This is disallowed by the MPI standard.
> *** Your MPI job will now abort.
> ...
> I can make it run by giving a bad mtl, e.g. -mca mtl psm,none. All the 
> processes run after complaining that mtl none does not exist. However, they 
> run slowly (about 10% slower than either hardware set runs alone).
> 
> Pertinent info:
> On the Qlogic Nodes:
> OFED: QLogic-OFED.SLES11-x86_64.1.5.3.0.22
> On the Mellanox Nodes:
> OFED-1.5.2.1-20101105-0600
> 
> All:
> debian lenny kernel 2.6.32.41
> OpenSM
> limit | grep memorylocked gives unlimited on all nodes.
> 
> Configure line:
> ./configure --with-libnuma --with-openib --prefix=/usr/local/openmpi-1.5.3 
> --with-psm=/usr --enable-btl-openib-failover --enable-openib-connectx-xrc 
> --enable-openib-rdmacm
> 
> I thought that with 1.5.3 I am supposed to be able to do this. Am I just 
> wrong? Does anyone see what I am doing wrong?
> 
> Thanks
> <mellanox_devinfo.gz> <mellanox_ifconfig.gz> <ompi_info_output.gz> <qlogic_devinfo.gz> <qlogic_ifconfig.gz> <warren.vcf>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

