Sorry for the late reply.

Other users have seen something similar but we have never been able to 
reproduce it.  Is this only when using IB?  If you use "mpirun --mca 
btl_openib_cpc_if_include rdmacm", does the problem go away?


On May 11, 2011, at 6:00 PM, Marcus R. Epperson wrote:

> I've seen the same thing when I build openmpi 1.4.3 with Intel 12, but only 
> when I have -O2 or -O3 in CFLAGS. If I drop it down to -O1 then the 
> collective hangs go away. I don't know what, if anything, the higher 
> optimization buys you when compiling openmpi, so I'm not sure if that's an 
> acceptable workaround or not.
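> 
> For what it's worth, the lower-optimization rebuild is nothing exotic; it's 
> roughly something like this (compiler names and prefix are only an 
> illustration, not my exact configure line):
> 
>   ./configure CC=icc CXX=icpc F77=ifort FC=ifort CFLAGS=-O1 --prefix=[instdir]
>   make all install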
> 
> My system is similar to yours - Intel X5570 with QDR Mellanox IB running RHEL 
> 5, Slurm, and these openmpi btls: openib,sm,self. I'm using IMB 3.2.2 with a 
> single iteration of Barrier to reproduce the hang, and it happens 100% of the 
> time for me when I invoke it like this:
> 
> # salloc -N 9 orterun -n 65 ./IMB-MPI1 -npmin 64 -iter 1 barrier
> 
> The hang happens on the first Barrier (64 ranks), and each of the 
> participating ranks has this backtrace:
> 
> __poll (...)
> poll_dispatch () from [instdir]/lib/libopen-pal.so.0
> opal_event_loop () from [instdir]/lib/libopen-pal.so.0
> opal_progress () from [instdir]/lib/libopen-pal.so.0
> ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
> ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
> ompi_coll_tuned_barrier_intra_recursivedoubling () from [instdir]/lib/libmpi.so.0
> ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0
> PMPI_Barrier () from [instdir]/lib/libmpi.so.0
> IMB_barrier ()
> IMB_init_buffers_iter ()
> main ()
> 
> The one non-participating rank has this backtrace:
> 
> __poll (...)
> poll_dispatch () from [instdir]/lib/libopen-pal.so.0
> opal_event_loop () from [instdir]/lib/libopen-pal.so.0
> opal_progress () from [instdir]/lib/libopen-pal.so.0
> ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
> ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
> ompi_coll_tuned_barrier_intra_bruck () from [instdir]/lib/libmpi.so.0
> ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0
> PMPI_Barrier () from [instdir]/lib/libmpi.so.0
> main ()
> 
> If I use more nodes I can get it to hang with 1ppn, so that seems to rule out 
> the sm btl (or interactions with it) as a culprit at least.
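> 
> (To double-check that, one could also leave sm out of the BTL list 
> explicitly, e.g. something like "orterun --mca btl openib,self -n 64 
> ./IMB-MPI1 -npmin 64 -iter 1 barrier"; the exact command line is only a 
> sketch.)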
> 
> I can't reproduce this with openmpi 1.5.3, interestingly.
> 
> -Marcus
> 
> 
> On 05/10/2011 03:37 AM, Salvatore Podda wrote:
>> Dear all,
>> 
>> we succeeded in building several versions of openmpi, from 1.2.8 to 1.4.3, 
>> with Intel composer XE 2011 (aka 12.0).
>> However, we found a threshold in the number of cores (depending on the 
>> application: IMB, xhpl or user applications, and on the number of required 
>> cores) above which the application hangs (a sort of deadlock).
>> Builds of openmpi with 'gcc' and 'pgi' do not show the same limits.
>> Are there any known incompatibilities of openmpi with this version of the 
>> Intel compilers?
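>> 
>> (For reference, a typical configure invocation for such a build would be 
>> something like the following; the exact flags and prefix are only 
>> illustrative:
>> 
>>   ./configure CC=icc CXX=icpc F77=ifort FC=ifort --prefix=<install dir>
>>   make all install)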
>> 
>> The characteristics of our computational infrastructure are:
>> 
>> Intel processors E7330, E5345, E5530 and E5620
>> 
>> CentOS 5.3, CentOS 5.5.
>> 
>> Intel composer XE 2011
>> gcc 4.1.2
>> pgi 10.2-1
>> 
>> Regards
>> 
>> Salvatore Podda
>> 
>> ENEA UTICT-HPC
>> Department for Computer Science Development and ICT
>> Facilities Laboratory for Science and High Performance Computing
>> C.R. Frascati
>> Via E. Fermi, 45
>> PoBox 65
>> 00044 Frascati (Rome)
>> Italy
>> 
>> Tel: +39 06 9400 5342
>> Fax: +39 06 9400 5551
>> Fax: +39 06 9400 5735
>> E-mail: salvatore.po...@enea.it
>> Home Page: www.cresco.enea.it
>> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

