I've seen the same thing when I build openmpi 1.4.3 with Intel 12, but only 
when I have -O2 or -O3 in CFLAGS. If I drop it down to -O1 then the collective 
hangs go away. I don't know what, if anything, the higher optimization buys you 
when compiling openmpi, so I'm not sure whether that's an acceptable workaround 
or not.
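
Just so we're comparing apples to apples, the -O1 build I mean is configured 
roughly like this (a sketch, not my exact command line; the compiler names and 
install prefix are whatever your site uses):

# ./configure CC=icc CXX=icpc F77=ifort FC=ifort \
      CFLAGS=-O1 CXXFLAGS=-O1 FFLAGS=-O1 FCFLAGS=-O1 --prefix=[instdir]
# make all install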

My system is similar to yours - Intel X5570 with QDR Mellanox IB running RHEL 
5, Slurm, and these openmpi btls: openib,sm,self. I'm using IMB 3.2.2 with a 
single iteration of Barrier to reproduce the hang, and it happens 100% of the 
time for me when I invoke it like this:

# salloc -N 9 orterun -n 65 ./IMB-MPI1 -npmin 64 -iter 1 barrier

The hang happens on the first Barrier (64 ranks; with 65 ranks and -npmin 64, 
IMB leaves one rank out of the first round) and each of the participating 
ranks has this backtrace:

__poll (...)
poll_dispatch () from [instdir]/lib/libopen-pal.so.0
opal_event_loop () from [instdir]/lib/libopen-pal.so.0
opal_progress () from [instdir]/lib/libopen-pal.so.0
ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_recursivedoubling () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0
PMPI_Barrier () from [instdir]/lib/libmpi.so.0
IMB_barrier ()
IMB_init_buffers_iter ()
main ()

The one non-participating rank has this backtrace:

__poll (...)
poll_dispatch () from [instdir]/lib/libopen-pal.so.0
opal_event_loop () from [instdir]/lib/libopen-pal.so.0
opal_progress () from [instdir]/lib/libopen-pal.so.0
ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_bruck () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0
PMPI_Barrier () from [instdir]/lib/libmpi.so.0
main ()
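
For anyone trying to reproduce this: backtraces like the ones above can be 
grabbed by attaching gdb to a hung rank on one of the compute nodes, e.g. 
(<pid> being the process id of a stuck IMB-MPI1 process):

# gdb -batch -ex bt -p <pid>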

If I use more nodes I can get it to hang with one process per node (1ppn), so 
that seems to rule out the sm btl (or interactions with it) as a culprit, at 
least.
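
To take sm out of the picture explicitly, rather than just via rank placement, 
the btl list can also be restricted on the orterun command line. Something 
like this (untested in exactly this form) should push all traffic over openib:

# salloc -N 9 orterun -n 65 --mca btl openib,self ./IMB-MPI1 -npmin 64 -iter 1 barrier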

I can't reproduce this with openmpi 1.5.3, interestingly.
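
One more experiment that might be worth a shot, since both backtraces end up 
in the coll_tuned barrier code: forcing a different barrier algorithm through 
the tuned component's MCA parameters. I haven't verified that this avoids the 
hang, and the algorithm numbering below is from memory (1 should be the plain 
linear barrier; ompi_info --param coll tuned lists the choices):

# salloc -N 9 orterun -n 65 --mca coll_tuned_use_dynamic_rules 1 \
    --mca coll_tuned_barrier_algorithm 1 ./IMB-MPI1 -npmin 64 -iter 1 barrier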

-Marcus


On 05/10/2011 03:37 AM, Salvatore Podda wrote:
> Dear all,
> 
> we succeeded in building several versions of openmpi, from 1.2.8 to 1.4.3, 
> with Intel composer XE 2011 (aka 12.0).
> However, we found a threshold in the number of cores (depending on the 
> application: IMB, xhpl, or user applications, and on the number of required 
> cores) above which the application hangs (a sort of deadlock).
> Building openmpi with 'gcc' and 'pgi' does not show the same limits.
> Are there any known incompatibilities of openmpi with this version of the 
> Intel compilers?
> 
> The characteristics of our computational infrastructure are:
> 
> Intel processors E7330, E5345, E5530 and E5620
> 
> CentOS 5.3, CentOS 5.5.
> 
> Intel composer XE 2011
> gcc 4.1.2
> pgi 10.2-1
> 
> Regards
> 
> Salvatore Podda
> 
> ENEA UTICT-HPC
> Department for Computer Science Development and ICT
> Facilities Laboratory for Science and High Performance Computing
> C.R. Frascati
> Via E. Fermi, 45
> PoBox 65
> 00044 Frascati (Rome)
> Italy
> 
> Tel: +39 06 9400 5342
> Fax: +39 06 9400 5551
> Fax: +39 06 9400 5735
> E-mail: salvatore.po...@enea.it
> Home Page: www.cresco.enea.it
