Sorry for the late reply. Other users have seen something similar, but we have never been able to reproduce it. Is this only when using IB? If you use "mpirun --mca btl_openib_cpc_include rdmacm", does the problem go away?
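For example, a full invocation along those lines (illustrative only -- it borrows the rank counts and IMB binary from Marcus's reproducer quoted below) might look like:

  mpirun --mca btl_openib_cpc_include rdmacm -np 64 ./IMB-MPI1 -npmin 64 -iter 1 barrier

That should select the RDMA CM connection manager for the openib BTL instead of the default OOB-based wireup; if the hang disappears with it, the problem is more likely in connection establishment than in the collective algorithm itself.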
On May 11, 2011, at 6:00 PM, Marcus R. Epperson wrote:

> I've seen the same thing when I build Open MPI 1.4.3 with Intel 12, but only
> when I have -O2 or -O3 in CFLAGS. If I drop it down to -O1 then the
> collective hangs go away. I don't know what, if anything, the higher
> optimization buys you when compiling Open MPI, so I'm not sure if that's an
> acceptable workaround or not.
>
> My system is similar to yours - Intel X5570 with QDR Mellanox IB running
> RHEL 5, Slurm, and these Open MPI BTLs: openib, sm, self. I'm using IMB
> 3.2.2 with a single iteration of Barrier to reproduce the hang, and it
> happens 100% of the time for me when I invoke it like this:
>
>   # salloc -N 9 orterun -n 65 ./IMB-MPI1 -npmin 64 -iter 1 barrier
>
> The hang happens on the first Barrier (64 ranks), and each of the
> participating ranks has this backtrace:
>
>   __poll (...)
>   poll_dispatch () from [instdir]/lib/libopen-pal.so.0
>   opal_event_loop () from [instdir]/lib/libopen-pal.so.0
>   opal_progress () from [instdir]/lib/libopen-pal.so.0
>   ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
>   ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
>   ompi_coll_tuned_barrier_intra_recursivedoubling () from [instdir]/lib/libmpi.so.0
>   ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0
>   PMPI_Barrier () from [instdir]/lib/libmpi.so.0
>   IMB_barrier ()
>   IMB_init_buffers_iter ()
>   main ()
>
> The one non-participating rank has this backtrace:
>
>   __poll (...)
>   poll_dispatch () from [instdir]/lib/libopen-pal.so.0
>   opal_event_loop () from [instdir]/lib/libopen-pal.so.0
>   opal_progress () from [instdir]/lib/libopen-pal.so.0
>   ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
>   ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
>   ompi_coll_tuned_barrier_intra_bruck () from [instdir]/lib/libmpi.so.0
>   ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0
>   PMPI_Barrier () from [instdir]/lib/libmpi.so.0
>   main ()
>
> If I use more nodes I can get it to hang with 1 ppn, so that seems to rule
> out the sm BTL (or interactions with it) as a culprit, at least.
>
> Interestingly, I can't reproduce this with Open MPI 1.5.3.
>
> -Marcus
>
>
> On 05/10/2011 03:37 AM, Salvatore Podda wrote:
>> Dear all,
>>
>> we succeeded in building several versions of Open MPI, from 1.2.8 to
>> 1.4.3, with Intel Composer XE 2011 (aka 12.0). However, we found a
>> threshold in the number of cores (depending on the application - IMB,
>> xhpl, or user applications - and on the number of required cores) above
>> which the application hangs (a sort of deadlock). Builds of Open MPI
>> with gcc and pgi do not show the same limits. Are there any known
>> incompatibilities of Open MPI with this version of the Intel compilers?
>>
>> The characteristics of our computational infrastructure are:
>>
>> Intel processors E7330, E5345, E5530 and E5620
>> CentOS 5.3, CentOS 5.5
>> Intel Composer XE 2011
>> gcc 4.1.2
>> pgi 10.2-1
>>
>> Regards
>>
>> Salvatore Podda
>>
>> ENEA UTICT-HPC
>> Department for Computer Science Development and ICT Facilities
>> Laboratory for Science and High Performance Computing
>> C.R. Frascati
>> Via E. Fermi, 45
>> PoBox 65
>> 00044 Frascati (Rome)
>> Italy
>>
>> Tel: +39 06 9400 5342
>> Fax: +39 06 9400 5551
>> Fax: +39 06 9400 5735
>> E-mail: salvatore.po...@enea.it
>> Home Page: www.cresco.enea.it

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
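For anyone who wants to try reproducing this without IMB, a minimal barrier test in the spirit of Marcus's single-iteration run above (a sketch only -- the real IMB-MPI1 does far more setup; the file name and prints are illustrative):

  /* barrier_min.c - minimal stand-in for "IMB-MPI1 -npmin 64 -iter 1 barrier". */
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, size;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      if (rank == 0)
          printf("entering MPI_Barrier with %d ranks\n", size);

      /* Per the backtraces above, at 64 ranks the tuned collectives
         choose the recursive-doubling barrier; this is the call in
         which the participating ranks were stuck. */
      MPI_Barrier(MPI_COMM_WORLD);

      if (rank == 0)
          printf("MPI_Barrier completed\n");

      MPI_Finalize();
      return 0;
  }

Build and launch it the same way as IMB, e.g. "mpicc -o barrier_min barrier_min.c" and then "orterun -n 64 ./barrier_min" under salloc.

And to test Marcus's -O1 workaround, a rebuild along these lines (install prefix and compiler names are illustrative; CC/CXX/F77/FC and CFLAGS are the standard Open MPI configure variables):

  ./configure --prefix=/opt/openmpi-1.4.3-intel12 \
      CC=icc CXX=icpc F77=ifort FC=ifort CFLAGS=-O1
  make all install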