Something is going wrong with the ml collective component, so disabling it makes things work. I just reconfigured without any CUDA-aware support and see the same failure, so it has nothing to do with CUDA.
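If it helps, the same exclusion can be made persistent instead of being passed on every mpirun command line, either through the per-user MCA parameter file or through an environment variable. A minimal sketch, assuming a standard Open MPI installation (adjust paths to your own setup):

$ mkdir -p $HOME/.openmpi
$ echo "coll = ^ml" >> $HOME/.openmpi/mca-params.conf    # per-user MCA parameter file

or, for a single shell session only:

$ export OMPI_MCA_coll=^ml
$ mpirun -np 4 --map-by ppr:1:socket -bind-to core osu_alltoall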
Looks like Jeff Squyres just filed a bug for it: https://svn.open-mpi.org/trac/ompi/ticket/4331

>-----Original Message-----
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Filippo Spiga
>Sent: Monday, March 03, 2014 7:32 PM
>To: Open MPI Users
>Subject: Re: [OMPI users] 1.7.5rc1, error "COLL-ML ml_discover_hierarchy exited with error."
>
>Dear Rolf,
>
>your suggestion works!
>
>$ mpirun -np 4 --map-by ppr:1:socket -bind-to core --mca coll ^ml osu_alltoall
># OSU MPI All-to-All Personalized Exchange Latency Test v4.2
># Size          Avg Latency(us)
>1                        8.02
>2                        2.96
>4                        2.91
>8                        2.91
>16                       2.96
>32                       3.07
>64                       3.25
>128                      3.74
>256                      3.85
>512                      4.11
>1024                     4.79
>2048                     5.91
>4096                    15.84
>8192                    24.88
>16384                   35.35
>32768                   56.20
>65536                   66.88
>131072                 114.89
>262144                 209.36
>524288                 396.12
>1048576                765.65
>
>Can you clarify exactly where the problem comes from?
>
>Regards,
>Filippo
>
>
>On Mar 4, 2014, at 12:17 AM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
>> Can you try running with --mca coll ^ml and see if things work?
>>
>> Rolf
>>
>>> -----Original Message-----
>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Filippo Spiga
>>> Sent: Monday, March 03, 2014 7:14 PM
>>> To: Open MPI Users
>>> Subject: [OMPI users] 1.7.5rc1, error "COLL-ML ml_discover_hierarchy exited with error."
>>>
>>> Dear Open MPI developers,
>>>
>>> I hit an unexpected error running the OSU osu_alltoall benchmark using Open MPI 1.7.5rc1. Here is the error:
>>>
>>> $ mpirun -np 4 --map-by ppr:1:socket -bind-to core osu_alltoall
>>> In bcol_comm_query hmca_bcol_basesmuma_allocate_sm_ctl_memory failed
>>> In bcol_comm_query hmca_bcol_basesmuma_allocate_sm_ctl_memory failed
>>> [tesla50][[6927,1],1][../../../../../ompi/mca/coll/ml/coll_ml_module.c:2996:mca_coll_ml_comm_query] COLL-ML ml_discover_hierarchy exited with error.
>>> [tesla50:42200] In base_bcol_masesmuma_setup_library_buffers and mpool was not successfully setup!
>>> [tesla50][[6927,1],0][../../../../../ompi/mca/coll/ml/coll_ml_module.c:2996:mca_coll_ml_comm_query] COLL-ML ml_discover_hierarchy exited with error.
>>> [tesla50:42201] In base_bcol_masesmuma_setup_library_buffers and mpool was not successfully setup!
>>> # OSU MPI All-to-All Personalized Exchange Latency Test v4.2
>>> # Size          Avg Latency(us)
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 3 with PID 4508 on node tesla51 exited on signal 11 (Segmentation fault).
>>> --------------------------------------------------------------------------
>>> 2 total processes killed (some possibly by mpirun during cleanup)
>>>
>>> Any idea where this comes from?
>>>
>>> I compiled Open MPI using Intel 12.1, the latest Mellanox stack and CUDA 6.0RC.
>>> Attached are the outputs grabbed from configure, make and run. The configure was:
>>>
>>> export MXM_DIR=/opt/mellanox/mxm
>>> export KNEM_DIR=$(find /opt -maxdepth 1 -type d -name "knem*" -print0)
>>> export FCA_DIR=/opt/mellanox/fca
>>> export HCOLL_DIR=/opt/mellanox/hcoll
>>>
>>> ../configure CC=icc CXX=icpc F77=ifort FC=ifort FFLAGS="-xSSE4.2 -axAVX -ip -O3 -fno-fnalias" FCFLAGS="-xSSE4.2 -axAVX -ip -O3 -fno-fnalias" --prefix=<...> --enable-mpirun-prefix-by-default --with-fca=$FCA_DIR --with-mxm=$MXM_DIR --with-knem=$KNEM_DIR --with-cuda=$CUDA_INSTALL_PATH --enable-mpi-thread-multiple --with-hwloc=internal --with-verbs 2>&1 | tee config.out
>>>
>>> Thanks in advance,
>>> Regards
>>>
>>> Filippo
>>>
>>> --
>>> Mr. Filippo SPIGA, M.Sc.
>>> http://www.linkedin.com/in/filippospiga ~ skype: filippo.spiga
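For what it's worth, ompi_info can show whether the ml component was built into the install at all, and the coll framework's verbose parameter should show which components end up being selected at startup. A rough sketch (exact output format varies between versions, so treat these as illustrative rather than exact):

$ ompi_info | grep "MCA coll"    # lists the compiled-in coll components, including ml if present
$ mpirun -np 4 --mca coll ^ml --mca coll_base_verbose 10 osu_alltoall    # logs coll component selection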