From where did you grab this HCOLL lib? MOFED or HPCX? What version?

On Wed, Aug 12, 2015 at 9:47 AM, David Shrader <dshra...@lanl.gov> wrote:
> Hey Devendar,
>
> It looks like I still get the error:
>
> [dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
> App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
> [1439397957.351764] [zo-fe1:14678:0] shm.c:65  MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
> [1439397957.352704] [zo-fe1:14677:0] shm.c:65  MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
> [zo-fe1:14677:0] Caught signal 11 (Segmentation fault)
> [zo-fe1:14678:0] Caught signal 11 (Segmentation fault)
> ==== backtrace ====
>  2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
>  3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
>  4 0x00000000000326a0 killpg() ??:0
>  5 0x00000000000b82cb base_bcol_basesmuma_setup_library_buffers() ??:0
>  6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query() ??:0
>  7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery() coll_ml_module.c:0
>  8 0x000000000002fda2 hmca_coll_ml_comm_query() ??:0
> 10 0x00000000000f9706 mca_coll_hcoll_comm_query() ??:0
>  9 0x000000000006ace9 hcoll_create_context() ??:0
> 11 0x00000000000f684e mca_coll_base_comm_select() ??:0
> 12 0x0000000000073fc4 ompi_mpi_init() ??:0
> 13 0x0000000000092ea0 PMPI_Init() ??:0
> 14 0x00000000004009b6 main() ??:0
> 15 0x000000000001ed5d __libc_start_main() ??:0
> 16 0x00000000004008c9 _start() ??:0
> ===================
> ==== backtrace ====
>  2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
>  3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
>  4 0x00000000000326a0 killpg() ??:0
>  5 0x00000000000b82cb base_bcol_basesmuma_setup_library_buffers() ??:0
>  6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query() ??:0
>  7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery() coll_ml_module.c:0
>  8 0x000000000002fda2 hmca_coll_ml_comm_query() ??:0
>  9 0x000000000006ace9 hcoll_create_context() ??:0
> 10 0x00000000000f9706 mca_coll_hcoll_comm_query() ??:0
> 11 0x00000000000f684e mca_coll_base_comm_select() ??:0
> 12 0x0000000000073fc4 ompi_mpi_init() ??:0
> 13 0x0000000000092ea0 PMPI_Init() ??:0
> 14 0x00000000004009b6 main() ??:0
> 15 0x000000000001ed5d __libc_start_main() ??:0
> 16 0x00000000004008c9 _start() ??:0
> ===================
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 14678 on node zo-fe1 exited on
> signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> Thanks,
> David
>
> On 08/12/2015 10:42 AM, Deva wrote:
>
> Hi David,
>
> This issue is from the hcoll library. It could be caused by a symbol conflict with the ml module, which was fixed recently in HCOLL. Can you try with "-mca coll ^ml" and see if this workaround works in your setup?
>
> -Devendar
>
> On Wed, Aug 12, 2015 at 9:30 AM, David Shrader <dshra...@lanl.gov> wrote:
>
>> Hello Gilles,
>>
>> Thank you very much for the patch! It is much more complete than mine. Using that patch and re-running autogen.pl, I am able to build 1.8.8 with './configure --with-hcoll' without errors.
>>
>> I do have issues when it comes to running 1.8.8 with hcoll built in, however.
>> In my quick sanity test of running a basic parallel hello world C program, I get the following:
>>
>> [dshrader@zo-fe1 tests]$ mpirun -n 2 ./a.out
>> App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
>> [1439390789.039197] [zo-fe1:31354:0] shm.c:65  MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
>> [1439390789.040265] [zo-fe1:31353:0] shm.c:65  MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
>> [zo-fe1:31353:0] Caught signal 11 (Segmentation fault)
>> [zo-fe1:31354:0] Caught signal 11 (Segmentation fault)
>> ==== backtrace ====
>>  2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
>>  3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
>>  4 0x00000000000326a0 killpg() ??:0
>>  5 0x00000000000b91eb base_bcol_basesmuma_setup_library_buffers() ??:0
>>  6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query() ??:0
>>  7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery() coll_ml_module.c:0
>>  8 0x000000000002fda2 hmca_coll_ml_comm_query() ??:0
>>  9 0x000000000006ace9 hcoll_create_context() ??:0
>> 10 0x00000000000fa626 mca_coll_hcoll_comm_query() ??:0
>> 11 0x00000000000f776e mca_coll_base_comm_select() ??:0
>> 12 0x0000000000074ee4 ompi_mpi_init() ??:0
>> 13 0x0000000000093dc0 PMPI_Init() ??:0
>> 14 0x00000000004009b6 main() ??:0
>> 15 0x000000000001ed5d __libc_start_main() ??:0
>> 16 0x00000000004008c9 _start() ??:0
>> ===================
>> ==== backtrace ====
>>  2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
>>  3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
>>  4 0x00000000000326a0 killpg() ??:0
>>  5 0x00000000000b91eb base_bcol_basesmuma_setup_library_buffers() ??:0
>>  6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query() ??:0
>>  7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery() coll_ml_module.c:0
>>  8 0x000000000002fda2 hmca_coll_ml_comm_query() ??:0
>>  9 0x000000000006ace9 hcoll_create_context() ??:0
>> 10 0x00000000000fa626 mca_coll_hcoll_comm_query() ??:0
>> 11 0x00000000000f776e mca_coll_base_comm_select() ??:0
>> 12 0x0000000000074ee4 ompi_mpi_init() ??:0
>> 13 0x0000000000093dc0 PMPI_Init() ??:0
>> 14 0x00000000004009b6 main() ??:0
>> 15 0x000000000001ed5d __libc_start_main() ??:0
>> 16 0x00000000004008c9 _start() ??:0
>> ===================
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 0 with PID 31353 on node zo-fe1 exited on
>> signal 11 (Segmentation fault).
>> --------------------------------------------------------------------------
>>
>> I do not get this message with only 1 process.
>>
>> I am using hcoll 3.2.748. Could this be an issue with hcoll itself or something with my ompi build?
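A quick way to narrow down the question above (hcoll itself vs. the Open MPI build) is to exclude the hcoll component entirely and rerun the same test. This is a diagnostic sketch using Open MPI's standard per-user MCA parameter file, not a permanent fix:

```conf
# $HOME/.openmpi/mca-params.conf -- diagnostic only
# Exclude the hcoll collective component; if the segfault disappears,
# the crash is inside the hcoll library rather than the Open MPI build.
coll = ^hcoll
```

The same exclusion can be given per-run as `mpirun -mca coll ^hcoll ./a.out`, mirroring the `-mca coll ^ml` workaround tried earlier in the thread.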
>>
>> Thanks,
>> David
>>
>> On 08/12/2015 12:26 AM, Gilles Gouaillardet wrote:
>>
>> Thanks David,
>>
>> I made a PR for the v1.8 branch at https://github.com/open-mpi/ompi-release/pull/492
>>
>> The patch is attached (it required some back-porting).
>>
>> Cheers,
>>
>> Gilles
>>
>> On 8/12/2015 4:01 AM, David Shrader wrote:
>>
>> I have cloned Gilles' topic/hcoll_config branch and, after running autogen.pl, have found that './configure --with-hcoll' does indeed work now. I used Gilles' branch as I wasn't sure how best to get the pull request changes into my own clone of master. It looks like the proper checks are happening, too:
>>
>> --- MCA component coll:hcoll (m4 configuration macro)
>> checking for MCA component coll:hcoll compile mode... dso
>> checking --with-hcoll value... simple ok (unspecified)
>> checking hcoll/api/hcoll_api.h usability... yes
>> checking hcoll/api/hcoll_api.h presence... yes
>> checking for hcoll/api/hcoll_api.h... yes
>> looking for library without search path
>> checking for library containing hcoll_get_version... -lhcoll
>> checking if MCA component coll:hcoll can compile... yes
>>
>> I haven't checked whether or not Open MPI builds successfully as I don't have much experience running off of the latest source. For now, I think I will try to generate a patch to the 1.8.8 configure script and see if that works as expected.
>>
>> Thanks,
>> David
>>
>> On 08/11/2015 06:34 AM, Jeff Squyres (jsquyres) wrote:
>>
>> On Aug 11, 2015, at 1:39 AM, Åke Sandgren <ake.sandg...@hpc2n.umu.se> wrote:
>>
>> Please fix the hcoll test (and code) to be correct. Any configure test that adds /usr/lib and/or /usr/include to any compile flags is broken.
>>
>> +1
>>
>> Gilles filed https://github.com/open-mpi/ompi/pull/796; I just added some comments to it.
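Åke's objection can be made concrete: when hcoll lives in a non-default prefix, passing that prefix explicitly lets configure derive the include and library paths itself instead of falling back to bare `/usr/include` and `/usr/lib` flags. A build-configuration sketch; the install path below is hypothetical (substitute wherever your HPC-X or MOFED package actually placed hcoll):

```conf
# Hypothetical hcoll prefix; configure then picks up <prefix>/include
# and <prefix>/lib rather than touching /usr/include or /usr/lib.
./configure --with-hcoll=/opt/mellanox/hcoll
```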
>> >> >> >> -- >> David Shrader >> HPC-3 High Performance Computer Systems >> Los Alamos National Lab >> Email: dshrader <at> lanl.gov >> >> >> >> _______________________________________________ >> users mailing listus...@open-mpi.org >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >> Link to this post: >> http://www.open-mpi.org/community/lists/users/2015/08/27432.php >> >> >> >> >> _______________________________________________ >> users mailing listus...@open-mpi.org >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >> >> Link to this post: >> http://www.open-mpi.org/community/lists/users/2015/08/27434.php >> >> >> -- >> David Shrader >> HPC-3 High Performance Computer Systems >> Los Alamos National Lab >> Email: dshrader <at> lanl.gov >> >> >> _______________________________________________ >> users mailing list >> us...@open-mpi.org >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >> Link to this post: >> http://www.open-mpi.org/community/lists/users/2015/08/27438.php >> > > > > -- > > > -Devendar > > > _______________________________________________ > users mailing listus...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > > Link to this post: > http://www.open-mpi.org/community/lists/users/2015/08/27439.php > > > -- > David Shrader > HPC-3 High Performance Computer Systems > Los Alamos National Lab > Email: dshrader <at> lanl.gov > > > _______________________________________________ > users mailing list > us...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > Link to this post: > http://www.open-mpi.org/community/lists/users/2015/08/27440.php > -- -Devendar