Thank you. I am using the newest version of HPL. I forgot to mention that I can run HPL with Open MPI 3.0 over InfiniBand. The reason I want to use the old version is that I need to compile a library that only supports an old version of Open MPI, so I am trying to make this tricky setup work. Anyway, thank you for your reply, Jeff. Have a good day.

Kaiming Ouyang, Research Assistant.
Department of Computer Science and Engineering
University of California, Riverside
900 University Avenue, Riverside, CA 92521
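For that kind of side-by-side setup, a minimal sketch is to keep 1.2.9 entirely in its own prefix and point only the library's build at it. The /root/research/lib/openmpi-1.2.9 prefix below is the one visible in the backtrace later in the thread; the tarball name, -j4, and the two sanity checks at the end are assumptions, not something taken from the thread:

    # Build and install the old Open MPI into a private prefix.
    tar xf openmpi-1.2.9.tar.gz
    cd openmpi-1.2.9
    ./configure --prefix=/root/research/lib/openmpi-1.2.9
    make -j4 && make install

    # Put the old wrappers first only in the shell used to build the library,
    # so the newer Open MPI used for everything else stays untouched.
    export PATH=/root/research/lib/openmpi-1.2.9/bin:$PATH
    export LD_LIBRARY_PATH=/root/research/lib/openmpi-1.2.9/lib:$LD_LIBRARY_PATH
    which mpicc                      # should point into .../openmpi-1.2.9/bin
    ompi_info | grep "Open MPI:"     # should report version 1.2.9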
On Mon, Mar 19, 2018 at 8:39 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> I'm sorry; I can't help debug a version from 9 years ago. The best suggestion I have is to use a modern version of Open MPI.
>
> Note, however, your use of "--mca btl ..." is going to have the same meaning for all versions of Open MPI. The problem you showed in the first mail was with the shared memory transport. Using "--mca btl tcp,self" means you're not using the shared memory transport. If you don't specify "--mca btl tcp,self", Open MPI will automatically use the shared memory transport. Hence, you could be running into the same (or similar/related) problem that you mentioned in the first mail -- i.e., something is going wrong with how the v1.2.9 shared memory transport is interacting with your system.
>
> Likewise, "--mca btl_tcp_if_include ib0" tells the TCP BTL plugin to use the "ib0" network. But if you have the openib BTL available (i.e., the IB-native plugin), that will be used instead of the TCP BTL because native verbs over IB performs much better than TCP over IB. Meaning: if you specify btl_tcp_if_include without specifying "--mca btl tcp,self", then (assuming openib is available) the TCP BTL likely isn't used and the btl_tcp_if_include value is therefore ignored.
>
> Also, what version of Linpack are you using? The error you show is usually indicative of an MPI application bug (the MPI_COMM_SPLIT error). If you're running an old version of xhpl, you should upgrade to the latest.
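As a concrete illustration of the point above, the commands below show one way to check which BTL plugins a given build actually contains and to name the transports explicitly instead of relying on the defaults. This is only a sketch: the openib/sm/tcp/self component names are the standard Open MPI 1.x ones, and the host file and process count are reused from the commands quoted later in the thread:

    # Does this 1.2.9 build even contain the IB-native (openib) BTL?
    ompi_info | grep btl

    # Use native verbs plus shared memory plus self, skipping TCP entirely:
    mpirun --mca btl openib,sm,self -np 48 --hostfile /root/research/hostfile-ib ./xhpl

    # Or keep the default selection but exclude only the shared-memory BTL,
    # the transport identified above as the likely culprit:
    mpirun --mca btl ^sm -np 48 --hostfile /root/research/hostfile-ib ./xhpl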
> > On Mar 19, 2018, at 9:59 PM, Kaiming Ouyang <kouya...@ucr.edu> wrote:
> >
> > Hi Jeff,
> > Thank you for your reply. I just switched to another cluster which does not have InfiniBand. I ran HPL with:
> >
> > mpirun --mca btl tcp,self -np 144 --hostfile /root/research/hostfile ./xhpl
> >
> > It ran successfully, but if I delete "--mca btl tcp,self", it cannot run anymore. So I suspect Open MPI 1.2 cannot identify the proper network interfaces and set the correct parameters for them.
> > Then I went back to the previous cluster with InfiniBand and typed the same command as above. It gets stuck forever.
> >
> > I changed the command to:
> >
> > mpirun --mca btl_tcp_if_include ib0 --hostfile /root/research/hostfile-ib -np 48 ./xhpl
> >
> > It launches successfully, but gives the following errors when HPL tries to split the communicator:
> >
> > [node1.novalocal:09562] *** An error occurred in MPI_Comm_split
> > [node1.novalocal:09562] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
> > [node1.novalocal:09562] *** MPI_ERR_IN_STATUS: error code in status
> > [node1.novalocal:09562] *** MPI_ERRORS_ARE_FATAL (goodbye)
> > [node1.novalocal:09583] *** An error occurred in MPI_Comm_split
> > [node1.novalocal:09583] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
> > [node1.novalocal:09583] *** MPI_ERR_IN_STATUS: error code in status
> > [node1.novalocal:09583] *** MPI_ERRORS_ARE_FATAL (goodbye)
> > [node1.novalocal:09637] *** An error occurred in MPI_Comm_split
> > [node1.novalocal:09637] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
> > [node1.novalocal:09637] *** MPI_ERR_IN_STATUS: error code in status
> > [node1.novalocal:09637] *** MPI_ERRORS_ARE_FATAL (goodbye)
> > [node1.novalocal:09994] *** An error occurred in MPI_Comm_split
> > [node1.novalocal:09994] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
> > [node1.novalocal:09994] *** MPI_ERR_IN_STATUS: error code in status
> > [node1.novalocal:09994] *** MPI_ERRORS_ARE_FATAL (goodbye)
> > mpirun noticed that job rank 0 with PID 46005 on node test-ib exited on signal 15 (Terminated).
> >
> > Hope you can give me some suggestions. Thank you.
> >
> > Kaiming Ouyang, Research Assistant.
> > Department of Computer Science and Engineering
> > University of California, Riverside
> > 900 University Avenue, Riverside, CA 92521
> >
> > On Mon, Mar 19, 2018 at 7:35 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> > That's actually failing in a shared memory section of the code.
> >
> > But to answer your question, yes, Open MPI 1.2 did have IB support.
> >
> > That being said, I have no idea what would cause this shared memory segv -- it's quite possible that it's simple bit rot (i.e., v1.2.9 was released 9 years ago -- see https://www.open-mpi.org/software/ompi/versions/timeline.php. Perhaps it does not function correctly on modern glibc/Linux kernel-based platforms).
> >
> > Can you upgrade to a [much] newer Open MPI?
> >
> > > On Mar 19, 2018, at 8:29 PM, Kaiming Ouyang <kouya...@ucr.edu> wrote:
> > >
> > > Hi everyone,
> > > I recently needed to compile the High-Performance Linpack (HPL) code with Open MPI 1.2 (a fairly old version). When I finished compiling it and tried to run, I got the following errors:
> > >
> > > [test:32058] *** Process received signal ***
> > > [test:32058] Signal: Segmentation fault (11)
> > > [test:32058] Signal code: Address not mapped (1)
> > > [test:32058] Failing at address: 0x14a2b84b6304
> > > [test:32058] [ 0] /lib64/libpthread.so.0(+0xf5e0) [0x14eb116295e0]
> > > [test:32058] [ 1] /root/research/lib/openmpi-1.2.9/lib/openmpi/mca_btl_sm.so(mca_btl_sm_component_progress+0x28a) [0x14eaa81258aa]
> > > [test:32058] [ 2] /root/research/lib/openmpi-1.2.9/lib/openmpi/mca_bml_r2.so(mca_bml_r2_progress+0x2b) [0x14eaa853219b]
> > > [test:32058] [ 3] /root/research/lib/openmpi-1.2.9/lib/libopen-pal.so.0(opal_progress+0x4a) [0x14eb128dbaaa]
> > > [test:32058] [ 4] /root/research/lib/openmpi-1.2.9/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_wait+0x1d) [0x14eaf41e6b4d]
> > > [test:32058] [ 5] /root/research/lib/openmpi-1.2.9/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_recv+0x3a5) [0x14eaf41eac45]
> > > [test:32058] [ 6] /root/research/lib/openmpi-1.2.9/lib/libopen-rte.so.0(mca_oob_recv_packed+0x33) [0x14eb12b62223]
> > > [test:32058] [ 7] /root/research/lib/openmpi-1.2.9/lib/openmpi/mca_gpr_proxy.so(orte_gpr_proxy_put+0x1f9) [0x14eaf3dd7db9]
> > > [test:32058] [ 8] /root/research/lib/openmpi-1.2.9/lib/libopen-rte.so.0(orte_smr_base_set_proc_state+0x31d) [0x14eb12b7893d]
> > > [test:32058] [ 9] /root/research/lib/openmpi-1.2.9/lib/libmpi.so.0(ompi_mpi_init+0x8d6) [0x14eb13202136]
> > > [test:32058] [10] /root/research/lib/openmpi-1.2.9/lib/libmpi.so.0(MPI_Init+0x6a) [0x14eb1322461a]
> > > [test:32058] [11] ./xhpl(main+0x5d) [0x404e7d]
> > > [test:32058] [12] /lib64/libc.so.6(__libc_start_main+0xf5) [0x14eb11278c05]
> > > [test:32058] [13] ./xhpl() [0x4056cb]
> > > [test:32058] *** End of error message ***
> > > mpirun noticed that job rank 0 with PID 31481 on node test.novalocal exited on signal 15 (Terminated).
> > > 23 additional processes aborted (not shown)
> > >
> > > The machine has InfiniBand, so I wonder whether Open MPI 1.2 supports InfiniBand by default. I also tried running it without InfiniBand, but then the program can only handle small input sizes; when I increase the input size and grid size, it just gets stuck. The program I am running is a benchmark, so I don't think the problem is in the code. Any ideas? Thanks.
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
>
> --
> Jeff Squyres
> jsquy...@cisco.com
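Since the MPI_Comm_split failure could be either an application-level problem (as suggested above) or another symptom of the 1.2.9 runtime misbehaving, one way to tell the two apart is to run a trivial MPI program over the same transports and host file. The sketch below assumes the examples/ programs shipped in the Open MPI source tarball are still available; everything else is reused from the commands above:

    # ring_c.c only calls MPI_Init, passes a message around a ring, and calls MPI_Finalize.
    mpicc examples/ring_c.c -o ring
    mpirun -np 48 --hostfile /root/research/hostfile-ib ./ring

    # If the trivial program also hangs or crashes with the default BTLs but
    # runs with "--mca btl tcp,self", the problem is in the old runtime, not in HPL.
    mpirun --mca btl tcp,self -np 48 --hostfile /root/research/hostfile-ib ./ring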
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users